TextCL: a Python package for NLP preprocessing tasks

Bibliographic details
Main author: Petukhova, Alina
Publication date: 2022
Other authors: Fachada, Nuno
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10437/12937
Abstract: Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The TextCL package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module allows selecting one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and applying it to the text data.
Keywords: Natural language processing; Text filtering; Outlier detection
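The processing steps named in the abstract can be illustrated with a minimal Python sketch. This is not TextCL's API: the function names split_sentences, drop_duplicates and outlier_scores are hypothetical, the sentence splitter is a naive regular expression, and the outlier score uses a rank-1 truncated SVD (computed by power iteration on a bag-of-words matrix) as a simple stand-in for the SVD option mentioned above.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def drop_duplicates(sentences):
    """Keep the first occurrence of each sentence, ignoring case and spacing."""
    seen, unique = set(), []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sentence)
    return unique


def outlier_scores(texts, iterations=50):
    """Residual of each text after a rank-1 approximation of the term matrix.

    Hypothetical helper: a larger score means the text is less well
    explained by the dominant topic direction of the collection.
    """
    tokens = [re.findall(r"[a-z]+", t.lower()) for t in texts]
    vocab = sorted({w for doc in tokens for w in doc})
    index = {w: j for j, w in enumerate(vocab)}

    # Build length-normalised bag-of-words rows (documents x terms).
    rows = []
    for doc in tokens:
        row = [0.0] * len(vocab)
        for word, count in Counter(doc).items():
            row[index[word]] = float(count)
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        rows.append([x / norm for x in row])

    # Power iteration for the leading right singular vector v.
    v = [1.0] * len(vocab)
    for _ in range(iterations):
        proj = [sum(r[j] * v[j] for j in range(len(v))) for r in rows]
        v = [sum(proj[i] * rows[i][j] for i in range(len(rows)))
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        v = [x / norm for x in v]

    # Residual norm of each row once its component along v is removed.
    scores = []
    for r in rows:
        p = sum(r[j] * v[j] for j in range(len(v)))
        scores.append(math.sqrt(sum((r[j] - p * v[j]) ** 2
                                    for j in range(len(v)))))
    return scores
```

In use, a text whose vocabulary is unrelated to the rest of the collection receives the highest residual, so thresholding the scores gives a crude outlier filter. A real pipeline would use a proper sentence tokenizer and a higher-rank decomposition.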
id RCAP_81333a507c3e4e7b1de38bcc2222b056
oai_identifier_str oai:recil.ensinolusofona.pt:10437/12937
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv TextCL: a Python package for NLP preprocessing tasks
dc.contributor.author.fl_str_mv Petukhova, Alina
Fachada, Nuno
dc.subject.por.fl_str_mv INFORMÁTICA
PROCESSAMENTO DE DADOS
PROCESSAMENTO DE TEXTO
LINGUAGEM NATURAL
LINGUAGEM PYTHON
COMPUTER SCIENCE
DATA PROCESSING
WORD PROCESSING
NATURAL LANGUAGE
PYTHON PROGRAMMING LANGUAGE
publishDate 2022
dc.date.none.fl_str_mv 2022-06-17T09:31:01Z
2022-07-01T00:00:00Z
2022-07-01
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10437/12937
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 2352-7110
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Elsevier
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP