TextCL: a Python package for NLP preprocessing tasks
Main author: | Petukhova, Alina |
---|---|
Publication date: | 2022 |
Other authors: | Fachada, Nuno |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10437/12937 |
Abstract: | Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as web scraping, scanned documents, or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. The TextCL package also offers an outlier detection module, which identifies and filters out texts that differ from the main topic distribution of the data set. This module lets the user select one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and apply it to the text data. Keywords: Natural language processing; Text filtering; Outlier detection |
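Two of the steps named in the abstract, sentence splitting and duplicate-sentence removal, can be illustrated with a minimal sketch. This is not TextCL's API: the function names `split_into_sentences`, `jaccard`, and `drop_near_duplicates`, the regex-based splitting rule, the use of Jaccard similarity over word sets, and the `threshold` value are all assumptions made for illustration only.

```python
import re


def split_into_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two sentences."""
    sa = set(re.findall(r"\w+", a.lower()))
    sb = set(re.findall(r"\w+", b.lower()))
    if not sa and not sb:
        return 1.0  # two empty sentences are treated as identical
    return len(sa & sb) / len(sa | sb)


def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is below the similarity threshold
    against every sentence already kept (hypothetical default of 0.8)."""
    kept = []
    for s in sentences:
        if all(jaccard(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


text = "The cat sat on the mat. The cat sat on the mat! Dogs bark loudly."
sentences = split_into_sentences(text)
unique = drop_near_duplicates(sentences)
print(unique)
```

The pairwise comparison is quadratic in the number of sentences, which is acceptable for a sketch but would likely need hashing or blocking on a large corpus.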
id |
RCAP_81333a507c3e4e7b1de38bcc2222b056 |
---|---|
oai_identifier_str |
oai:recil.ensinolusofona.pt:10437/12937 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
TextCL: a Python package for NLP preprocessing tasks |
title |
TextCL: a Python package for NLP preprocessing tasks |
author |
Petukhova, Alina |
author_facet |
Petukhova, Alina; Fachada, Nuno |
author_role |
author |
author2 |
Fachada, Nuno |
author2_role |
author |
dc.contributor.author.fl_str_mv |
Petukhova, Alina; Fachada, Nuno |
dc.subject.por.fl_str_mv |
INFORMÁTICA; PROCESSAMENTO DE DADOS; PROCESSAMENTO DE TEXTO; LINGUAGEM NATURAL; LINGUAGEM PYTHON; COMPUTER SCIENCE; DATA PROCESSING; WORD PROCESSING; NATURAL LANGUAGE; PYTHON PROGRAMMING LANGUAGE |
topic |
INFORMÁTICA; PROCESSAMENTO DE DADOS; PROCESSAMENTO DE TEXTO; LINGUAGEM NATURAL; LINGUAGEM PYTHON; COMPUTER SCIENCE; DATA PROCESSING; WORD PROCESSING; NATURAL LANGUAGE; PYTHON PROGRAMMING LANGUAGE |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-06-17T09:31:01Z 2022-07-01T00:00:00Z 2022-07-01 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10437/12937 |
url |
http://hdl.handle.net/10437/12937 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
2352-7110 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Elsevier |
publisher.none.fl_str_mv |
Elsevier |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
_version_ |
1799131213910245377 |