TextCL: a Python package for NLP preprocessing tasks

Petukhova, Alina; Fachada, Nuno

Bibliographic details
Main author:	Petukhova, Alina
Publication date:	2022
Other authors:	Fachada, Nuno
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	http://hdl.handle.net/10437/12937
Abstract:	Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. Another functionality offered by the TextCL package is the outlier detection module, which makes it possible to identify and filter out texts that differ from the main topic distribution of the data set. This method allows selecting one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and applying it to the text data. Keywords: Natural language processing; Text filtering; Outlier detection
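
The SVD-based outlier detection mentioned in the abstract can be illustrated with a minimal sketch. This is not TextCL's implementation or API: the function names `bag_of_words` and `svd_outlier_scores`, the whitespace tokenisation, the rank of 1, and the toy texts are all assumptions chosen for illustration. The idea shown is a common one: approximate the document-term matrix with a truncated SVD and treat texts with a large reconstruction error as candidates for removal.

```python
import numpy as np


def bag_of_words(texts):
    """Build a document-term count matrix (hypothetical helper, whitespace tokens)."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: j for j, tok in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for i, t in enumerate(texts):
        for tok in t.lower().split():
            X[i, index[tok]] += 1.0
    return X


def svd_outlier_scores(X, rank=1):
    """Score each row by its reconstruction error under a truncated SVD.

    Rows that the leading `rank` singular directions explain poorly get
    high scores; this is an illustrative criterion, not TextCL's.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X_low = (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
    return np.linalg.norm(X - X_low, axis=1)


# Toy data (made up): three texts on one topic and one unrelated text.
texts = [
    "the model learns word embeddings from text",
    "the model learns sentence embeddings from text",
    "the model learns document embeddings from text",
    "fresh basil tomato mozzarella salad recipe",
]

scores = svd_outlier_scores(bag_of_words(texts), rank=1)
outlier_index = int(np.argmax(scores))  # the unrelated recipe text scores highest
print(outlier_index, scores.round(3))
```

In practice the rank and the score threshold would need tuning per data set, and a TF-IDF weighting would likely replace the raw counts used here.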

Item metadata

id	RCAP_81333a507c3e4e7b1de38bcc2222b056
oai_identifier_str	oai:recil.ensinolusofona.pt:10437/12937
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
dc.title.none.fl_str_mv	TextCL: a Python package for NLP preprocessing tasks
title	TextCL: a Python package for NLP preprocessing tasks
spellingShingle	TextCL: a Python package for NLP preprocessing tasks Petukhova, Alina INFORMÁTICA PROCESSAMENTO DE DADOS PROCESSAMENTO DE TEXTO LINGUAGEM NATURAL LINGUAGEM PYTHON COMPUTER SCIENCE DATA PROCESSING WORD PROCESSING NATURAL LANGUAGE PYTHON PROGRAMMING LANGUAGE
title_short	TextCL: a Python package for NLP preprocessing tasks
title_full	TextCL: a Python package for NLP preprocessing tasks
title_fullStr	TextCL: a Python package for NLP preprocessing tasks
title_full_unstemmed	TextCL: a Python package for NLP preprocessing tasks
title_sort	TextCL: a Python package for NLP preprocessing tasks
author	Petukhova, Alina
author_facet	Petukhova, Alina Fachada, Nuno
author_role	author
author2	Fachada, Nuno
author2_role	author
dc.contributor.author.fl_str_mv	Petukhova, Alina Fachada, Nuno
dc.subject.por.fl_str_mv	INFORMÁTICA; PROCESSAMENTO DE DADOS; PROCESSAMENTO DE TEXTO; LINGUAGEM NATURAL; LINGUAGEM PYTHON; COMPUTER SCIENCE; DATA PROCESSING; WORD PROCESSING; NATURAL LANGUAGE; PYTHON PROGRAMMING LANGUAGE
topic	INFORMÁTICA; PROCESSAMENTO DE DADOS; PROCESSAMENTO DE TEXTO; LINGUAGEM NATURAL; LINGUAGEM PYTHON; COMPUTER SCIENCE; DATA PROCESSING; WORD PROCESSING; NATURAL LANGUAGE; PYTHON PROGRAMMING LANGUAGE
description	Preprocessing text data sets for use in Natural Language Processing tasks is usually a time-consuming and expensive effort. Text data, normally obtained from sources such as, but not limited to, web scraping, scanned documents or PDF files, is typically unstructured and prone to artifacts and other types of noise. The goal of the TextCL package is to simplify this process by providing multiple methods suited for text data preprocessing. It includes functionality for splitting texts into sentences, filtering sentences by language, perplexity filtering, and removing duplicate sentences. Another functionality offered by the TextCL package is the outlier detection module, which makes it possible to identify and filter out texts that differ from the main topic distribution of the data set. This method allows selecting one of several unsupervised outlier detection algorithms, such as TONMF (block coordinate descent framework), RPCA (robust principal component analysis), or SVD (singular value decomposition), and applying it to the text data. Keywords: Natural language processing; Text filtering; Outlier detection
publishDate	2022
dc.date.none.fl_str_mv	2022-06-17T09:31:01Z 2022-07-01T00:00:00Z 2022-07-01
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10437/12937
url	http://hdl.handle.net/10437/12937
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	2352-7110
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Elsevier
publisher.none.fl_str_mv	Elsevier
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799131213910245377
