AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS

Detalhes bibliográficos
Autor(a) principal: Afonso, Alexandre Ribeiro
Data de Publicação: 2014
Outros Autores: Duque, Cláudio Gottschalg
Tipo de documento: Artigo
Idioma: eng
Título da fonte: Journal of Information Systems and Technology Management (Online)
Texto Completo: https://www.revistas.usp.br/jistem/article/view/84679
Resumo: This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.
id USP-33_760cf657c4d81465e5ad89245dd7ff16
oai_identifier_str oai:revistas.usp.br:article/84679
network_acronym_str USP-33
network_name_str Journal of Information Systems and Technology Management (Online)
repository_id_str
spelling AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODSText MiningText ClusteringNatural Language ProcessingBrazilian PortugueseEffectiveness.This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária2014-08-21info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionapplication/pdfhttps://www.revistas.usp.br/jistem/article/view/8467910.4301/10.4301%2FS1807-17752014000200011Journal of Information Systems and Technology Management; v. 11 n. 2 (2014); 415-436Journal of Information Systems and Technology Management; Vol. 11 No. 2 (2014); 415-436Journal of Information Systems and Technology Management; Vol. 11 Núm. 2 (2014); 415-4361807-1775reponame:Journal of Information Systems and Technology Management (Online)instname:Universidade de São Paulo (USP)instacron:USPenghttps://www.revistas.usp.br/jistem/article/view/84679/87393Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online)info:eu-repo/semantics/openAccessAfonso, Alexandre RibeiroDuque, Cláudio Gottschalg2014-09-16T13:25:36Zoai:revistas.usp.br:article/84679Revistahttp://www.scielo.br/scielo.php?script=sci_serial&pid=1807-1775&lng=pt&nrm=isoPUBhttps://old.scielo.br/oai/scielo-oai.php||jistem@usp.br1807-17751807-1775opendoar:2014-09-16T13:25:36Journal of Information Systems and Technology Management (Online) - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
spellingShingle AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
Afonso, Alexandre Ribeiro
Text Mining
Text Clustering
Natural Language Processing
Brazilian Portuguese
Effectiveness.
title_short AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_full AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_fullStr AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_full_unstemmed AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_sort AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
author Afonso, Alexandre Ribeiro
author_facet Afonso, Alexandre Ribeiro
Duque, Cláudio Gottschalg
author_role author
author2 Duque, Cláudio Gottschalg
author2_role author
dc.contributor.author.fl_str_mv Afonso, Alexandre Ribeiro
Duque, Cláudio Gottschalg
dc.subject.por.fl_str_mv Text Mining
Text Clustering
Natural Language Processing
Brazilian Portuguese
Effectiveness.
topic Text Mining
Text Clustering
Natural Language Processing
Brazilian Portuguese
Effectiveness.
description This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.
publishDate 2014
dc.date.none.fl_str_mv 2014-08-21
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.revistas.usp.br/jistem/article/view/84679
10.4301/10.4301%2FS1807-17752014000200011
url https://www.revistas.usp.br/jistem/article/view/84679
identifier_str_mv 10.4301/10.4301%2FS1807-17752014000200011
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv https://www.revistas.usp.br/jistem/article/view/84679/87393
dc.rights.driver.fl_str_mv Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online)
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online)
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária
publisher.none.fl_str_mv TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária
dc.source.none.fl_str_mv Journal of Information Systems and Technology Management; v. 11 n. 2 (2014); 415-436
Journal of Information Systems and Technology Management; Vol. 11 No. 2 (2014); 415-436
Journal of Information Systems and Technology Management; Vol. 11 Núm. 2 (2014); 415-436
1807-1775
reponame:Journal of Information Systems and Technology Management (Online)
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Journal of Information Systems and Technology Management (Online)
collection Journal of Information Systems and Technology Management (Online)
repository.name.fl_str_mv Journal of Information Systems and Technology Management (Online) - Universidade de São Paulo (USP)
repository.mail.fl_str_mv ||jistem@usp.br
_version_ 1809284036693065728