AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
Main author: | Afonso, Alexandre Ribeiro |
---|---|
Publication date: | 2014 |
Other authors: | Duque, Cláudio Gottschalg |
Document type: | Article |
Language: | eng |
Source title: | Journal of Information Systems and Technology Management (Online) |
Full text: | https://www.revistas.usp.br/jistem/article/view/84679 |
Abstract: | This article reports the findings of an empirical study of automated text clustering applied to scientific articles and newspaper texts in Brazilian Portuguese. The objective was to find the most effective computational method for clustering the input texts into their original groups. The study covered four experiments, each with four procedures: 1. Corpus Selection (a set of texts is selected for clustering); 2. Word Class Selection (nouns, verbs and adjectives are chosen from each text by specific algorithms); 3. Filtering Algorithms (a set of terms is selected from the results of the previous stage, a semantic weight is assigned to each term, and an index is generated for each text); 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After these procedures, statistics on clustering correctness and clustering time were collected. The sIB clustering algorithm is the best choice for both the scientific and the newspaper corpus, provided that the number of clusters is supplied as input before it runs (68.9% correctness in 1 minute for the newspaper corpus and 77.8% correctness in 1 minute for the scientific corpus). The EM clustering algorithm can additionally estimate the number of clusters without user intervention, but its best case is below 53% correctness. In the experiments carried out, the results of automated clustering remain far from those of human text classification; clustering correctness was also observed to vary with the number of input texts and their topics. |
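
The abstract reports "clustering correctness" against the original groups of the texts, but this record does not say how that figure was computed. As a minimal sketch, assuming correctness is measured as majority-label agreement (often called purity), the calculation could look like the following. The function name `clustering_correctness` and the sample labels are hypothetical and are not taken from the study.

```python
from collections import Counter, defaultdict


def clustering_correctness(true_groups, cluster_ids):
    """Return the share of texts whose cluster's majority group is their own.

    Assumption: each cluster is credited with its most frequent original
    group (purity). The study's own measure may differ.
    """
    if len(true_groups) != len(cluster_ids):
        raise ValueError("true_groups and cluster_ids must have equal length")
    if not true_groups:
        raise ValueError("at least one text is required")

    # Collect the original group of every text, keyed by assigned cluster.
    members = defaultdict(list)
    for group, cluster in zip(true_groups, cluster_ids):
        members[cluster].append(group)

    # For each cluster, count the texts that belong to its majority group.
    matched = sum(
        Counter(groups).most_common(1)[0][1] for groups in members.values()
    )
    return matched / len(true_groups)


# Hypothetical example: six texts, three original groups, two clusters.
example_groups = ["sports", "sports", "sports", "politics", "politics", "science"]
example_clusters = [0, 0, 1, 1, 1, 1]
example_score = clustering_correctness(example_groups, example_clusters)
```

Under this definition a clustering that reproduces the original groups exactly scores 1.0. Purity tends to rise as the number of clusters grows, so scores are best compared between runs that use the same number of clusters.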
id |
USP-33_760cf657c4d81465e5ad89245dd7ff16 |
---|---|
oai_identifier_str |
oai:revistas.usp.br:article/84679 |
network_acronym_str |
USP-33 |
network_name_str |
Journal of Information Systems and Technology Management (Online) |
repository_id_str |
|
author |
Afonso, Alexandre Ribeiro |
author_facet |
Afonso, Alexandre Ribeiro; Duque, Cláudio Gottschalg |
author_role |
author |
author2 |
Duque, Cláudio Gottschalg |
author2_role |
author |
dc.subject.por.fl_str_mv |
Text Mining; Text Clustering; Natural Language Processing; Brazilian Portuguese; Effectiveness |
publishDate |
2014 |
dc.date.none.fl_str_mv |
2014-08-21 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
url |
https://www.revistas.usp.br/jistem/article/view/84679 |
identifier_str_mv |
10.4301/S1807-17752014000200011 |
language |
eng |
dc.relation.none.fl_str_mv |
https://www.revistas.usp.br/jistem/article/view/84679/87393 |
rights_invalid_str_mv |
Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online) |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária |
dc.source.none.fl_str_mv |
Journal of Information Systems and Technology Management; Vol. 11 No. 2 (2014); 415-436; ISSN 1807-1775 |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Journal of Information Systems and Technology Management (Online) |
collection |
Journal of Information Systems and Technology Management (Online) |
repository.name.fl_str_mv |
Journal of Information Systems and Technology Management (Online) - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
jistem@usp.br |
_version_ |
1809284036693065728 |