AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS

Afonso, Alexandre Ribeiro; Duque, Cláudio Gottschalg

AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS

Detalhes bibliográficos
Autor(a) principal:	Afonso, Alexandre Ribeiro
Data de Publicação:	2014
Outros Autores:	Duque, Cláudio Gottschalg
Tipo de documento:	Artigo
Idioma:	eng
Título da fonte:	Journal of Information Systems and Technology Management (Online)
Texto Completo:	https://www.revistas.usp.br/jistem/article/view/84679
Resumo:	This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.

Metadados do item

id	USP-33_760cf657c4d81465e5ad89245dd7ff16
oai_identifier_str	oai:revistas.usp.br:article/84679
network_acronym_str	USP-33
network_name_str	Journal of Information Systems and Technology Management (Online)
repository_id_str
spelling	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODSText MiningText ClusteringNatural Language ProcessingBrazilian PortugueseEffectiveness.This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária2014-08-21info:eu-repo/semantics/articleinfo:eu-repo/semantics/publishedVersionapplication/pdfhttps://www.revistas.usp.br/jistem/article/view/8467910.4301/10.4301%2FS1807-17752014000200011Journal of Information Systems and Technology Management; v. 11 n. 2 (2014); 415-436Journal of Information Systems and Technology Management; Vol. 11 No. 2 (2014); 415-436Journal of Information Systems and Technology Management; Vol. 11 Núm. 2 (2014); 415-4361807-1775reponame:Journal of Information Systems and Technology Management (Online)instname:Universidade de São Paulo (USP)instacron:USPenghttps://www.revistas.usp.br/jistem/article/view/84679/87393Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online)info:eu-repo/semantics/openAccessAfonso, Alexandre RibeiroDuque, Cláudio Gottschalg2014-09-16T13:25:36Zoai:revistas.usp.br:article/84679Revistahttp://www.scielo.br/scielo.php?script=sci_serial&pid=1807-1775&lng=pt&nrm=isoPUBhttps://old.scielo.br/oai/scielo-oai.php\|\|jistem@usp.br1807-17751807-1775opendoar:2014-09-16T13:25:36Journal of Information Systems and Technology Management (Online) - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
spellingShingle	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS Afonso, Alexandre Ribeiro Text Mining Text Clustering Natural Language Processing Brazilian Portuguese Effectiveness.
title_short	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_full	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_fullStr	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_full_unstemmed	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
title_sort	AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS
author	Afonso, Alexandre Ribeiro
author_facet	Afonso, Alexandre Ribeiro Duque, Cláudio Gottschalg
author_role	author
author2	Duque, Cláudio Gottschalg
author2_role	author
dc.contributor.author.fl_str_mv	Afonso, Alexandre Ribeiro Duque, Cláudio Gottschalg
dc.subject.por.fl_str_mv	Text Mining Text Clustering Natural Language Processing Brazilian Portuguese Effectiveness.
topic	Text Mining Text Clustering Natural Language Processing Brazilian Portuguese Effectiveness.
description	This article reports the findings of an empirical study about Automated Text Clustering applied to scientific articles and newspaper texts in Brazilian Portuguese, the objective was to find the most effective computational method able to cluster the input of texts in their original groups. The study covered four experiments, each experiment had four procedures: 1. Corpus Selections (a set of texts is selected for clustering), 2. Word Class Selections (Nouns, Verbs and Adjectives are chosen from each text by using specific algorithms), 3. Filtering Algorithms (a set of terms is selected from the results of the preview stage, a semantic weight is also inserted for each term and an index is generated for each text), 4. Clustering Algorithms (the clustering algorithms Simple K-Means, sIB and EM are applied to the indexes). After those procedures, clustering correctness and clustering time statistical results were collected. The sIB clustering algorithm is the best choice for both scientific and newspaper corpus, under the condition that the sIB clustering algorithm asks for the number of clusters as input before running (for the newspaper corpus, 68.9% correctness in 1 minute and for the scientific corpus, 77.8% correctness in 1 minute). The EM clustering algorithm additionally guesses the number of clusters without user intervention, but its best case is less than 53% correctness. Considering the experiments carried out, the results of human text classification and automated clustering are distant; it was also observed that the clustering correctness results vary according to the number of input texts and their topics.
publishDate	2014
dc.date.none.fl_str_mv	2014-08-21
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article info:eu-repo/semantics/publishedVersion
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.revistas.usp.br/jistem/article/view/84679 10.4301/10.4301%2FS1807-17752014000200011
url	https://www.revistas.usp.br/jistem/article/view/84679
identifier_str_mv	10.4301/10.4301%2FS1807-17752014000200011
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	https://www.revistas.usp.br/jistem/article/view/84679/87393
dc.rights.driver.fl_str_mv	Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online) info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2018 JISTEM - Journal of Information Systems and Technology Management (Online)
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária
publisher.none.fl_str_mv	TECSI - FEA - Universidade de São Paulo. Faculdade de Economia, Administração, Contabilidade e Atuária
dc.source.none.fl_str_mv	Journal of Information Systems and Technology Management; v. 11 n. 2 (2014); 415-436 Journal of Information Systems and Technology Management; Vol. 11 No. 2 (2014); 415-436 Journal of Information Systems and Technology Management; Vol. 11 Núm. 2 (2014); 415-436 1807-1775 reponame:Journal of Information Systems and Technology Management (Online) instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Journal of Information Systems and Technology Management (Online)
collection	Journal of Information Systems and Technology Management (Online)
repository.name.fl_str_mv	Journal of Information Systems and Technology Management (Online) - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	\|\|jistem@usp.br
_version_	1824326122834231296

AUTOMATED TEXT CLUSTERING OF NEWSPAPER AND SCIENTIFIC TEXTS IN BRAZILIAN PORTUGUESE: ANALYSIS AND COMPARISON OF METHODS

Registros relacionados