Análise de algoritmos de agrupamento para base de dados textuais

Almeida, Luiz Gonzaga Paula de

Bibliographic details
Main author:	Almeida, Luiz Gonzaga Paula de
Publication date:	2007
Document type:	Dissertation
Language:	por
Source title:	Biblioteca Digital de Teses e Dissertações do LNCC
Full text:	https://tede.lncc.br/handle/tede/75
Abstract:	The increasing amount of digitally stored text makes it necessary to develop computational tools that provide efficient and effective access to information and knowledge. This problem is extremely relevant in biomedical research, since most of the knowledge generated is published as scientific articles, which need to be as easy and fast to access as possible. The research field known as Text Mining deals with the problem of identifying new information and knowledge in text databases. One of its tasks is to find groups of correlated texts in databases, a problem known as text clustering. To allow clustering, text databases must be transformed into the commonly used Vector Space Model, in which texts are represented by vectors composed of the frequencies of occurrence of the words and terms present in the databases. The set of vectors forms a matrix, named the document-term matrix, that is usually sparse and high-dimensional. Normally, to attenuate the problems caused by these features, a subset of terms is selected, giving rise to a new document-term matrix with reduced dimensions, which is then used by the clustering algorithms. This work presents two algorithms for term selection and evaluates three clustering algorithms (k-means, spectral, and graph partitioning) on five pre-classified databases. The databases were pre-processed using previously described methods. The results indicate that the implemented term selection algorithms improved the performance of the clustering algorithms used, and that the k-means and spectral algorithms outperformed the graph partitioning algorithm.
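
The Vector Space Model and term selection steps described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy documents, the function names, whitespace tokenization, and the document-frequency threshold `min_df`. In particular, the selection criterion shown here is a simple stand-in and is not one of the two term selection algorithms proposed in the dissertation, which this record does not describe.

```python
from collections import Counter


def build_document_term_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    # Hypothetical tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for term, count in Counter(toks).items():
            row[index[term]] = count
        matrix.append(row)
    return vocabulary, matrix


def select_terms_by_document_frequency(vocabulary, matrix, min_df=2):
    """Keep terms present in at least min_df documents (illustrative criterion)."""
    keep = [
        j for j in range(len(vocabulary))
        if sum(1 for row in matrix if row[j] > 0) >= min_df
    ]
    reduced_vocabulary = [vocabulary[j] for j in keep]
    reduced_matrix = [[row[j] for j in keep] for row in matrix]
    return reduced_vocabulary, reduced_matrix


# Toy corpus, invented for this example.
docs = [
    "gene expression in yeast",
    "gene regulation in yeast cells",
    "protein folding simulation",
    "protein structure simulation methods",
]

vocabulary, matrix = build_document_term_matrix(docs)
reduced_vocabulary, reduced_matrix = select_terms_by_document_frequency(
    vocabulary, matrix
)
print(len(vocabulary), "terms before selection")
print(reduced_vocabulary)
for row in reduced_matrix:
    print(row)
```

The reduced matrix is the kind of input a clustering algorithm such as k-means would then receive. Even in this small corpus most entries of the full matrix are zero, which is the sparsity the abstract refers to.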

Item metadata

id	LNCC_8d9dc72ae5711cb318a779b447d095f7
oai_identifier_str	oai:tede-server.lncc.br:tede/75
network_acronym_str	LNCC
network_name_str	Biblioteca Digital de Teses e Dissertações do LNCC
repository_id_str
spelling	The growing volume of digitally stored texts makes it necessary to build computational tools that allow the organization of, and effective and efficient access to, the information and knowledge they contain. In biomedicine this problem becomes extremely relevant, because most of the knowledge generated is formalized in scientific articles, and access to them needs to be as easy and fast as possible. The research area known as Text Mining sets out to address this problem by seeking to identify new, previously unknown information and knowledge in text databases. One of its tasks is discovering groups of related texts in text databases, a problem known as text clustering. For this purpose, the representation of text databases commonly used in text clustering is the Vector Space Model, in which each text is represented by a feature vector whose features are the frequencies of the words or terms that occur in it. The set of vectors forms a matrix called the document-term matrix, which is sparse and high-dimensional. To mitigate the problems arising from these characteristics, a subset of terms is normally selected, yielding a new document-term matrix with a reduced number of dimensions, which is then used by the clustering algorithms. This work comprises: i) the introduction and implementation of two term selection algorithms and ii) the evaluation of the k-means, spectral, and graph partitioning algorithms on five previously classified text databases. The databases are pre-processed using methods described in the literature, producing the document-term matrices. The results indicate that the proposed selection algorithms, used to reduce the document-term matrices, improve the performance of the clustering algorithms evaluated. The k-means and spectral algorithms outperform the graph partitioning algorithm in clustering text databases, with or without feature selection. Made available in DSpace on 2015-03-04T18:50:55Z (GMT). No. of bitstreams: 1. DissertacaoLuizGonzaga.pdf: 3514446 bytes, checksum: 517d9c7b241b2bd9c799c807d6eac037 (MD5). Previous issue date: 2008-08-31.
dc.title.por.fl_str_mv	Análise de algoritmos de agrupamento para base de dados textuais
dc.title.alternative.eng.fl_str_mv	Analysis of the clustering algorithms for the databases
title	Análise de algoritmos de agrupamento para base de dados textuais
spellingShingle	Análise de algoritmos de agrupamento para base de dados textuais Almeida, Luiz Gonzaga Paula de Análise por agrupamento Seleção de características Mineração de textos Clustering analysis Feature selection Text mining CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::COMPUTABILIDADE E MODELOS DE COMPUTACAO
title_short	Análise de algoritmos de agrupamento para base de dados textuais
title_full	Análise de algoritmos de agrupamento para base de dados textuais
title_fullStr	Análise de algoritmos de agrupamento para base de dados textuais
title_full_unstemmed	Análise de algoritmos de agrupamento para base de dados textuais
title_sort	Análise de algoritmos de agrupamento para base de dados textuais
author	Almeida, Luiz Gonzaga Paula de
author_facet	Almeida, Luiz Gonzaga Paula de
author_role	author
dc.contributor.advisor1.fl_str_mv	Vasconcelos, Ana Tereza Ribeiro
dc.contributor.advisor1ID.fl_str_mv	CPF:81737963787
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/8989199088323836
dc.contributor.advisor-co1.fl_str_mv	Maia, Marco Antonio Grivet Mattoso
dc.contributor.advisor-co1Lattes.fl_str_mv	http://lattes.cnpq.br/2519031892464448
dc.contributor.referee1.fl_str_mv	Baczynski, Jack
dc.contributor.referee1ID.fl_str_mv	CPF:33304165720
dc.contributor.referee1Lattes.fl_str_mv	http://lattes.cnpq.br/2332051647489024
dc.contributor.referee2.fl_str_mv	Carvalho, Alexandre Plastino de
dc.contributor.referee2Lattes.fl_str_mv	http://lattes.cnpq.br/4985266524417261
dc.contributor.authorID.fl_str_mv	CPF:84286121704
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/3708867677533851
dc.contributor.author.fl_str_mv	Almeida, Luiz Gonzaga Paula de
contributor_str_mv	Vasconcelos, Ana Tereza Ribeiro Maia, Marco Antonio Grivet Mattoso Baczynski, Jack Carvalho, Alexandre Plastino de
dc.subject.por.fl_str_mv	Análise por agrupamento Seleção de características Mineração de textos
topic	Análise por agrupamento Seleção de características Mineração de textos Clustering analysis Feature selection Text mining CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::COMPUTABILIDADE E MODELOS DE COMPUTACAO
dc.subject.eng.fl_str_mv	Clustering analysis Feature selection Text mining
dc.subject.cnpq.fl_str_mv	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::TEORIA DA COMPUTACAO::COMPUTABILIDADE E MODELOS DE COMPUTACAO
description	The increasing amount of digitally stored text makes it necessary to develop computational tools that provide efficient and effective access to information and knowledge. This problem is extremely relevant in biomedical research, since most of the knowledge generated is published as scientific articles, which need to be as easy and fast to access as possible. The research field known as Text Mining deals with the problem of identifying new information and knowledge in text databases. One of its tasks is to find groups of correlated texts in databases, a problem known as text clustering. To allow clustering, text databases must be transformed into the commonly used Vector Space Model, in which texts are represented by vectors composed of the frequencies of occurrence of the words and terms present in the databases. The set of vectors forms a matrix, named the document-term matrix, that is usually sparse and high-dimensional. Normally, to attenuate the problems caused by these features, a subset of terms is selected, giving rise to a new document-term matrix with reduced dimensions, which is then used by the clustering algorithms. This work presents two algorithms for term selection and evaluates three clustering algorithms (k-means, spectral, and graph partitioning) on five pre-classified databases. The databases were pre-processed using previously described methods. The results indicate that the implemented term selection algorithms improved the performance of the clustering algorithms used, and that the k-means and spectral algorithms outperformed the graph partitioning algorithm.
publishDate	2007
dc.date.issued.fl_str_mv	2007-08-31
dc.date.available.fl_str_mv	2008-04-17
dc.date.accessioned.fl_str_mv	2015-03-04T18:50:55Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	ALMEIDA, Luiz Gonzaga Paula de. Análise de algoritmos de agrupamento para base de dados textuais. 2007. 139 f. Dissertação (Mestrado em Modelagem computacional) - Laboratório Nacional de Computação Científica, Petrópolis, 2007.
dc.identifier.uri.fl_str_mv	https://tede.lncc.br/handle/tede/75
identifier_str_mv	ALMEIDA, Luiz Gonzaga Paula de. Análise de algoritmos de agrupamento para base de dados textuais. 2007. 139 f. Dissertação (Mestrado em Modelagem computacional) - Laboratório Nacional de Computação Científica, Petrópolis, 2007.
url	https://tede.lncc.br/handle/tede/75
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Laboratório Nacional de Computação Científica
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Modelagem Computacional
dc.publisher.initials.fl_str_mv	LNCC
dc.publisher.country.fl_str_mv	BR
dc.publisher.department.fl_str_mv	Serviço de Análise e Apoio a Formação de Recursos Humanos
publisher.none.fl_str_mv	Laboratório Nacional de Computação Científica
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações do LNCC instname:Laboratório Nacional de Computação Científica (LNCC) instacron:LNCC
instname_str	Laboratório Nacional de Computação Científica (LNCC)
instacron_str	LNCC
institution	LNCC
reponame_str	Biblioteca Digital de Teses e Dissertações do LNCC
collection	Biblioteca Digital de Teses e Dissertações do LNCC
bitstream.url.fl_str_mv	http://tede-server.lncc.br:8080/tede/bitstream/tede/75/1/Texto+completo http://tede-server.lncc.br:8080/tede/bitstream/tede/75/2/Texto+completo.jpg
bitstream.checksum.fl_str_mv	517d9c7b241b2bd9c799c807d6eac037 1b9140d8a7c41d7b4ded975f8b3c6723
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações do LNCC - Laboratório Nacional de Computação Científica (LNCC)
repository.mail.fl_str_mv	library@lncc.br
_version_	1797683217049845760
