Detecção de tópicos em documentos usando agrupamento de vetores de palavras

Miranda, Guilherme Raiol de

Bibliographic details
Main author:	Miranda, Guilherme Raiol de
Publication date:	2020
Document type:	Master's dissertation
Language:	Portuguese (por)
Source title:	Biblioteca Digital de Teses e Dissertações do Mackenzie
Full text:	https://dspace.mackenzie.br/handle/10899/28616
Abstract:	With the exponential increase in text generated each year, demand for Natural Language Processing techniques has been growing in both industry and academia. Automatic topic detection in documents is one of the most challenging and useful tasks for information discovery and document summarization. Traditional topic detection techniques, such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), in their original formulations do not produce good results on large datasets, nor do they use syntactic and semantic information to generate topics. Recently, word vectorization techniques such as Word2Vec have shown good computational performance on large datasets and have proved effective at representing words as distributed word vectors that preserve syntactic and semantic information. This dissertation investigates the following research question: is Word2Vec capable of providing enough information to generate interpretable topics? To test it, a method named Word2Topic was proposed, with two approaches to topic generation: direct clustering of word vectors, and clustering after dimensionality reduction. The method was applied to two benchmark datasets and compared with the traditional algorithms using a topic interpretability metric. On one of the datasets, the proposed techniques generated sets of interpretable words or of words from similar morphological classes. The topics obtained were similar to those of NMF, while LDA was unable to generate interpretable topics. The research question could not be fully validated, because the results on the second dataset did not show the same interpretability, nor did they yield morphologically similar words.
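As a rough illustration of the first approach named in the abstract (direct grouping of word vectors), the sketch below clusters a handful of toy vectors into two groups. Everything in it is an assumption made for illustration: the abstract does not say which clustering algorithm Word2Topic uses, so plain k-means with a farthest-first initialisation is swapped in, and the words and 2-D vectors are hypothetical stand-ins for embeddings from a trained Word2Vec model.

```python
# Toy 2-D vectors standing in for trained Word2Vec embeddings (hypothetical values).
WORD_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (1.0, 0.0),
    "vote": (0.1, 0.9),
    "senate": (0.0, 1.0),
    "election": (0.2, 0.8),
}


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def cluster_words(vectors, k, iterations=10):
    """Group words into k clusters with k-means (an assumed algorithm choice)."""
    words = sorted(vectors)
    # Farthest-first initialisation keeps the run deterministic.
    centroids = [vectors[words[0]]]
    while len(centroids) < k:
        farthest = max(
            words, key=lambda w: min(sq_dist(vectors[w], c) for c in centroids)
        )
        centroids.append(vectors[farthest])
    clusters = []
    for _ in range(iterations):
        # Assignment step: each word joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for word in words:
            nearest = min(range(k), key=lambda i: sq_dist(vectors[word], centroids[i]))
            clusters[nearest].append(word)
        # Update step: an empty cluster keeps its previous centroid.
        centroids = [
            mean([vectors[w] for w in members]) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return clusters


topics = cluster_words(WORD_VECTORS, k=2)
for number, members in enumerate(topics):
    print(f"topic {number}: {', '.join(members)}")
```

On these toy vectors the two clusters separate the election-related words from the sports-related ones; each cluster's word list plays the role of a candidate topic. The second approach in the abstract would apply a dimensionality reduction step to the vectors before calling the same clustering routine.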

Item metadata

id	UPM_74542ae47f248fca67970a1afe590fe1
oai_identifier_str	oai:dspace.mackenzie.br:10899/28616
network_acronym_str	UPM
network_name_str	Biblioteca Digital de Teses e Dissertações do Mackenzie
repository_id_str	10277
spelling	Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior; Fundo Mackenzie de Pesquisa. Keywords (English): natural language processing; topic modeling; word2vec.
dc.title.por.fl_str_mv	Detecção de tópicos em documentos usando agrupamento de vetores de palavras
author	Miranda, Guilherme Raiol de
author_facet	Miranda, Guilherme Raiol de
author_role	author
dc.contributor.advisor1.fl_str_mv	Silva, Leandro Nunes de Castro
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/2741458816539568
dc.contributor.referee1.fl_str_mv	Araújo, Renata Mendes de
dc.contributor.referee1Lattes.fl_str_mv	http://lattes.cnpq.br/3589012014320121
dc.contributor.referee2.fl_str_mv	Coello, Juan Manuel Adán
dc.contributor.referee2Lattes.fl_str_mv	http://lattes.cnpq.br/3087162397314631 / https://orcid.org/0000-0001-5942-9598
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/6553504314681393
dc.contributor.author.fl_str_mv	Miranda, Guilherme Raiol de
contributor_str_mv	Silva, Leandro Nunes de Castro; Araújo, Renata Mendes de; Coello, Juan Manuel Adán
dc.subject.por.fl_str_mv	processamento de língua natural; detecção de tópicos; word2vec
topic	processamento de língua natural; detecção de tópicos; word2vec; CNPQ::ENGENHARIAS
dc.subject.cnpq.fl_str_mv	CNPQ::ENGENHARIAS
publishDate	2020
dc.date.issued.fl_str_mv	2020-08-21
dc.date.accessioned.fl_str_mv	2021-12-18T21:44:28Z
dc.date.available.fl_str_mv	2021-12-18T21:44:28Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	MIRANDA, Guilherme Raiol de. Detecção de tópicos em documentos usando agrupamento de vetores de palavras. 2020. 91 f. Dissertação (Engenharia Elétrica) - Universidade Presbiteriana Mackenzie, São Paulo, 2020
dc.identifier.uri.fl_str_mv	https://dspace.mackenzie.br/handle/10899/28616
identifier_str_mv	MIRANDA, Guilherme Raiol de. Detecção de tópicos em documentos usando agrupamento de vetores de palavras. 2020. 91 f. Dissertação (Engenharia Elétrica) - Universidade Presbiteriana Mackenzie, São Paulo, 2020
url	https://dspace.mackenzie.br/handle/10899/28616
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Presbiteriana Mackenzie
dc.publisher.program.fl_str_mv	Engenharia Elétrica
dc.publisher.initials.fl_str_mv	UPM
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	Escola de Engenharia Mackenzie (EE)
publisher.none.fl_str_mv	Universidade Presbiteriana Mackenzie
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações do Mackenzie instname:Universidade Presbiteriana Mackenzie (MACKENZIE) instacron:MACKENZIE
instname_str	Universidade Presbiteriana Mackenzie (MACKENZIE)
instacron_str	MACKENZIE
institution	MACKENZIE
reponame_str	Biblioteca Digital de Teses e Dissertações do Mackenzie
collection	Biblioteca Digital de Teses e Dissertações do Mackenzie
bitstream.url.fl_str_mv	https://dspace.mackenzie.br/bitstream/10899/28616/1/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf https://dspace.mackenzie.br/bitstream/10899/28616/2/license_url https://dspace.mackenzie.br/bitstream/10899/28616/3/license_text https://dspace.mackenzie.br/bitstream/10899/28616/4/license_rdf https://dspace.mackenzie.br/bitstream/10899/28616/5/license.txt https://dspace.mackenzie.br/bitstream/10899/28616/6/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf.txt https://dspace.mackenzie.br/bitstream/10899/28616/7/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf.jpg
bitstream.checksum.fl_str_mv	0b9f047ad10cf884d01bfba56dbd8e42 4afdbb8c545fd630ea7db775da747b2f d41d8cd98f00b204e9800998ecf8427e d41d8cd98f00b204e9800998ecf8427e 1ca4f25d161e955cf4b7a4aa65b8e96e a3cd565a14beb0c0c12f48afa7311d65 48d5b09d05bffe4c06848c881c192bfc
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5
_version_	1757177244663414784
