Detecção de tópicos em documentos usando agrupamento de vetores de palavras (Topic detection in documents using word-vector clustering)

Bibliographic details
Lead author: Miranda, Guilherme Raiol de
Publication date: 2020
Document type: Master's thesis (Dissertação)
Language: Portuguese (por)
Source title: Biblioteca Digital de Teses e Dissertações do Mackenzie
Full text: https://dspace.mackenzie.br/handle/10899/28616
Abstract: With the exponential increase in texts generated each year, demand for Natural Language Processing techniques has been growing, both from companies and from academia. Automatic topic detection in documents is one of the most challenging and useful tasks for information discovery and document summarization. Traditional topic detection techniques, such as Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF), in their original form neither produce good results on large datasets nor use syntactic and semantic information to generate topics. Recently, word vectorization techniques such as Word2Vec have shown good computational performance on large datasets and have proved effective at representing words as distributed word vectors that preserve syntactic and semantic information. This dissertation investigates the following research question: is Word2Vec capable of providing enough information to generate interpretable topics? To answer it, a method named Word2Topic was proposed, with two approaches to topic generation: direct grouping of word vectors, and grouping after dimensionality reduction. The method was applied to two benchmark datasets and compared with the traditional algorithms using a topic interpretability metric. On one of the datasets, the proposed techniques generated sets of words that were interpretable or that belonged to similar morphological classes. The topics obtained were similar to those of NMF, while LDA was unable to generate interpretable topics. The research question could not be fully validated, because the results on the second dataset did not show the same interpretability and did not yield morphologically similar words.
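The "direct grouping of word vectors" approach mentioned in the abstract can be illustrated with a minimal sketch. It assumes k-means as the clustering algorithm and uses made-up 2-D vectors in place of trained Word2Vec embeddings; the vectors, the function names, and the choice of k-means are illustrative assumptions, not details taken from the dissertation.

```python
import math

# Hypothetical 2-D "word vectors" (an assumption for illustration); real
# Word2Vec embeddings are learned from a corpus and have many more dimensions.
TOY_VECTORS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "team": (0.85, 0.15),
    "stock": (0.1, 0.9),
    "market": (0.2, 0.8),
    "bank": (0.15, 0.85),
}


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain k-means: assign each word to its nearest centroid, then update centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        assignment = {
            word: min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
            for word, vec in vectors.items()
        }
        # Update step: mean of each cluster's vectors (an empty cluster keeps its centroid).
        for i in range(len(centroids)):
            members = [vectors[w] for w, c in assignment.items() if c == i]
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment


def topics_from_clusters(assignment):
    """Read each cluster of words as one candidate topic."""
    topics = {}
    for word, cluster in assignment.items():
        topics.setdefault(cluster, []).append(word)
    return [sorted(words) for _, words in sorted(topics.items())]


# Seed the centroids with two of the words so the run is deterministic.
seeds = [TOY_VECTORS["goal"], TOY_VECTORS["stock"]]
topics = topics_from_clusters(kmeans(TOY_VECTORS, seeds))
print(topics)
```

Under the same assumptions, the second approach named in the abstract would apply a dimensionality reduction step to the vectors before passing them to `kmeans`.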
id UPM_74542ae47f248fca67970a1afe590fe1
oai_identifier_str oai:dspace.mackenzie.br:10899/28616
network_acronym_str UPM
network_name_str Biblioteca Digital de Teses e Dissertações do Mackenzie
repository_id_str 10277
spelling Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior; Fundo Mackenzie de Pesquisa
Keywords (English): natural language processing; topic modeling; word2vec
Files: GUILHERME RAIOL DE MIRANDA - protegido.pdf (application/pdf, size 1040118); license.txt (text/plain, size 2108); GUILHERME RAIOL DE MIRANDA - protegido.pdf.txt (extracted text, text/plain, size 135427); GUILHERME RAIOL DE MIRANDA - protegido.pdf.jpg (generated thumbnail, image/jpeg, size 1205)
Repository: Biblioteca Digital de Teses e Dissertações, http://tede.mackenzie.br/jspui/
dc.title.por.fl_str_mv Detecção de tópicos em documentos usando agrupamento de vetores de palavras
title Detecção de tópicos em documentos usando agrupamento de vetores de palavras
spellingShingle Detecção de tópicos em documentos usando agrupamento de vetores de palavras
Miranda, Guilherme Raiol de
processamento de língua natural
detecção de tópicos
word2vec
CNPQ::ENGENHARIAS
title_short Detecção de tópicos em documentos usando agrupamento de vetores de palavras
title_full Detecção de tópicos em documentos usando agrupamento de vetores de palavras
title_fullStr Detecção de tópicos em documentos usando agrupamento de vetores de palavras
title_full_unstemmed Detecção de tópicos em documentos usando agrupamento de vetores de palavras
title_sort Detecção de tópicos em documentos usando agrupamento de vetores de palavras
author Miranda, Guilherme Raiol de
author_facet Miranda, Guilherme Raiol de
author_role author
dc.contributor.advisor1.fl_str_mv Silva, Leandro Nunes de Castro
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/2741458816539568
dc.contributor.referee1.fl_str_mv Araújo, Renata Mendes de
dc.contributor.referee1Lattes.fl_str_mv http://lattes.cnpq.br/3589012014320121
dc.contributor.referee2.fl_str_mv Coello, Juan Manuel Adán
dc.contributor.referee2Lattes.fl_str_mv http://lattes.cnpq.br/3087162397314631 / https://orcid.org/0000-0001-5942-9598
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/6553504314681393
dc.contributor.author.fl_str_mv Miranda, Guilherme Raiol de
contributor_str_mv Silva, Leandro Nunes de Castro
Araújo, Renata Mendes de
Coello, Juan Manuel Adán
dc.subject.por.fl_str_mv processamento de língua natural
detecção de tópicos
word2vec
topic processamento de língua natural
detecção de tópicos
word2vec
CNPQ::ENGENHARIAS
dc.subject.cnpq.fl_str_mv CNPQ::ENGENHARIAS
publishDate 2020
dc.date.issued.fl_str_mv 2020-08-21
dc.date.accessioned.fl_str_mv 2021-12-18T21:44:28Z
dc.date.available.fl_str_mv 2021-12-18T21:44:28Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv MIRANDA, Guilherme Raiol de. Detecção de tópicos em documentos usando agrupamento de vetores de palavras. 2020. 91 f. Dissertação (Engenharia Elétrica) - Universidade Presbiteriana Mackenzie, São Paulo, 2020
dc.identifier.uri.fl_str_mv https://dspace.mackenzie.br/handle/10899/28616
identifier_str_mv MIRANDA, Guilherme Raiol de. Detecção de tópicos em documentos usando agrupamento de vetores de palavras. 2020. 91 f. Dissertação (Engenharia Elétrica) - Universidade Presbiteriana Mackenzie, São Paulo, 2020
url https://dspace.mackenzie.br/handle/10899/28616
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nc-nd/4.0/
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Presbiteriana Mackenzie
dc.publisher.program.fl_str_mv Engenharia Elétrica
dc.publisher.initials.fl_str_mv UPM
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Escola de Engenharia Mackenzie (EE)
publisher.none.fl_str_mv Universidade Presbiteriana Mackenzie
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações do Mackenzie
instname:Universidade Presbiteriana Mackenzie (MACKENZIE)
instacron:MACKENZIE
instname_str Universidade Presbiteriana Mackenzie (MACKENZIE)
instacron_str MACKENZIE
institution MACKENZIE
reponame_str Biblioteca Digital de Teses e Dissertações do Mackenzie
collection Biblioteca Digital de Teses e Dissertações do Mackenzie
bitstream.url.fl_str_mv https://dspace.mackenzie.br/bitstream/10899/28616/1/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf
https://dspace.mackenzie.br/bitstream/10899/28616/2/license_url
https://dspace.mackenzie.br/bitstream/10899/28616/3/license_text
https://dspace.mackenzie.br/bitstream/10899/28616/4/license_rdf
https://dspace.mackenzie.br/bitstream/10899/28616/5/license.txt
https://dspace.mackenzie.br/bitstream/10899/28616/6/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf.txt
https://dspace.mackenzie.br/bitstream/10899/28616/7/GUILHERME%20RAIOL%20DE%20MIRANDA%20-%20protegido.pdf.jpg
bitstream.checksum.fl_str_mv 0b9f047ad10cf884d01bfba56dbd8e42
4afdbb8c545fd630ea7db775da747b2f
d41d8cd98f00b204e9800998ecf8427e
d41d8cd98f00b204e9800998ecf8427e
1ca4f25d161e955cf4b7a4aa65b8e96e
a3cd565a14beb0c0c12f48afa7311d65
48d5b09d05bffe4c06848c881c192bfc
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv
repository.mail.fl_str_mv
_version_ 1757177244663414784