Word embedding-based representations for short text

Detalhes bibliográficos
Autor(a) principal: Marcelo Rodrigo de Souza Pita
Data de Publicação: 2019
Tipo de documento: Tese
Idioma: eng
Título da fonte: Repositório Institucional da UFMG
Texto Completo: http://hdl.handle.net/1843/38885
https://orcid.org/0000-0001-7582-4651
Resumo: Short texts are everywhere on the Web, including social media, Q&A websites, advertisement text, and a growing number of other applications. They are characterized by few context words and a large collection vocabulary. This makes knowledge discovery in short texts challenging, motivating the development of novel, effective methods. An important part of this research focuses on topic modeling, which, beyond the popular LDA method, has produced algorithms specific to short texts. Text mining techniques depend on the way text is represented. The need for fixed-length inputs in most machine learning algorithms calls for vector representations, such as the classic TF and TF-IDF. These representations are sparse, however, and may induce the curse of dimensionality. At the word level, word vector models such as Skip-Gram and GloVe produce embeddings that are sensitive to semantics and consistent with vector algebra. A natural evolution of this research is the derivation of document vectors. This work contributes to two lines of research: short text representation for document classification and short text topic modeling (STTM). In the first line, we report a study that investigates proper ways of combining word vectors to produce document vectors. Strategies range from simple approaches, such as the sum and average of word vectors, to a sophisticated one based on the PSO meta-heuristic. Results on document classification are competitive with TF-IDF and show significant improvements over other methods. Regarding the second line of research, we propose a framework that creates larger pseudo-documents for STTM, from which we derive two implementations: (1) CoFE, based on word co-occurrence; and (2) DREx, which relies on word vectors. We also propose Vec2Graph, a graph-based representation for corpora induced by word vectors, and VGTM, a probabilistic short text topic model that works on top of Vec2Graph. 
Comparative experiments with state-of-the-art baselines show significant improvements in both NPMI and F1-score.
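The simplest document-vector strategy mentioned in the abstract is averaging the word vectors of a document's tokens. A minimal sketch of that idea, using a hypothetical toy embedding table (real experiments would use Skip-Gram or GloVe vectors trained on a large corpus):

```python
import numpy as np

# Hypothetical 3-dimensional word embeddings, for illustration only.
embeddings = {
    "cheap": np.array([0.9, 0.1, 0.0]),
    "phone": np.array([0.2, 0.8, 0.1]),
    "deal":  np.array([0.7, 0.2, 0.2]),
}

def document_vector(tokens, embeddings):
    """Average the vectors of in-vocabulary tokens; zero vector if none found."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        dim = len(next(iter(embeddings.values())))
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Out-of-vocabulary tokens ("oov") are simply skipped.
doc = document_vector(["cheap", "phone", "deal", "oov"], embeddings)
```

The result is a dense, fixed-length vector regardless of document length, which is exactly what fixed-input classifiers require; weighted variants (e.g. the PSO-optimized weighting the thesis investigates) replace the uniform mean with learned per-word weights.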
id UFMG_74572b360df087c190ee1c037b9de784
oai_identifier_str oai:repositorio.ufmg.br:1843/38885
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
dc.title.pt_BR.fl_str_mv Word embedding-based representations for short text
dc.title.alternative.pt_BR.fl_str_mv Representações de documentos curtos baseadas em vetores de palavras
title Word embedding-based representations for short text
spellingShingle Word embedding-based representations for short text
Marcelo Rodrigo de Souza Pita
Short text topic modeling
Short text representation
Word vectors
Computação – Teses
Modelagem de tópicos – Teses
Representação de textos - Teses
Processamento de linguagem natural (Computação) – Teses
Aprendizado de máquina – Teses
title_short Word embedding-based representations for short text
title_full Word embedding-based representations for short text
title_fullStr Word embedding-based representations for short text
title_full_unstemmed Word embedding-based representations for short text
title_sort Word embedding-based representations for short text
author Marcelo Rodrigo de Souza Pita
author_facet Marcelo Rodrigo de Souza Pita
author_role author
dc.contributor.advisor1.fl_str_mv Gisele Lobo Pappa
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/5936682335701497
dc.contributor.referee1.fl_str_mv Marcos André Gonçalves
dc.contributor.referee2.fl_str_mv Marco Antônio Pinheiro de Cristo
dc.contributor.referee3.fl_str_mv Alexandre Plastino de Carvalho
dc.contributor.referee4.fl_str_mv Pedro Olmo Stancioli Vaz de Melo
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/2463256611461412
dc.contributor.author.fl_str_mv Marcelo Rodrigo de Souza Pita
contributor_str_mv Gisele Lobo Pappa
Marcos André Gonçalves
Marco Antônio Pinheiro de Cristo
Alexandre Plastino de Carvalho
Pedro Olmo Stancioli Vaz de Melo
dc.subject.por.fl_str_mv Short text topic modeling
Short text representation
Word vectors
topic Short text topic modeling
Short text representation
Word vectors
Computação – Teses
Modelagem de tópicos – Teses
Representação de textos - Teses
Processamento de linguagem natural (Computação) – Teses
Aprendizado de máquina – Teses
dc.subject.other.pt_BR.fl_str_mv Computação – Teses
Modelagem de tópicos – Teses
Representação de textos - Teses
Processamento de linguagem natural (Computação) – Teses
Aprendizado de máquina – Teses
description Short texts are everywhere on the Web, including social media, Q&A websites, advertisement text, and a growing number of other applications. They are characterized by few context words and a large collection vocabulary. This makes knowledge discovery in short texts challenging, motivating the development of novel, effective methods. An important part of this research focuses on topic modeling, which, beyond the popular LDA method, has produced algorithms specific to short texts. Text mining techniques depend on the way text is represented. The need for fixed-length inputs in most machine learning algorithms calls for vector representations, such as the classic TF and TF-IDF. These representations are sparse, however, and may induce the curse of dimensionality. At the word level, word vector models such as Skip-Gram and GloVe produce embeddings that are sensitive to semantics and consistent with vector algebra. A natural evolution of this research is the derivation of document vectors. This work contributes to two lines of research: short text representation for document classification and short text topic modeling (STTM). In the first line, we report a study that investigates proper ways of combining word vectors to produce document vectors. Strategies range from simple approaches, such as the sum and average of word vectors, to a sophisticated one based on the PSO meta-heuristic. Results on document classification are competitive with TF-IDF and show significant improvements over other methods. Regarding the second line of research, we propose a framework that creates larger pseudo-documents for STTM, from which we derive two implementations: (1) CoFE, based on word co-occurrence; and (2) DREx, which relies on word vectors. We also propose Vec2Graph, a graph-based representation for corpora induced by word vectors, and VGTM, a probabilistic short text topic model that works on top of Vec2Graph. 
Comparative experiments with state-of-the-art baselines show significant improvements in both NPMI and F1-score.
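The topic-quality metric cited above, NPMI (normalized pointwise mutual information), scores how strongly two topic words co-occur, normalized into [-1, 1]. A minimal sketch of the per-pair computation, taking hypothetical corpus probabilities as inputs:

```python
import math

def npmi(p_ij, p_i, p_j):
    """Normalized PMI of a word pair.

    p_ij : joint probability of the two words co-occurring
    p_i, p_j : marginal probabilities of each word
    Returns a value in [-1, 1]; 1 means the words always co-occur,
    0 means independence.
    """
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / (-math.log(p_ij))
```

Topic coherence is then typically reported as the average NPMI over all pairs of each topic's top-N words; the probabilities themselves are estimated from co-occurrence counts in a reference corpus.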
publishDate 2019
dc.date.issued.fl_str_mv 2019-12-02
dc.date.accessioned.fl_str_mv 2021-12-17T23:31:39Z
dc.date.available.fl_str_mv 2021-12-17T23:31:39Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/38885
dc.identifier.orcid.pt_BR.fl_str_mv https://orcid.org/0000-0001-7582-4651
url http://hdl.handle.net/1843/38885
https://orcid.org/0000-0001-7582-4651
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv http://creativecommons.org/licenses/by-nd/3.0/pt/
info:eu-repo/semantics/openAccess
rights_invalid_str_mv http://creativecommons.org/licenses/by-nd/3.0/pt/
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv UFMG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br/bitstream/1843/38885/5/license_rdf
https://repositorio.ufmg.br/bitstream/1843/38885/6/license.txt
https://repositorio.ufmg.br/bitstream/1843/38885/4/Tese_Marcelo-Pita_final.pdf
bitstream.checksum.fl_str_mv 00e5e6a57d5512d202d12cb48704dfd6
cda590c95a0b51b4d15f60c9642ca272
5649621548654f620c401a463c6eb767
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_ 1803589219802677248