Exploiting semantic similarity for improved text representation
Main author: | Victor Silva Rodrigues |
---|---|
Publication date: | 2018 |
Document type: | Master's thesis |
Language: | eng |
Source: | Repositório Institucional da UFMG |
Full text: | http://hdl.handle.net/1843/39134 |
Abstract: | Automatic Document Classification is a key technique for extracting useful information from the huge amount of textual data produced daily on the Web and inside organizations. Recently, Word Embeddings (e.g., Word2Vec) have been proposed to represent terms as vectors whose similarities correlate with semantic relatedness, and there has been some research on using Word Embeddings to improve text classification. Nevertheless, current results depend on heavy and careful parameter tuning and still do not consistently outperform the Bag-of-Words representation across scenarios. Since the nearest neighbors of a given Word Embedding are all semantically related to each other, we propose a new method for generating features from clusters of similar Word Embeddings. We refer to these clusters as hyperwords, since they correspond to new semantic concepts richer than simple words. We propose an adaptation of the TF-IDF weighting scheme for these new features so that they can be used similarly to the original terms, effectively substituting them. We demonstrate that features generated from hyperwords are significantly more discriminative than those obtained from simple words. We also experiment with combining the hyperwords-based representation with a state-of-the-art pooling technique, obtaining a very robust method. Extensive experiments on 24 topic classification and sentiment analysis benchmarks against state-of-the-art baselines that exploit Word Embedding-based document representations show the superiority of our proposals by large margins, with gains of up to 18% on topic classification datasets and 16% on sentiment classification datasets over the Bag-of-Words representation. |
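The abstract describes two steps: clustering similar word embeddings into "hyperwords", and adapting TF-IDF to weight documents over those clusters instead of over raw terms. The following is a minimal, self-contained sketch of that idea under stated assumptions: a tiny hand-rolled k-means, toy 2-D vectors, and a smoothed log-scaled TF-IDF variant. All names and choices here are illustrative, not the thesis's actual implementation.

```python
# Hypothetical sketch of the "hyperwords" idea: cluster similar word vectors,
# then represent each document over cluster ids with a TF-IDF-style weighting.
import math
import random
from collections import Counter

def kmeans(vectors, k, n_iter=20, seed=0):
    """Minimal k-means over a dict {word: vector}; returns {word: cluster id}."""
    rng = random.Random(seed)
    items = list(vectors.items())
    centroids = [list(v) for _, v in rng.sample(items, k)]
    assign = {}
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for w, v in items:
            assign[w] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for w, v in items if assign[w] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def hyperword_tfidf(docs, word2cluster, k):
    """Represent each document over cluster ids ('hyperwords') with TF-IDF."""
    n_docs = len(docs)
    per_doc = [Counter(word2cluster[w] for w in doc if w in word2cluster)
               for doc in docs]
    df = Counter()  # document frequency of each hyperword
    for counts in per_doc:
        df.update(counts.keys())
    matrix = []
    for counts in per_doc:
        row = [0.0] * k
        for c, tf in counts.items():
            # Log-scaled TF times a smoothed IDF (an assumption; the thesis
            # defines its own adaptation of the weighting scheme).
            row[c] = (1 + math.log(tf)) * math.log(1 + n_docs / df[c])
        matrix.append(row)
    return matrix

# Toy embeddings: two animal words and two vehicle words in 2-D.
vecs = {"cat": [1.0, 0.0], "dog": [0.9, 0.1],
        "car": [0.0, 1.0], "truck": [0.1, 0.9]}
w2c = kmeans(vecs, k=2)          # "cat"/"dog" land in one hyperword, "car"/"truck" in the other
docs = [["cat", "dog", "dog"], ["car", "truck"]]
X = hyperword_tfidf(docs, w2c, k=2)
```

With real embeddings the vectors would come from a trained model (e.g., Word2Vec) and `k` would be far larger; the point of the sketch is only that documents end up as dense-over-clusters vectors that can replace the original term features.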
id |
UFMG_b2063c198be61188a0abf44f412fce8d |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/39134 |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
spelling |
Sponsorship: Outra Agência |
dc.title.pt_BR.fl_str_mv |
Exploiting semantic similarity for improved text representation |
dc.title.alternative.pt_BR.fl_str_mv |
Utilizando similaridade semântica para aprimorar a representação de documentos textuais |
author |
Victor Silva Rodrigues |
dc.contributor.advisor1.fl_str_mv |
Marcos André Gonçalves |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/3457219624656691 |
dc.contributor.referee1.fl_str_mv |
Gisele Lobo Pappa |
dc.contributor.referee2.fl_str_mv |
Mário Sérgio Ferreira Alvim Júnior |
dc.contributor.referee3.fl_str_mv |
Leonardo Chaves Dutra da Rocha |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/7314598614070575 |
dc.subject.por.fl_str_mv |
Text classification Hyperwords Bag-of-Words Word embeddings Classificação automática de documentos Hyper-palavras Saco-de-Palavras Vetores de palavras |
dc.subject.other.pt_BR.fl_str_mv |
Computação – Teses. Indexação automática – Teses. Processamento da linguagem natural (Computação) – Teses |
dc.date.issued.fl_str_mv |
2018-08-24 |
dc.date.accessioned.fl_str_mv |
2022-01-20T18:49:35Z |
dc.date.available.fl_str_mv |
2022-01-20T18:49:35Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/39134 |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação |
dc.publisher.initials.fl_str_mv |
UFMG |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
bitstream.url.fl_str_mv |
https://repositorio.ufmg.br/bitstream/1843/39134/3/ExploitingSemanticSimilarityForImprovedTextRepresentation.pdf https://repositorio.ufmg.br/bitstream/1843/39134/4/license.txt |
bitstream.checksum.fl_str_mv |
4815da9566a2cb424642fc7463cda803 cda590c95a0b51b4d15f60c9642ca272 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |