Representações distribuídas de texto aplicadas em análise de sentimento de mensagens curtas e ruidosas

Bossolani, Carlos Augusto

Bibliographic details
Main author:	Bossolani, Carlos Augusto
Publication date:	2018
Document type:	Master's thesis
Language:	por
Source title:	Repositório Institucional da UFSCAR
Full text:	https://repositorio.ufscar.br/handle/ufscar/10917
Abstract:	The evolution of the Internet and the Web has given rise to a vast number of text messages containing opinions. Although the importance of sentiment analysis has grown proportionately, using the traditional bag of words to represent these messages computationally imposes serious limitations: the number of dimensions in the samples may be very high; information about the relative position of the words in the text is lost; the relation of synonymy is not captured; and no distinction is made between the different meanings of ambiguous words. Short messages, such as those posted on social media and instant messaging applications, are often full of slang, abbreviations, phonetic spelling and emoticons, which aggravates the problem of computational representation. Lexical normalization and semantic indexing techniques, traditionally used to deal with these problems, depend on dictionaries whose maintenance is impractical given the speed at which language evolves. Distributed text representations, which represent each word by a low-dimensional vector, have the potential to bypass some of these shortcomings by capturing similarity relationships among words, storing information about the contexts in which they occur. Recent techniques have made it possible to obtain these vectors from the weights of an artificial neural network, which are optimized to maximize the probability of the contexts in which the word is observed. Later optimizations made it possible to train these models on much larger corpora, reviving interest in these techniques. This work investigated and confirmed the hypothesis that distributed text representation models overcome the problems and disadvantages of bag of words in sentiment analysis of short and noisy messages, dispensing with traditional lexical normalization and semantic indexing techniques while maintaining predictive quality and reducing computational effort.
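
The synonymy limitation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the two example messages, the hand-made two-dimensional vectors in TOY_EMBEDDINGS (real vectors would come from a trained model), and the choice of averaging word vectors to represent a message.

```python
import math
from collections import Counter

def bow_vector(tokens, vocab):
    """Bag-of-words count vector: one dimension per vocabulary word."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical hand-made vectors, NOT learned from data: words used in
# similar contexts are simply placed close together by construction.
TOY_EMBEDDINGS = {
    "great": [0.85, 0.15],
    "gr8": [0.80, 0.20],
    "movie": [0.10, 0.90],
    "film": [0.15, 0.85],
}

def mean_embedding(tokens, dim=2):
    """Represent a message as the average of its known word vectors."""
    vecs = [TOY_EMBEDDINGS[t] for t in tokens if t in TOY_EMBEDDINGS]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

msg_a = ["great", "movie"]
msg_b = ["gr8", "film"]  # same opinion, noisy spelling and a synonym

# The bag-of-words dimensionality grows with the vocabulary.
vocab = sorted(set(msg_a) | set(msg_b))
bow_sim = cosine(bow_vector(msg_a, vocab), bow_vector(msg_b, vocab))
emb_sim = cosine(mean_embedding(msg_a), mean_embedding(msg_b))

print(f"vocabulary size (BoW dimensions): {len(vocab)}")
print(f"bag-of-words similarity: {bow_sim:.2f}")
print(f"mean-embedding similarity: {emb_sim:.2f}")
```

Because the two messages share no tokens, their bag-of-words vectors are orthogonal, while the averaged toy vectors are nearly parallel. With vectors learned from a corpus the gap would be less clean, but the mechanism is the same.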

Item metadata

id	SCAR_11b11598236afb81a3b88cbec90a230c
oai_identifier_str	oai:repositorio.ufscar.br:ufscar/10917
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str	4322
spelling	Concatenation of the record fields listed below (author, advisor, Lattes URLs, dates, citation, abstract in English and Portuguese, subjects, publisher, bitstreams), plus the funding statement "Não recebi financiamento" (no funding received) and the Base64-encoded deposit license (LICENÇA DE DISTRIBUIÇÃO NÃO-EXCLUSIVA, the non-exclusive distribution license).
dc.title.por.fl_str_mv	Representações distribuídas de texto aplicadas em análise de sentimento de mensagens curtas e ruidosas
dc.title.alternative.eng.fl_str_mv	Distributed text representations applied in sentiment analysis of short and noisy messages
title	Representações distribuídas de texto aplicadas em análise de sentimento de mensagens curtas e ruidosas
author	Bossolani, Carlos Augusto
author_facet	Bossolani, Carlos Augusto
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/3008025733135785
dc.contributor.author.fl_str_mv	Bossolani, Carlos Augusto
dc.contributor.advisor1.fl_str_mv	Almeida, Tiago Agostinho de
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/5368680512020633
dc.contributor.authorID.fl_str_mv	dbc38720-e166-4996-8ba6-e093e46d794e
contributor_str_mv	Almeida, Tiago Agostinho de
dc.subject.por.fl_str_mv	Análise de sentimento; Processamento de linguagem natural; Aprendizado de máquina
topic	Análise de sentimento; Processamento de linguagem natural; Aprendizado de máquina; Sentiment analysis; Natural language processing; Machine learning; CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
dc.subject.eng.fl_str_mv	Sentiment analysis; Natural language processing; Machine learning
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
publishDate	2018
dc.date.issued.fl_str_mv	2018-12-14
dc.date.accessioned.fl_str_mv	2019-02-06T18:36:32Z
dc.date.available.fl_str_mv	2019-02-06T18:36:32Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	BOSSOLANI, Carlos Augusto. Representações distribuídas de texto aplicadas em análise de sentimento de mensagens curtas e ruidosas. 2018. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, Sorocaba, 2018. Disponível em: https://repositorio.ufscar.br/handle/ufscar/10917.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/ufscar/10917
url	https://repositorio.ufscar.br/handle/ufscar/10917
dc.language.iso.fl_str_mv	por
language	por
dc.relation.confidence.fl_str_mv	600 600
dc.relation.authority.fl_str_mv	5de967ad-743c-4f36-972b-79dd683c0e9d
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos; Câmpus Sorocaba
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação - PPGCC-So
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos; Câmpus Sorocaba
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstream/ufscar/10917/1/carlos_dissertacao_homologacao.pdf https://repositorio.ufscar.br/bitstream/ufscar/10917/3/Encaminhamento_Carlos_assinado.pdf https://repositorio.ufscar.br/bitstream/ufscar/10917/4/license.txt https://repositorio.ufscar.br/bitstream/ufscar/10917/5/carlos_dissertacao_homologacao.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/10917/6/Encaminhamento_Carlos_assinado.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/10917/7/carlos_dissertacao_homologacao.pdf.jpg https://repositorio.ufscar.br/bitstream/ufscar/10917/8/Encaminhamento_Carlos_assinado.pdf.jpg
bitstream.checksum.fl_str_mv	5df5c3c0cb32944bf2dbf80a947ecfb8 e3c0ee73b7a75ae5fb79779f03e82711 ae0398b6f8b235e40ad82cba6c50031d 6683824e6f0c4f201dd0c6eb907bf43c 68b329da9893e34099c7d8ad5cb9c940 3fcca89ee151f36649655e15b11e7dca 323bea2d0f6411c5a3a3528244068c4b
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_	1802136352443072512
