Vector representation of texts applied to prediction models

Detalhes bibliográficos
Autor(a) principal: Stern, Deborah Bassi
Data de Publicação: 2020
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Biblioteca Digital de Teses e Dissertações da USP
Texto Completo: https://www.teses.usp.br/teses/disponiveis/104/104131/tde-10062020-102333/
Resumo: Natural Language Processing has gone through substantial changes over time. It was only recently that statistical approaches started receiving attention. The Word2Vec model is one of these. It is a shallow neural network designed to fit vectorial representations of words according to their syntactic and semantic values. The word embeddings acquired by this method are stateof- art. This method has many uses, one of which is the fitting of prediction models based on texts. It is common in the literature for a text to be represented as the mean of its word embeddings. The resulting vector is then used in the predictive model as an explanatory variables. In this dissertation, we propose getting more information of text by adding other summary statistics besides the mean, such as other moments and quantiles. The improvement of the prediction models is studied in real datasets.
id USP_c550f9efa5cfb5fa4884a784ab4ea64a
oai_identifier_str oai:teses.usp.br:tde-10062020-102333
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling Vector representation of texts applied to prediction modelsRepresentações vetoriais de textos aplicados a modelos preditivosModelos de prediçãoNatural language processingNeural networksPrediction modelsProcessamento de linguagem naturalRedes neuraisRepresentação vetorial de palavrasWordVectorsNatural Language Processing has gone through substantial changes over time. It was only recently that statistical approaches started receiving attention. The Word2Vec model is one of these. It is a shallow neural network designed to fit vectorial representations of words according to their syntactic and semantic values. The word embeddings acquired by this method are stateof- art. This method has many uses, one of which is the fitting of prediction models based on texts. It is common in the literature for a text to be represented as the mean of its word embeddings. The resulting vector is then used in the predictive model as an explanatory variables. In this dissertation, we propose getting more information of text by adding other summary statistics besides the mean, such as other moments and quantiles. The improvement of the prediction models is studied in real datasets.Processamento de linguagem natural sofreu uma grande mudança com o tempo. Abordagens estatísticas passaram a ganhar atenção apenas recentemente. O modelo word2vec é uma destas. Ele é uma rede neural rasa desenhada para ajustar representações vetoriais de palavras segundo seus valores semânticos e sintáticos. As representações de palavras obtidas por este método são o estado da arte. Este método tem muitas aplicações, como permitir o ajuste de modelos preditivos baseadas em textos. Na literatura é comum um texto ser representado pela média das representações vetorias das palavras que o compõem. O vetor resultante é então incluído como variável explicativa no modelo. Nesta dissertação propomos a obtenção de mais informação sobre o texto através de outras estatísticas descritivas além da média, como outros momentos e quantis. A melhora dos modelos preditivos é estudada com dados reais.Biblioteca Digitais de Teses e Dissertações da USPIzbicki, RafaelStern, Deborah Bassi2020-03-09info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/104/104131/tde-10062020-102333/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2020-06-10T16:31:03Zoai:teses.usp.br:tde-10062020-102333Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212020-06-10T16:31:03Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv Vector representation of texts applied to prediction models
Representações vetoriais de textos aplicados a modelos preditivos
title Vector representation of texts applied to prediction models
spellingShingle Vector representation of texts applied to prediction models
Stern, Deborah Bassi
Modelos de predição
Natural language processing
Neural networks
Prediction models
Processamento de linguagem natural
Redes neurais
Representação vetorial de palavras
WordVectors
title_short Vector representation of texts applied to prediction models
title_full Vector representation of texts applied to prediction models
title_fullStr Vector representation of texts applied to prediction models
title_full_unstemmed Vector representation of texts applied to prediction models
title_sort Vector representation of texts applied to prediction models
author Stern, Deborah Bassi
author_facet Stern, Deborah Bassi
author_role author
dc.contributor.none.fl_str_mv Izbicki, Rafael
dc.contributor.author.fl_str_mv Stern, Deborah Bassi
dc.subject.por.fl_str_mv Modelos de predição
Natural language processing
Neural networks
Prediction models
Processamento de linguagem natural
Redes neurais
Representação vetorial de palavras
WordVectors
topic Modelos de predição
Natural language processing
Neural networks
Prediction models
Processamento de linguagem natural
Redes neurais
Representação vetorial de palavras
WordVectors
description Natural Language Processing has gone through substantial changes over time. It was only recently that statistical approaches started receiving attention. The Word2Vec model is one of these. It is a shallow neural network designed to fit vectorial representations of words according to their syntactic and semantic values. The word embeddings acquired by this method are stateof- art. This method has many uses, one of which is the fitting of prediction models based on texts. It is common in the literature for a text to be represented as the mean of its word embeddings. The resulting vector is then used in the predictive model as an explanatory variables. In this dissertation, we propose getting more information of text by adding other summary statistics besides the mean, such as other moments and quantiles. The improvement of the prediction models is studied in real datasets.
publishDate 2020
dc.date.none.fl_str_mv 2020-03-09
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/104/104131/tde-10062020-102333/
url https://www.teses.usp.br/teses/disponiveis/104/104131/tde-10062020-102333/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Liberar o conteúdo para acesso público.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Liberar o conteúdo para acesso público.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815256913708843008