Mineração de Textos usando Word Embeddings com Contexto Geográfico (Text Mining Using Word Embeddings with Geographic Context)

Bibliographic details
Main author: Antônio Ronaldo da Silva
Publication date: 2022
Document type: Master's thesis
Language: Portuguese (por)
Source title: Repositório Institucional da UFMS
Full text: https://repositorio.ufms.br/handle/123456789/5420
Abstract: Many important phenomena are tied to a geographic context, such as events extracted from text collections on economics, public health, urban violence, and social issues. Analyzing these events manually is impractical given their volume and the variety of data sources. This creates a need for intelligent computational methods such as Text Mining, which make it possible to explore textual content with geographic information and to surface patterns that traditional models would miss. The traditional approach to relating terms and regions is to estimate the probability that a term is used in texts associated with a region, generally from the frequency of terms per region. However, this approach is known to fail for terms the model has not seen before and for texts with ambiguous terms. Models based on Word Embeddings are recognized for improving the identification of relationships between a word and its likely associated location. This work therefore investigates textual representations based on Word Embeddings from BERT (Bidirectional Encoder Representations from Transformers) models in a fine-tuning process in which the georeferenced information of the texts is used as context. The resulting proposal is named the GeoTransformers Language Model. One of its distinguishing features is that it automatically identifies macro-regions and micro-regions from the events and uses them as context for fine-tuning a language model. Compared with other models in the literature, GeoTransformers obtained higher precision, recall, and F1-score. Moreover, it was the only model able to handle regions that have fewer events and are difficult to classify.
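The frequency-based baseline mentioned in the abstract can be illustrated with a minimal sketch. This is not the dissertation's code: the function names, the whitespace tokenizer, and the toy events below are all assumptions made for illustration. It estimates P(region | term) from raw counts and shows why a term that never appeared in the training texts yields no estimate at all.

```python
from collections import Counter, defaultdict


def fit_term_region_counts(docs):
    """Count how often each term appears in texts tagged with each region.

    `docs` is an iterable of (text, region) pairs. Tokenization is a naive
    lowercase whitespace split, chosen only to keep the sketch short.
    """
    counts = defaultdict(Counter)  # term -> Counter({region: count})
    for text, region in docs:
        for term in text.lower().split():
            counts[term][region] += 1
    return counts


def region_given_term(counts, term):
    """Estimate P(region | term) by relative frequency.

    Returns an empty dict for a term that was never observed, which is the
    failure case the abstract describes for new terms.
    """
    by_region = counts.get(term.lower())
    if not by_region:
        return {}
    total = sum(by_region.values())
    return {region: n / total for region, n in by_region.items()}


# Toy, made-up events (hypothetical; not data from the dissertation).
docs = [
    ("flood closes downtown road", "north"),
    ("flood damages crops", "south"),
    ("flood warning issued", "north"),
    ("robbery reported downtown", "north"),
]
counts = fit_term_region_counts(docs)
print(region_given_term(counts, "flood"))      # seen term: one probability per region
print(region_given_term(counts, "landslide"))  # unseen term: nothing to estimate
```

In this toy data, "flood" occurs twice in texts from "north" and once in "south", so the estimate is 2/3 versus 1/3, while "landslide" returns an empty result. An embedding-based model can still place an unseen or ambiguous term near related words, which is the motivation the abstract gives for moving beyond frequency counts.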
id UFMS_7f62fa633d48a1f447c4fd80e53db656
oai_identifier_str oai:repositorio.ufms.br:123456789/5420
network_acronym_str UFMS
network_name_str Repositório Institucional da UFMS
repository_id_str 2124
dc.title.pt_BR.fl_str_mv Mineração de Textos usando Word Embeddings com Contexto Geográfico
title Mineração de Textos usando Word Embeddings com Contexto Geográfico
author Antônio Ronaldo da Silva
author_facet Antônio Ronaldo da Silva
author_role author
dc.contributor.advisor1.fl_str_mv Ricardo Marcondes Marcacini
dc.contributor.author.fl_str_mv Antônio Ronaldo da Silva
contributor_str_mv Ricardo Marcondes Marcacini
dc.subject.por.fl_str_mv Análise de Eventos
Word Embeddings
Textos Georreferenciados
Mineração de Textos
topic Event Analysis
Word Embeddings
Georeferenced Texts
Text Mining
publishDate 2022
dc.date.accessioned.fl_str_mv 2022-12-02T21:09:49Z
dc.date.available.fl_str_mv 2022-12-02T21:09:49Z
dc.date.issued.fl_str_mv 2022
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://repositorio.ufms.br/handle/123456789/5420
url https://repositorio.ufms.br/handle/123456789/5420
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Fundação Universidade Federal de Mato Grosso do Sul
dc.publisher.initials.fl_str_mv UFMS
dc.publisher.country.fl_str_mv Brasil
publisher.none.fl_str_mv Fundação Universidade Federal de Mato Grosso do Sul
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMS
instname:Universidade Federal de Mato Grosso do Sul (UFMS)
instacron:UFMS
instname_str Universidade Federal de Mato Grosso do Sul (UFMS)
instacron_str UFMS
institution UFMS
reponame_str Repositório Institucional da UFMS
collection Repositório Institucional da UFMS
bitstream.url.fl_str_mv https://repositorio.ufms.br/bitstream/123456789/5420/-1/dissertacao_antonio_ronaldo_da_silva.pdf
bitstream.checksum.fl_str_mv 7c3adf23299b83a5ad93c68bfd02d67f
bitstream.checksumAlgorithm.fl_str_mv MD5
repository.name.fl_str_mv Repositório Institucional da UFMS - Universidade Federal de Mato Grosso do Sul (UFMS)
repository.mail.fl_str_mv ri.prograd@ufms.br
_version_ 1815448018164383744