Identification and analysis of health states in twitter messages

Bibliographic details
Main author: Morais, Edgar Guilherme Silva
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10773/36519
Abstract: Social media is used all over the world because it connects people from different countries and creates global communities. Twitter, one of the most prominent social media platforms, lets users share text segments with a maximum length of 280 characters, and as a result it generates very large amounts of text data about its users' lives. This data can be used to extract health information about a segment of the population for the purpose of public health surveillance. The Social Media Mining for Health Shared Task is a challenge that encompasses many Natural Language Processing tasks related to the use of social media data for health research. This dissertation describes the approach I used in my participation in Task 1 of the Shared Task, which was divided into three subtasks. Subtask 1a consisted of classifying tweets according to the presence of Adverse Drug Events (ADEs). Subtask 1b was a Named Entity Recognition task aimed at detecting ADE spans in tweets. Subtask 1c was a normalization task that sought to match each ADE mention to a Medical Dictionary for Regulatory Activities (MedDRA) preferred term ID. To find the best approach for each subtask, I ran many experiments with different models and techniques, using transformer-based models together with other techniques that address the challenges specific to each subtask. The best-performing approach for subtask 1a was a BERTweet large model trained with an augmented training set. For subtask 1b, the best results were obtained with a RoBERTa large model trained on oversampled data. For subtask 1c, I used a RoBERTa base model trained with data from an additional dataset beyond the one made available by the shared task organizers. The systems used for subtasks 1a and 1b both achieved state-of-the-art performance; the approach for subtask 1c, however, did not achieve favorable results. The system used in subtask 1a achieved an F1 score of 0.698, the one used in subtask 1b achieved a relaxed F1 score of 0.661, and the one used in subtask 1c achieved a relaxed F1 score of 0.116.
id RCAP_54b3060160977dbf883798168f282939
oai_identifier_str oai:ria.ua.pt:10773/36519
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
Abstract (Portuguese version, translated): Social networks have become widely used all over the world, connecting people from different countries and creating global communities. Twitter, one of the most popular social networks, lets its users share short text segments of at most 280 characters. This sharing generates an enormous amount of data about its users, which can be analyzed from multiple perspectives. For example, the data can be used to extract information about the health of a segment of the population for public health surveillance. The goal of this work was to research and develop technical solutions for participating in the "Social Media Mining for Health Shared Task" (#SMM4H), a challenge made up of several natural language processing tasks related to the use of social media data for health research. The work involved developing transformer-based models and related techniques for participation in task 1 of this challenge, which is divided into three subtasks: 1a) classification of tweets according to the presence or absence of adverse drug events (ADE); 1b) entity recognition aimed at detecting ADE mentions; 1c) normalization aimed at associating ADE mentions with the corresponding MedDRA ("Medical Dictionary for Regulatory Activities") term. The best-performing approach for task 1a was a BERTweet large model trained with data generated through a data augmentation process. For task 1b, the best results were obtained with a RoBERTa large model and oversampled training data. For task 1c, a RoBERTa base model was trained with additional data from an external dataset. The approach used for the third task did not achieve relevant results (F1 of 0.12), while the systems developed for the first two achieved results on par with the best in the challenge (F1 of 0.69 and 0.66, respectively).
dc.title.none.fl_str_mv Identification and analysis of health states in twitter messages
author Morais, Edgar Guilherme Silva
author_facet Morais, Edgar Guilherme Silva
author_role author
dc.contributor.author.fl_str_mv Morais, Edgar Guilherme Silva
dc.subject.por.fl_str_mv Social media
Health information
Natural language processing
Machine learning
publishDate 2022
dc.date.none.fl_str_mv 2022-11-21T00:00:00Z
2022-11-21
2023-03-09T10:12:48Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10773/36519
url http://hdl.handle.net/10773/36519
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799137728611221504