Using named entity recognition for relevance detection in social network messages
Main author: | Filipe Daniel da Gama Batista |
---|---|
Publication date: | 2017 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://repositorio-aberto.up.pt/handle/10216/106160 |
Abstract: | The continuous growth of social networks in the past decade has led to massive amounts of information being generated on a daily basis. While much of this information is merely personal or simply irrelevant to a general audience, relevant news is increasingly transmitted through social networks, and detecting such news automatically has therefore become a field of interest and active research. The contribution of the present thesis consists in studying the importance of named entities in the task of relevance detection. With that in mind, the goal of this work was twofold: 1) to implement or find the best named entity recognition tools for social media texts, and 2) to analyze the importance of entities extracted from posts as features for relevance detection with machine learning. Well-known named entity recognition tools already exist; however, most state-of-the-art tools show a significant decrease in performance when tested on social media texts, in comparison to news media texts. This is mainly due to the informal character of social media texts: the absence of context, the lack of proper punctuation, wrong capitalization, the use of characters to represent emoticons, spelling errors and even the use of different languages in the same text. To address these problems, four different state-of-the-art toolkits (Stanford NLP, GATE with TwitIE, Twitter NLP tools and OpenNLP) were tested on social media datasets. In addition, we tried to understand how differently these toolkits predicted named entities, in terms of their precision and recall for three different entity types (Person, Location, Organization), and how they could complement each other in an ensemble of toolkits to achieve a combined performance superior to that of each individual toolkit. Following the extraction of entities using the developed ensemble, different features were generated based on these entities. These features included the number of persons, locations and organizations mentioned in a post and statistics retrieved from The Guardian's open API, and they were also combined with word embedding features. Multiple machine learning models were then trained on manually annotated datasets of tweets. The performance obtained with different combinations of selected features, ML algorithms, hyperparameters and datasets was analyzed. Our results showed that using an ensemble of toolkits can improve the recognition of specific entity types, depending on the criteria used for the voting, and even the overall average performance across the entity types Person, Location and Organization. The relevance analysis showed that named entities can indeed be useful for relevance detection, not only when used alone, achieving an AUC of up to 74%, but also when combined with other features such as word embeddings, achieving a maximum AUC of 94%, a 2.6% improvement over word embeddings alone. |
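
The abstract mentions an ensemble of toolkits combined by voting, and entity counts used as features. The following is only a minimal sketch of that idea, not the thesis implementation: the voting criterion (`min_votes`), the function names, the toolkit keys and the example predictions are all hypothetical and chosen for illustration.

```python
from collections import Counter

# Entity types named in the abstract; the label strings are an assumption.
ENTITY_TYPES = ("PERSON", "LOCATION", "ORGANIZATION")


def vote_entities(toolkit_outputs, min_votes=2):
    """Keep each (text, type) pair predicted by at least `min_votes` toolkits.

    `toolkit_outputs` maps a toolkit name to its list of (text, type) pairs.
    The threshold is a hypothetical voting criterion, not the one from the thesis.
    """
    votes = Counter()
    for predictions in toolkit_outputs.values():
        # set() ensures a toolkit votes at most once per entity.
        votes.update(set(predictions))
    return {entity for entity, n in votes.items() if n >= min_votes}


def count_features(entities):
    """Return [n_persons, n_locations, n_organizations] for one post."""
    counts = Counter(entity_type for _, entity_type in entities)
    return [counts[t] for t in ENTITY_TYPES]


# Hypothetical predictions of four toolkits for a single post.
toolkit_outputs = {
    "stanford": [("Obama", "PERSON"), ("Paris", "LOCATION")],
    "twitie": [("Obama", "PERSON"), ("Paris", "ORGANIZATION")],
    "twitter_nlp": [("Obama", "PERSON"), ("Paris", "LOCATION"), ("UN", "ORGANIZATION")],
    "opennlp": [("UN", "ORGANIZATION")],
}

entities = vote_entities(toolkit_outputs, min_votes=2)
features = count_features(entities)
```

Raising `min_votes` trades recall for precision, which is one way the voting criterion could affect individual entity types differently. The resulting count vector could then be concatenated with other features, such as word embeddings, before training a classifier.
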
id |
RCAP_9acd92ad0c7bc01394d87ad43e6dc024 |
---|---|
oai_identifier_str |
oai:repositorio-aberto.up.pt:10216/106160 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Using named entity recognition for relevance detection in social network messages |
title |
Using named entity recognition for relevance detection in social network messages |
spellingShingle |
Using named entity recognition for relevance detection in social network messages Filipe Daniel da Gama Batista Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
title_short |
Using named entity recognition for relevance detection in social network messages |
title_full |
Using named entity recognition for relevance detection in social network messages |
title_fullStr |
Using named entity recognition for relevance detection in social network messages |
title_full_unstemmed |
Using named entity recognition for relevance detection in social network messages |
title_sort |
Using named entity recognition for relevance detection in social network messages |
author |
Filipe Daniel da Gama Batista |
author_facet |
Filipe Daniel da Gama Batista |
author_role |
author |
dc.contributor.author.fl_str_mv |
Filipe Daniel da Gama Batista |
dc.subject.por.fl_str_mv |
Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
topic |
Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
publishDate |
2017 |
dc.date.none.fl_str_mv |
2017-07-11 2017-07-11T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio-aberto.up.pt/handle/10216/106160 TID:201804417 |
url |
https://repositorio-aberto.up.pt/handle/10216/106160 |
identifier_str_mv |
TID:201804417 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799135664668672000 |