Using named entity recognition for relevance detection in social network messages

Filipe Daniel da Gama Batista

Bibliographic details
Main author:	Filipe Daniel da Gama Batista
Publication date:	2017
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://repositorio-aberto.up.pt/handle/10216/106160
Abstract:	The continuous growth of social networks in the past decade has led to massive amounts of information being generated on a daily basis. While much of this information is merely personal or simply irrelevant to a general audience, relevant news is increasingly transmitted through social networks, and detecting such news automatically has therefore become a field of interest and active research.

The contribution of this thesis is a study of the importance of named entities in the task of relevance detection. With that in mind, the goal of this work was twofold: 1) to implement or find the best named entity recognition tools for social media texts, and 2) to analyze the importance of entities extracted from posts as features for relevance detection with machine learning. Well-known named entity recognition tools already exist; however, most state-of-the-art tools show a significant decrease in performance when tested on social media texts, in comparison to news media texts. This is mainly due to the informal character of social media texts: the absence of context, the lack of proper punctuation, wrong capitalization, the use of characters to represent emoticons, spelling errors, and even the use of different languages in the same text. To address these problems, four different state-of-the-art toolkits (Stanford NLP, GATE with TwitIE, Twitter NLP tools, and OpenNLP) were tested on social media datasets. In addition, we tried to understand how differently these toolkits predicted named entities, in terms of their precision and recall for three different entity types (Person, Location, Organization), and how they could complement each other in order to achieve a combined performance superior to that of each individual toolkit, creating an ensemble of toolkits.

Following the extraction of entities using the developed ensemble, different features were generated based on these entities. These features included the number of persons, locations, and organizations mentioned in a post and statistics retrieved from The Guardian's open API; they were also combined with word embedding features. Multiple machine learning models were then trained on a manually annotated dataset of tweets. The performance obtained with different combinations of selected features, ML algorithms, hyperparameters, and datasets was analyzed. Our results showed that using an ensemble of toolkits can improve the recognition of specific entity types, depending on the criteria used for the voting, and even the overall average performance across the entity types Person, Location, and Organization. The relevance analysis showed that named entities can indeed be useful for relevance detection, both when used alone, achieving an AUC of up to 74%, and when combined with other features such as word embeddings, achieving a maximum AUC of 94%, a 2.6% improvement over word embeddings alone.
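
As an illustration of the ensemble voting and entity-count features described in the abstract, the following Python sketch may help. It is not the thesis implementation: the toolkit keys, the `vote` and `count_features` helpers, the `min_votes` threshold, and the sample predictions are all hypothetical, and counting distinct entities rather than individual mentions is an assumption made for brevity.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

# An entity is a (surface text, entity type) pair. Hypothetical representation.
Entity = Tuple[str, str]

ENTITY_TYPES = ("PERSON", "LOCATION", "ORGANIZATION")


def vote(predictions: Dict[str, Set[Entity]], min_votes: int = 2) -> Set[Entity]:
    """Keep an entity if at least `min_votes` toolkits predicted it.

    This is one possible voting criterion; the abstract notes that results
    depend on the criteria chosen.
    """
    counts = Counter(
        entity for entities in predictions.values() for entity in entities
    )
    return {entity for entity, n in counts.items() if n >= min_votes}


def count_features(entities: Set[Entity]) -> List[int]:
    """Return [persons, locations, organizations] counts for one post."""
    by_type = Counter(entity_type for _, entity_type in entities)
    return [by_type[t] for t in ENTITY_TYPES]


# Made-up predictions for a single post, keyed by hypothetical toolkit names.
predictions = {
    "stanford": {("Obama", "PERSON"), ("Lisbon", "LOCATION")},
    "twitie": {("Obama", "PERSON"), ("Lisbon", "LOCATION"), ("UN", "ORGANIZATION")},
    "twitter_nlp": {("Obama", "PERSON")},
    "opennlp": {("Lisbon", "ORGANIZATION")},
}

agreed = vote(predictions, min_votes=2)
features = count_features(agreed)
print(sorted(agreed))
print(features)
```

Raising `min_votes` would be expected to favor precision and lowering it to favor recall; a feature vector like `features` could then be concatenated with other features, such as word embeddings, before training a classifier.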

Item metadata

id	RCAP_9acd92ad0c7bc01394d87ad43e6dc024
oai_identifier_str	oai:repositorio-aberto.up.pt:10216/106160
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Using named entity recognition for relevance detection in social network messages
author	Filipe Daniel da Gama Batista
author_facet	Filipe Daniel da Gama Batista
author_role	author
dc.contributor.author.fl_str_mv	Filipe Daniel da Gama Batista
dc.subject.por.fl_str_mv	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
topic	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
publishDate	2017
dc.date.none.fl_str_mv	2017-07-11 2017-07-11T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://repositorio-aberto.up.pt/handle/10216/106160 TID:201804417
url	https://repositorio-aberto.up.pt/handle/10216/106160
identifier_str_mv	TID:201804417
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799135664668672000
