Named entity extraction from Portuguese web text

André Ricardo Oliveira Pires

Bibliographic details
Main author:	André Ricardo Oliveira Pires
Publication date:	2017
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	https://hdl.handle.net/10216/106094
Abstract:	In the context of Natural Language Processing, the Named Entity Recognition (NER) task focuses on extracting and classifying named entities from free text, such as news, which usually has a particular phrasing structure. Entity detection supports more complex tasks, such as Relation Extraction or Entity-Oriented Search, for instance the ANT search engine. There are some NER tools focused on the Portuguese language, such as Palavras or NERP-CRF, but their F-measure is still below that obtained by other available tools, for instance models trained on an annotated English corpus with Stanford CoreNLP or OpenNLP. ANT is an entity-oriented search engine for the University of Porto (UP). This search system indexes the information available in SIGARRA, the information system of the UP. Currently it uses handcrafted selectors to extract entities, based on XPath or CSS, which depend on the structure of the page. Furthermore, it does not work on free text, especially on SIGARRA's news. Using a machine learning method allows the extraction task to be automated, making it scalable and structure-independent and reducing the effort and time required. In this dissertation, I evaluate existing NER tools in order to select the best approach and configuration for the Portuguese language, particularly in the domain of SIGARRA's news. The evaluation was based on two datasets, the HAREM collection and a manually annotated subset of SIGARRA's news, which are used to assess the tools' performance using precision, recall and F-measure. Expanding the existing knowledge base will help index SIGARRA pages by providing a richer entity-oriented search experience with new information, as well as a better ranking scheme based on the additional context made available to the search engine. The scientific community also benefits from this work, with several detailed manuals resulting from the systematic analysis of available tools, in particular for the Portuguese language. First, I carried out an out-of-the-box performance analysis of selected tools (Stanford CoreNLP, OpenNLP, spaCy and NLTK) with the HAREM dataset, obtaining the best results for Stanford CoreNLP, followed by OpenNLP. Then, I performed a hyperparameter study in order to select the best configuration for each tool, achieving better-than-default results for each tool, particularly for NLTK's Maximum Entropy classifier, whose F-measure increased from 1.11% to 35.24%. Finally, using the best configuration, I repeated the training process with the SIGARRA News Corpus, achieving F-measures as high as 86.64% for Stanford CoreNLP. Given that this was also the out-of-the-box winner, I conclude that Stanford CoreNLP is the best option for this particular context.

Item metadata

id	RCAP_c8537e20a6b4e48292abdc17614690d8
oai_identifier_str	oai:repositorio-aberto.up.pt:10216/106094
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
dc.title.none.fl_str_mv	Named entity extraction from Portuguese web text
author	André Ricardo Oliveira Pires
author_facet	André Ricardo Oliveira Pires
author_role	author
dc.contributor.author.fl_str_mv	André Ricardo Oliveira Pires
dc.subject.por.fl_str_mv	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
topic	Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering
publishDate	2017
dc.date.none.fl_str_mv	2017-07-07 2017-07-07T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/10216/106094 TID:201795310
url	https://hdl.handle.net/10216/106094
identifier_str_mv	TID:201795310
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_	1799135863587733504
