Deep Test to Transformers Architecture in Named Entity Recognition

Bibliographic details
Main author: Antunes, Gonçalo André Santos
Publication date: 2022
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10362/159584
Abstract: Named Entity Recognition is a Natural Language Processing task that aims to extract and classify named entities such as "Queen of England". Depending on the objective of the extraction, the entities can be classified with different labels. These labels are usually Person, Organization, and Location, but they can be extended to include sub-entities such as cars, countries, etc., or very different ones, as when the scope of the classification is biological and the entities are Genes or Viruses. These entities are extracted from raw text, which may be a well-structured scientific document or an internet post, written in any language. These constraints make creating a domain-independent model a considerable challenge. Consequently, most authors have focused on English documents, since English is the most explored language and has the most labeled data, whose production demands a significant amount of human effort. More recently, approaches have focused on Transformer-architecture models, which may take days to train and consume millions of labeled entities. My approach is a statistical one, which means it is language-independent while still requiring substantial computational power. This model combines multiple techniques, such as Bag of Words, Stemming, and Word2Vec, to compute its features. It is then compared with two transformer-based models which, although similar in architecture, have significant differences. The three models are tested on multiple datasets, each with its own challenges, to study each model's strengths and weaknesses in depth. After a thorough evaluation process, the three models achieved performances of over 90% on datasets with a high number of samples. The biggest challenge was the datasets with less data, where the statistical pipeline achieved better performance than the transformer-based models.
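The abstract mentions combining Bag of Words and Stemming to compute features. As a minimal sketch of that kind of statistical feature extraction (not the thesis's actual implementation: the suffix rules and example sentence below are illustrative assumptions, and real work would use an established stemmer such as Porter's):

```python
from collections import Counter

def naive_stem(token: str) -> str:
    # Toy suffix-stripping stemmer, for illustration only.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def bag_of_words(sentence: str) -> Counter:
    # Lowercase, tokenize on whitespace, strip punctuation, stem, count.
    tokens = [naive_stem(t.strip(".,").lower()) for t in sentence.split()]
    return Counter(tokens)

features = bag_of_words("The Queen of England visited the organizations.")
print(features)
```

The resulting counts ("visited" and "organizations" collapse to "visit" and "organization") are the sort of language-agnostic surface features a statistical model can consume without English-specific labeled data.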
id RCAP_cbc8abc811ce33791ce2776b9c159023
oai_identifier_str oai:run.unl.pt:10362/159584
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
spelling Deep Test to Transformers Architecture in Named Entity Recognition
Natural Language Processing
Named Entity Recognition
Machine Learning
Statistical
Independent Domain
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
Named Entity Recognition is a Natural Language Processing task that aims to extract and classify named entities such as "Queen of England". Depending on the objective of the extraction, the entities can be classified with different labels. These labels are usually Person, Organization, and Location, but they can be extended to include sub-entities such as cars, countries, etc., or very different ones, as when the scope of the classification is biological and the entities are Genes or Viruses. These entities are extracted from raw text, which may be a well-structured scientific document or an internet post, written in any language. These constraints make creating a domain-independent model a considerable challenge. Consequently, most authors have focused on English documents, since English is the most explored language and has the most labeled data, whose production demands a significant amount of human effort. More recently, approaches have focused on Transformer-architecture models, which may take days to train and consume millions of labeled entities. My approach is a statistical one, which means it is language-independent while still requiring substantial computational power. This model combines multiple techniques, such as Bag of Words, Stemming, and Word2Vec, to compute its features. It is then compared with two transformer-based models which, although similar in architecture, have significant differences. The three models are tested on multiple datasets, each with its own challenges, to study each model's strengths and weaknesses in depth. After a thorough evaluation process, the three models achieved performances of over 90% on datasets with a high number of samples. The biggest challenge was the datasets with less data, where the statistical pipeline achieved better performance than the transformer-based models.
Silva, Joaquim
RUN
Antunes, Gonçalo André Santos
2023-11-06T15:12:29Z
2022-12
2022-12-01T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/159584
eng
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
2024-03-11T05:41:58Z
oai:run.unl.pt:10362/159584
Portal AgregadorONG
https://www.rcaap.pt/oai/openaire
opendoar:7160
2024-03-20T03:57:36.069281
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
false
dc.title.none.fl_str_mv Deep Test to Transformers Architecture in Named Entity Recognition
title Deep Test to Transformers Architecture in Named Entity Recognition
spellingShingle Deep Test to Transformers Architecture in Named Entity Recognition
Antunes, Gonçalo André Santos
Natural Language Processing
Named Entity Recognition
Machine Learning
Statistical
Independent Domain
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
title_short Deep Test to Transformers Architecture in Named Entity Recognition
title_full Deep Test to Transformers Architecture in Named Entity Recognition
title_fullStr Deep Test to Transformers Architecture in Named Entity Recognition
title_full_unstemmed Deep Test to Transformers Architecture in Named Entity Recognition
title_sort Deep Test to Transformers Architecture in Named Entity Recognition
author Antunes, Gonçalo André Santos
author_facet Antunes, Gonçalo André Santos
author_role author
dc.contributor.none.fl_str_mv Silva, Joaquim
RUN
dc.contributor.author.fl_str_mv Antunes, Gonçalo André Santos
dc.subject.por.fl_str_mv Natural Language Processing
Named Entity Recognition
Machine Learning
Statistical
Independent Domain
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
topic Natural Language Processing
Named Entity Recognition
Machine Learning
Statistical
Independent Domain
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
description Named Entity Recognition is a Natural Language Processing task that aims to extract and classify named entities such as "Queen of England". Depending on the objective of the extraction, the entities can be classified with different labels. These labels are usually Person, Organization, and Location, but they can be extended to include sub-entities such as cars, countries, etc., or very different ones, as when the scope of the classification is biological and the entities are Genes or Viruses. These entities are extracted from raw text, which may be a well-structured scientific document or an internet post, written in any language. These constraints make creating a domain-independent model a considerable challenge. Consequently, most authors have focused on English documents, since English is the most explored language and has the most labeled data, whose production demands a significant amount of human effort. More recently, approaches have focused on Transformer-architecture models, which may take days to train and consume millions of labeled entities. My approach is a statistical one, which means it is language-independent while still requiring substantial computational power. This model combines multiple techniques, such as Bag of Words, Stemming, and Word2Vec, to compute its features. It is then compared with two transformer-based models which, although similar in architecture, have significant differences. The three models are tested on multiple datasets, each with its own challenges, to study each model's strengths and weaknesses in depth. After a thorough evaluation process, the three models achieved performances of over 90% on datasets with a high number of samples. The biggest challenge was the datasets with less data, where the statistical pipeline achieved better performance than the transformer-based models.
publishDate 2022
dc.date.none.fl_str_mv 2022-12
2022-12-01T00:00:00Z
2023-11-06T15:12:29Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/159584
url http://hdl.handle.net/10362/159584
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138158155137024