Embeddings for Named Entity Recognition in Geoscience Portuguese Literature

Bibliographic details
Main author: Consoli, Bernardo
Publication date: 2020
Other authors: Santos, Joaquim, Gomes, Diogo, Cordeiro, Fabio, Vieira, Renata, Moreira, Viviane
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10174/29161
Abstract: This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for Named Entity Recognition is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF) neural networks, a widely used type of network for this area of research, that use vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) under two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, but we fine-tuned it with domain-specific Oil and Gas corpora, as there simply were not enough domain corpora to properly train such a model. We evaluated each of these embeddings separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings.
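The abstract mentions evaluating embeddings both separately and stacked. The sketch below illustrates the idea only, assuming that "stacking" means concatenating, token by token, the vectors produced by each embedding model (this is how stacked embeddings behave in the Flair library). It is not the authors' implementation: the helper names `make_lookup_embedder` and `stack`, the two toy lookup tables, and all vector values are hypothetical, and the BiLSTM-CRF tagger that would consume the vectors is not shown.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[List[str]], List[Vector]]


def make_lookup_embedder(table: Dict[str, Vector], dim: int) -> Embedder:
    """Static word-embedding lookup; unknown tokens map to a zero vector."""
    def embed(tokens: List[str]) -> List[Vector]:
        return [list(table.get(tok.lower(), [0.0] * dim)) for tok in tokens]
    return embed


def stack(*embedders: Embedder) -> Embedder:
    """Stacked embedding: concatenate each embedder's vector token by token."""
    def embed(tokens: List[str]) -> List[Vector]:
        per_model = [e(tokens) for e in embedders]
        return [
            sum((vecs[i] for vecs in per_model), [])
            for i in range(len(tokens))
        ]
    return embed


# Toy 2-d "generalized" and 3-d "domain-specific" tables (made-up values).
general = make_lookup_embedder(
    {"bacia": [0.1, 0.2], "de": [0.0, 0.5]}, dim=2
)
domain = make_lookup_embedder(
    {"bacia": [0.9, 0.1, 0.3], "campos": [0.7, 0.8, 0.2]}, dim=3
)

stacked = stack(general, domain)
vectors = stacked(["Bacia", "de", "Campos"])  # one 5-d vector per token
```

Each token ends up with a 2 + 3 = 5 dimensional vector. A token missing from one table still receives that table's zero vector, so the stacked dimensionality stays fixed for the downstream tagger.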
id RCAP_b5dcb7d876d774a7ba7c7a2c2e2a767b
oai_identifier_str oai:dspace.uevora.pt:10174/29161
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
dc.title.none.fl_str_mv Embeddings for Named Entity Recognition in Geoscience Portuguese Literature
title Embeddings for Named Entity Recognition in Geoscience Portuguese Literature
author Consoli, Bernardo
author_facet Consoli, Bernardo
Santos, Joaquim
Gomes, Diogo
Cordeiro, Fabio
Vieira, Renata
Moreira, Viviane
author_role author
author2 Santos, Joaquim
Gomes, Diogo
Cordeiro, Fabio
Vieira, Renata
Moreira, Viviane
author2_role author
author
author
author
author
dc.contributor.author.fl_str_mv Consoli, Bernardo
Santos, Joaquim
Gomes, Diogo
Cordeiro, Fabio
Vieira, Renata
Moreira, Viviane
dc.subject.por.fl_str_mv Language models
Named entities
topic Language models
Named entities
publishDate 2020
dc.date.none.fl_str_mv 2020-05-01T00:00:00Z
2021-02-18T14:34:55Z
2021-02-18
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10174/29161
http://hdl.handle.net/10174/29161
url http://hdl.handle.net/10174/29161
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv CONSOLI, Bernardo, et al. Embeddings for Named Entity Recognition in Geoscience Portuguese Literature. In: Proceedings of The 12th Language Resources and Evaluation Conference. 2020. p. 4625-4630.
https://www.aclweb.org/anthology/2020.lrec-1.568/
renatav@uevora.pt
299
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv LREC
publisher.none.fl_str_mv LREC
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799136669034610688