Embeddings for Named Entity Recognition in Geoscience Portuguese Literature
Main author: | Consoli, Bernardo |
---|---|
Publication date: | 2020 |
Other authors: | Santos, Joaquim; Gomes, Diogo; Cordeiro, Fabio; Vieira, Renata; Moreira, Viviane |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10174/29161 |
Abstract: | This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for Named Entity Recognition is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF) neural networks, a widely used type of network for this area of research, that use vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) under two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, but we fine-tuned it with domain-specific Oil and Gas corpora, as there were not enough domain corpora to properly train such a model. We evaluated each of these embeddings separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings. |
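
The stacking mentioned in the abstract can be illustrated with a minimal sketch. It assumes that stacking means concatenating, token by token, the vectors produced by each embedding model, which is how the Flair library's `StackedEmbeddings` class combines embeddings; the abstract itself does not spell out the mechanism. The toy vectors, the example phrase "Bacia do Paraná", and all function names below are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]
# An embedder maps a whole tokenized sentence to one vector per token.
Embedder = Callable[[List[str]], List[Vector]]

# Hypothetical 2-dimensional static word vectors; real models are far larger.
TOY_WORD_VECTORS: Dict[str, Vector] = {
    "bacia": [0.1, 0.9],
    "do": [0.5, 0.5],
    "paraná": [0.8, 0.2],
}


def word_embedder(tokens: List[str]) -> List[Vector]:
    """Static lookup: the same token always gets the same vector."""
    return [TOY_WORD_VECTORS.get(t.lower(), [0.0, 0.0]) for t in tokens]


def toy_contextual_embedder(tokens: List[str]) -> List[Vector]:
    """Toy stand-in for a contextual model: each vector depends on neighbours."""
    vectors = []
    for i, token in enumerate(tokens):
        left = len(tokens[i - 1]) if i > 0 else 0
        right = len(tokens[i + 1]) if i < len(tokens) - 1 else 0
        vectors.append([float(left), float(len(token)), float(right)])
    return vectors


def stack(tokens: List[str], embedders: List[Embedder]) -> List[Vector]:
    """Concatenate, for every token, the vectors from all embedders."""
    per_model = [embed(tokens) for embed in embedders]
    stacked = []
    for token_vectors in zip(*per_model):
        combined: Vector = []
        for vector in token_vectors:
            combined.extend(vector)
        stacked.append(combined)
    return stacked


sentence = ["Bacia", "do", "Paraná"]
stacked = stack(sentence, [word_embedder, toy_contextual_embedder])
for token, vector in zip(sentence, stacked):
    print(token, vector)
```

Each token ends up with a 5-dimensional vector (2 static dimensions followed by 3 contextual ones). In a BiLSTM-CRF tagger, vectors of this kind would be the per-token input to the recurrent layer.
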
id |
RCAP_b5dcb7d876d774a7ba7c7a2c2e2a767b |
---|---|
oai_identifier_str |
oai:dspace.uevora.pt:10174/29161 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature; Language models; Named entities; This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for Named Entity Recognition is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF) neural networks, a widely used type of network for this area of research, that use vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) under two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, but we fine-tuned it with domain-specific Oil and Gas corpora, as there were not enough domain corpora to properly train such a model. We evaluated each of these embeddings separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings.; UIDB/00057/2020, CEECIND/01997/2017; LREC; 2021-02-18T14:34:55Z; 2021-02-18; 2020-05-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; http://hdl.handle.net/10174/29161; http://hdl.handle.net/10174/29161; eng; CONSOLI, Bernardo, et al. Embeddings for Named Entity Recognition in Geoscience Portuguese Literature. In: Proceedings of The 12th Language Resources and Evaluation Conference. 2020. p. 4625-4630.; https://www.aclweb.org/anthology/2020.lrec-1.568/; nd; nd; nd; nd; renatav@uevora.pt; nd; 299; Consoli, Bernardo; Santos, Joaquim; Gomes, Diogo; Cordeiro, Fabio; Vieira, Renata; Moreira, Viviane; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-01-03T19:25:37Z; oai:dspace.uevora.pt:10174/29161; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T01:18:43.002269; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
title |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
spellingShingle |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature Consoli, Bernardo Language models Named entities |
title_short |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
title_full |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
title_fullStr |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
title_full_unstemmed |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
title_sort |
Embeddings for Named Entity Recognition in Geoscience Portuguese Literature |
author |
Consoli, Bernardo |
author_facet |
Consoli, Bernardo Santos, Joaquim Gomes, Diogo Cordeiro, Fabio Vieira, Renata Moreira, Viviane |
author_role |
author |
author2 |
Santos, Joaquim Gomes, Diogo Cordeiro, Fabio Vieira, Renata Moreira, Viviane |
author2_role |
author author author author author |
dc.contributor.author.fl_str_mv |
Consoli, Bernardo Santos, Joaquim Gomes, Diogo Cordeiro, Fabio Vieira, Renata Moreira, Viviane |
dc.subject.por.fl_str_mv |
Language models Named entities |
topic |
Language models Named entities |
description |
This work focuses on Portuguese Named Entity Recognition (NER) in the Geology domain. The only domain-specific dataset in the Portuguese language annotated for Named Entity Recognition is the GeoCorpus. Our approach relies on Bidirectional Long Short-Term Memory - Conditional Random Fields (BiLSTM-CRF) neural networks, a widely used type of network for this area of research, that use vector and tensor embedding representations. We used three types of embedding models (Word Embeddings, Flair Embeddings, and Stacked Embeddings) under two versions (domain-specific and generalized). We originally trained the domain-specific Flair Embeddings model with a generalized context in mind, but we fine-tuned it with domain-specific Oil and Gas corpora, as there were not enough domain corpora to properly train such a model. We evaluated each of these embeddings separately, as well as stacked with another embedding. Finally, we achieved state-of-the-art results for this domain with one of our embeddings, and we performed an error analysis on the language model that achieved the best results. Furthermore, we investigated the effects of domain-specific versus generalized embeddings. |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-05-01T00:00:00Z 2021-02-18T14:34:55Z 2021-02-18 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10174/29161 http://hdl.handle.net/10174/29161 |
url |
http://hdl.handle.net/10174/29161 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
CONSOLI, Bernardo, et al. Embeddings for Named Entity Recognition in Geoscience Portuguese Literature. In: Proceedings of The 12th Language Resources and Evaluation Conference. 2020. p. 4625-4630. https://www.aclweb.org/anthology/2020.lrec-1.568/ nd nd nd nd renatav@uevora.pt nd 299 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
LREC |
publisher.none.fl_str_mv |
LREC |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799136669034610688 |