Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization

Bibliographic details
Main author: Menezes, Miguel
Publication date: 2022
Other authors: Cabarrão, Vera; Moniz, Helena; Mota, Pedro
Document type: Article
Language: por
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: https://doi.org/10.26334/2183-9077/rapln9ano2022a12
Abstract: This article describes research carried out at Unbabel, a Portuguese Machine Translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Drawing on work carried out in a real multilingual, AI-powered, human-refined MT industry setting, we aim to contribute to MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work between Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for identifying and annotating domain-specific Named Entities (NEs) for the food industry. The methodology supports the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. Applying it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverage; Ingredients. The second and third experiments explored the semi-automatic construction of multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), identify the best-performing one and, at the same time, validate this approach. This work should be read as a statement of multidisciplinarity, demonstrating the much-needed articulation between the different scientific fields that make up the area of Natural Language Processing (NLP).
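The projection step mentioned in the abstract (carrying Named Entity annotations from one side of a parallel corpus to the other through word alignments) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes alignments in the common "i-j" (Pharaoh-style) text format, with source index first. The function names, the entity labels, and the hand-written alignment for the toy sentence pair are all hypothetical, and no aligner is actually called.

```python
from typing import Dict, List, Set, Tuple

# A span is (start_token, end_token_exclusive, label); labels here are hypothetical.
Span = Tuple[int, int, str]


def parse_pharaoh(line: str) -> Dict[int, Set[int]]:
    """Parse 'i-j' pairs (source index, target index) into a source -> targets map."""
    links: Dict[int, Set[int]] = {}
    for pair in line.split():
        src, tgt = pair.split("-")
        links.setdefault(int(src), set()).add(int(tgt))
    return links


def project_spans(spans: List[Span], links: Dict[int, Set[int]]) -> List[Span]:
    """Map each source span to the smallest target span covering its aligned tokens."""
    projected: List[Span] = []
    for start, end, label in spans:
        targets: Set[int] = set()
        for i in range(start, end):
            targets |= links.get(i, set())
        if targets:  # entities with no aligned target token are skipped
            projected.append((min(targets), max(targets) + 1, label))
    return projected


# Toy example: English source, Portuguese target, alignment written by hand.
source = ["red", "wine", "from", "Casa", "Verde"]
target = ["vinho", "tinto", "da", "Casa", "Verde"]
alignment = "0-1 1-0 2-2 3-3 4-4"
source_spans = [(0, 2, "BEVERAGE"), (3, 5, "RESTAURANT_NAME")]

projected = project_spans(source_spans, parse_pharaoh(alignment))
print(projected)  # [(0, 2, 'BEVERAGE'), (3, 5, 'RESTAURANT_NAME')]
for start, end, label in projected:
    print(label, " ".join(target[start:end]))
```

Taking the smallest covering span is only one possible design choice. It tolerates word reordering inside an entity, but a single wrong alignment link can stretch the projected span, which is one reason alignment quality matters when building gold standards this way.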
id RCAP_ff0f73daee1a16ea565889d9c54dfacd
oai_identifier_str oai:ojs3.ojs.apl.pt:article/146
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization
Reconhecimento Automático Multilingue de Entidades Mencionadas em Diversos Domínios, para Efeitos de Anonimização de Tradução Automática
title Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization
author Menezes, Miguel
author_facet Menezes, Miguel
Cabarrão, Vera
Moniz, Helena
Mota, Pedro
author_role author
author2 Cabarrão, Vera
Moniz, Helena
Mota, Pedro
author2_role author
author
author
dc.contributor.author.fl_str_mv Menezes, Miguel
Cabarrão, Vera
Moniz, Helena
Mota, Pedro
dc.subject.por.fl_str_mv Tradução Automática
Entidades Mencionadas
Anotação
Sistemas de Alinhamento
Machine-Translation
Named Entities
Annotation
Gold Standards
Aligners
topic Tradução Automática
Entidades Mencionadas
Anotação
Sistemas de Alinhamento
Machine-Translation
Named Entities
Annotation
Gold Standards
Aligners
publishDate 2022
dc.date.none.fl_str_mv 2022-10-25
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.26334/2183-9077/rapln9ano2022a12
https://doi.org/10.26334/2183-9077/rapln9ano2022a12
url https://doi.org/10.26334/2183-9077/rapln9ano2022a12
dc.language.iso.fl_str_mv por
language por
dc.relation.none.fl_str_mv https://ojs.apl.pt/index.php/rapl/article/view/146
https://ojs.apl.pt/index.php/rapl/article/view/146/149
dc.rights.driver.fl_str_mv Copyright (c) 2022 Miguel Menezes, Vera Cabarrão, Helena Moniz, Pedro Mota
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Copyright (c) 2022 Miguel Menezes, Vera Cabarrão, Helena Moniz, Pedro Mota
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Associação Portuguesa de Linguística
publisher.none.fl_str_mv Associação Portuguesa de Linguística
dc.source.none.fl_str_mv Revista da Associação Portuguesa de Linguística; No. 9 (2022): Journal of the Portuguese Linguistics Association; 169-185
Revista da Associação Portuguesa de Linguística; N.º 9 (2022): Revista da Associação Portuguesa de Linguística; 169-185
2183-9077
10.26334/2183-9077/rapln9ano2022
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799133623832543232