Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization
Main author: | Menezes, Miguel |
---|---|
Publication date: | 2022 |
Other authors: | Cabarrão, Vera; Moniz, Helena; Mota, Pedro |
Document type: | Article |
Language: | por |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://doi.org/10.26334/2183-9077/rapln9ano2022a12 |
Abstract: | This article describes research carried out at Unbabel, a Portuguese machine translation start-up that combines Machine Translation (MT) with human post-editing, with a focus on customer service content. Drawing on work done in a real multilingual, AI-powered, human-refined MT setting, we aim to contribute to MT quality and good practices by showing the importance of robust, continuously developed Named Entity Recognition (NER) systems for General Data Protection Regulation (GDPR) compliance. We report three experiments, the result of a year of joint work between Unbabel's linguists and Unbabel's Artificial Intelligence (AI) engineering team. The first experiment focused on developing a methodology for identifying and annotating domain-specific Named Entities (NEs) for the food industry. The methodology supports the construction of gold standards for building domain-specific NER systems and can be applied to many other domains. By applying it, we identified the following set of domain-specific NEs: Restaurant Names; Restaurant Chains; Dishes; Beverage, Ingredients. The second and third experiments explored the semi-automatic construction of multilingual NER gold standards for different domains and language pairs, using aligners that project Named Entities across a parallel corpus. Both experiments made it possible to benchmark four open-source aligners (SimAlign; Fastalign; AwesomeAlign; Eflomal), identifying the best-performing one and, at the same time, validating the approach. This work should be read as a statement of multidisciplinarity, demonstrating and validating the much-needed articulation between the different scientific fields that make up Natural Language Processing (NLP). |
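
The abstract mentions aligners that project Named Entities across a parallel corpus. As a rough illustration only, and not the authors' implementation, the sketch below shows how source-side entity spans could be carried over to a target sentence from word-alignment pairs written as `i-j` (source index, target index), the Pharaoh-style format commonly emitted by word aligners. The function names, the `RESTAURANT_CHAIN` label, and the sentence pair are all hypothetical.

```python
from typing import Dict, List, Tuple

# (start_token, end_token_exclusive, label); the label names are hypothetical
Span = Tuple[int, int, str]


def parse_pharaoh(line: str) -> List[Tuple[int, int]]:
    """Parse 'i-j' pairs (source index, target index) from one alignment line."""
    pairs = []
    for item in line.split():
        src, tgt = item.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def project_entities(spans: List[Span],
                     alignment: List[Tuple[int, int]]) -> List[Span]:
    """Project source-side entity spans onto target tokens via word alignment."""
    src_to_tgt: Dict[int, List[int]] = {}
    for src, tgt in alignment:
        src_to_tgt.setdefault(src, []).append(tgt)

    projected = []
    for start, end, label in spans:
        # Collect every target token aligned to any token inside the span.
        targets = [t for s in range(start, end) for t in src_to_tgt.get(s, [])]
        if not targets:
            continue  # unaligned entity: left for manual review
        # Assumption: the projected entity is the smallest contiguous
        # target span covering all aligned tokens.
        projected.append((min(targets), max(targets) + 1, label))
    return projected


# Toy sentence pair invented for illustration (not from the article's corpus).
source = ["I", "ordered", "from", "Pizza", "Hut", "yesterday"]
target = ["Ontem", "encomendei", "da", "Pizza", "Hut"]
alignment = parse_pharaoh("0-1 1-1 2-2 3-3 4-4 5-0")
spans = [(3, 5, "RESTAURANT_CHAIN")]
result = project_entities(spans, alignment)
projected_tokens = [target[s:e] for s, e, _ in result]
```

In a semi-automatic workflow of the kind the abstract describes, projected spans like these would still need review by annotators, since alignment errors propagate directly into the projected labels.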
id |
RCAP_ff0f73daee1a16ea565889d9c54dfacd |
---|---|
oai_identifier_str |
oai:ojs3.ojs.apl.pt:article/146 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization Reconhecimento Automático Multilingue de Entidades Mencionadas em Diversos Domínios, para Efeitos de Anonimização de Tradução Automática |
title |
Automatic Multilingual Recognition of Named Entities in Various Domains, for the Purposes of Machine Translation Anonymization |
author |
Menezes, Miguel |
author_facet |
Menezes, Miguel; Cabarrão, Vera; Moniz, Helena; Mota, Pedro |
author_role |
author |
author2 |
Cabarrão, Vera; Moniz, Helena; Mota, Pedro |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Menezes, Miguel; Cabarrão, Vera; Moniz, Helena; Mota, Pedro |
dc.subject.por.fl_str_mv |
Tradução Automática; Entidades Mencionadas; Anotação; Sistemas de Alinhamento; Machine-Translation; Named Entities; Annotation; Gold Standards; Aligners |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-10-25 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://doi.org/10.26334/2183-9077/rapln9ano2022a12 |
url |
https://doi.org/10.26334/2183-9077/rapln9ano2022a12 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.none.fl_str_mv |
https://ojs.apl.pt/index.php/rapl/article/view/146; https://ojs.apl.pt/index.php/rapl/article/view/146/149 |
dc.rights.driver.fl_str_mv |
Copyright (c) 2022 Miguel Menezes, Vera Cabarrão, Helena Moniz, Pedro Mota; info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Copyright (c) 2022 Miguel Menezes, Vera Cabarrão, Helena Moniz, Pedro Mota |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Associação Portuguesa de Linguística |
publisher.none.fl_str_mv |
Associação Portuguesa de Linguística |
dc.source.none.fl_str_mv |
Revista da Associação Portuguesa de Linguística; No. 9 (2022): Journal of the Portuguese Linguistics Association; 169-185 Revista da Associação Portuguesa de Linguística; N.º 9 (2022): Revista da Associação Portuguesa de Linguística; 169-185 2183-9077 10.26334/2183-9077/rapln9ano2022 reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
_version_ |
1799133623832543232 |