Impact of the quality of source texts created by users and agents and the propagation of errors in Machine Translation systems
Main author: | Gonçalves, Madalena |
---|---|
Publication date: | 2022 |
Other authors: | Buchicchio, Marianna; Moniz, Helena |
Document type: | Article |
Language: | por |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://doi.org/10.26334/2183-9077/rapln9ano2022a10 |
Abstract: | This paper proposes a typology of errors and linguistic structures found in the source text that have an impact on Machine Translation (MT). The main objectives of this project were to: compare existing error typologies and analyze their suitability; analyze annotated data and build a data-driven typology by adapting the existing typologies; distinguish between the errors produced by users and by agents in the online Customer Support domain; test the proposed typology with three case studies; systematize patterns in the errors found and verify their impact on MT systems; and, finally, create a production-ready typology for this particular field. First, we compared different typologies, whether they operate at a bilingual or monolingual level (e.g. the Unbabel Error Typology, the MQM Typology (Lommel et al., 2014b) and the SCATE MT Error Taxonomy (Tezcan et al., 2017)). This comparison allowed us to verify the differences and similarities between them, as well as which issue types had been used previously. In order to build a data-driven typology, both sides of Customer Support, user and agent, were analyzed, as they present different writing structures and are influenced by different factors. The results of that analysis were assessed through an annotation process with a bilingual error typology and scored with one of the most widely used manual metrics in translation quality evaluation: Multidimensional Quality Metrics (MQM), proposed in the QTLaunchPad project (2014), funded by the European Union. Through this analysis, it was then possible to build a data-driven typology, the Source Typology. To aid future annotators, we provide guidelines for the annotation process and elaborate on the new additions to the typology. To confirm the reliability of this typology, three case studies were conducted in an internal pilot, with a total of 26,855 words, 2,802 errors and 239 linguistic structures annotated, with different purposes and covering several language pairs. The linguistic structures are represented by the ‘Neutral’ severity and are associated with conversational markers, segmentation, emoticons, etc., which are characteristic of oral speech. In these studies, we verified the effectiveness of the new additions, as well as the transfer of source-text errors to the target text. We also analyzed whether the linguistic structures annotated with the ‘Neutral’ severity in fact had any impact on the MT systems. This testing allowed us to confirm the effectiveness and reliability of the Source Typology and to identify what needs improvement. |
id |
RCAP_c09f2fe1f5b35143040e2f2be4a4eb83 |
---|---|
oai_identifier_str |
oai:ojs3.ojs.apl.pt:article/143 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Impact of the quality of source texts created by users and agents and the propagation of errors in Machine Translation systems / Impacto da qualidade de textos de partida criados por utilizadores e agentes e a propagação de erros em sistemas de Tradução Automática |
author |
Gonçalves, Madalena |
author_facet |
Gonçalves, Madalena; Buchicchio, Marianna; Moniz, Helena |
author_role |
author |
author2 |
Buchicchio, Marianna; Moniz, Helena |
author2_role |
author; author |
dc.contributor.author.fl_str_mv |
Gonçalves, Madalena; Buchicchio, Marianna; Moniz, Helena |
dc.subject.por.fl_str_mv |
texto de partida; anotação de erros; tradução automática; apoio ao cliente; source text; error annotation; Machine Translation; Customer Support |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-10-25 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://doi.org/10.26334/2183-9077/rapln9ano2022a10 |
url |
https://doi.org/10.26334/2183-9077/rapln9ano2022a10 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.none.fl_str_mv |
https://ojs.apl.pt/index.php/rapl/article/view/143 https://ojs.apl.pt/index.php/rapl/article/view/143/138 |
dc.rights.driver.fl_str_mv |
Copyright (c) 2022 Madalena Gonçalves, Marianna Buchicchio, Helena Moniz; info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Copyright (c) 2022 Madalena Gonçalves, Marianna Buchicchio, Helena Moniz |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Associação Portuguesa de Linguística |
publisher.none.fl_str_mv |
Associação Portuguesa de Linguística |
dc.source.none.fl_str_mv |
Revista da Associação Portuguesa de Linguística; No. 9 (2022): Journal of the Portuguese Linguistics Association; 133-149; 2183-9077; 10.26334/2183-9077/rapln9ano2022; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
_version_ |
1799133623827300352 |