Impact of the quality of source texts created by users and agents and the propagation of errors in Machine Translation systems

Gonçalves, Madalena; Buchicchio, Marianna; Moniz, Helena

Bibliographic details
Main author:	Gonçalves, Madalena
Publication date:	2022
Other authors:	Buchicchio, Marianna; Moniz, Helena
Document type:	Article
Language:	por
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://doi.org/10.26334/2183-9077/rapln9ano2022a10
Abstract:	This paper proposes a typology of errors and linguistic structures found in the source text that have an impact on Machine Translation (MT). The main objectives of this project were: first, to compare existing error typologies and analyze their suitability; to analyze annotated data and build a data-driven typology while adapting the existing typologies; to distinguish between the errors produced by users and by agents in the online Customer Support domain; to test the proposed typology with three case studies; to systematize patterns in the errors found and verify their impact on MT systems; and finally, to create a production-ready typology for this particular field. First, we compared different typologies, whether bilingual or monolingual (e.g. the Unbabel Error Typology, the MQM Typology (Lommel et al., 2014b) and the SCATE MT Error Taxonomy (Tezcan et al., 2017)). This comparison allowed us to verify the differences and similarities between them and to identify which issue types had been used previously. In order to build a data-driven typology, both sides of Customer Support, user and agent, were analyzed, as they present different writing structures and are influenced by different factors. The results of that analysis were assessed through an annotation process with a bilingual error typology and were calculated with one of the most widely used manual metrics in translation quality evaluation, Multidimensional Quality Metrics (MQM), proposed in the QTLaunchPad project (2014), funded by the European Union. Through this analysis, it was then possible to build a data-driven typology, the Source Typology. To aid future annotators, we provide guidelines for the annotation process and elaborate on the new additions to the typology.
To confirm the reliability of this typology, three case studies were conducted in an internal pilot, with different purposes and covering several language pairs. In total, 26,855 words, 2,802 errors and 239 linguistic structures were annotated; the linguistic structures carry the ‘Neutral’ severity and are associated with conversational markers, segmentation, emoticons, etc., which are characteristic of oral speech. In these studies, we verified the effectiveness of the new additions, as well as the transfer of source text errors to the target text. We also analyzed whether the linguistic structures annotated with the ‘Neutral’ severity in fact had any impact on the MT systems. This testing allowed us to confirm the effectiveness and reliability of the Source Typology and to identify what needs improvement.

Item metadata

id	RCAP_c09f2fe1f5b35143040e2f2be4a4eb83
oai_identifier_str	oai:ojs3.ojs.apl.pt:article/143
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Impact of the quality of source texts created by users and agents and the propagation of errors in Machine Translation systems Impacto da qualidade de textos de partida criados por utilizadores e agentes e a propagação de erros em sistemas de Tradução Automática
author	Gonçalves, Madalena
author_facet	Gonçalves, Madalena; Buchicchio, Marianna; Moniz, Helena
author_role	author
author2	Buchicchio, Marianna; Moniz, Helena
author2_role	author; author
dc.contributor.author.fl_str_mv	Gonçalves, Madalena; Buchicchio, Marianna; Moniz, Helena
dc.subject.por.fl_str_mv	texto de partida; anotação de erros; tradução automática; apoio ao cliente; source text; error annotation; Machine Translation; Customer Support
topic	texto de partida; anotação de erros; tradução automática; apoio ao cliente; source text; error annotation; Machine Translation; Customer Support
publishDate	2022
dc.date.none.fl_str_mv	2022-10-25
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://doi.org/10.26334/2183-9077/rapln9ano2022a10 https://doi.org/10.26334/2183-9077/rapln9ano2022a10
url	https://doi.org/10.26334/2183-9077/rapln9ano2022a10
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	https://ojs.apl.pt/index.php/rapl/article/view/143 https://ojs.apl.pt/index.php/rapl/article/view/143/138
dc.rights.driver.fl_str_mv	Copyright (c) 2022 Madalena Gonçalves, Marianna Buchicchio, Helena Moniz info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Copyright (c) 2022 Madalena Gonçalves, Marianna Buchicchio, Helena Moniz
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Associação Portuguesa de Linguística
publisher.none.fl_str_mv	Associação Portuguesa de Linguística
dc.source.none.fl_str_mv	Revista da Associação Portuguesa de Linguística; No. 9 (2022): Journal of the Portuguese Linguistics Association; 133-149 Revista da Associação Portuguesa de Linguística; N.º 9 (2022): Revista da Associação Portuguesa de Linguística; 133-149 2183-9077 10.26334/2183-9077/rapln9ano2022 reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799133623827300352
