Traduction d'unités polylexicales du portugais en français par MT@EC et eTranslation

Bibliographic details
Main author: Bacquelaine, Françoise
Publication date: 2022
Document type: Article
Language: fra
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: https://hdl.handle.net/10216/147338
Abstract: This paper aims to determine the extent to which the shift from statistical machine translation (SMT) to neural machine translation (NMT) improved the performance of European Union machine translation systems between 2015 and 2021 in terms of multi-word unit translation and domain coverage. To do so, we chose to test these systems on machine translation into French of multi-word units expressing quantitative and qualitative progression in Portuguese from Portugal. These units consist of the 2-gram 'cada vez' and a comparative adjective or adverb (cada vez COMP), and their word-for-word translation into French is not idiomatic (*chaque fois COMP). The most frequent translation into French is 'de COMP en COMP'. This implies that these multi-word units must be translated 'en bloc', but their identification is not straightforward. On the one hand, COMP is not fixed and may include one word (mais / plus, menos / moins, maior / plus grand, menor / plus petit, melhor / meilleur - mieux, pior / pire - plus mal) or several words (mais or menos N, ADJ, ADV). On the other hand, the 2-gram 'cada vez' can be part of other multi-word units expressing iteration (de cada vez (que) / (à) chaque fois (que)) or 'dropper' ([a certain quantity] de cada vez / à la fois). This raises the challenge of ambiguity, well known to biotranslators and still often problematic for NMT. Moreover, units expressing quantitative or qualitative progression may raise other translation challenges when they are coordinated (with or without repetition of the 2-gram 'cada vez'), when they are split (cada vez (...) COMP), or when they combine with verbs or nouns to form extended translation units whose translation into French can result in a more concise solution we refer to as 'lexicalisation'. We established a biotranslation model based on a manually aligned French-Portuguese parallel literary corpus and online searchable French-Portuguese aligned corpora (translation memories).
We selected a sample of occurrences of these multi-word units including several translation challenges. These occurrences were selected from a Portuguese journalistic corpus. They therefore belong to general language, whereas the EU's translation memories cover the domains dealt with by its institutions, which represents an additional challenge, considering the critical importance of domain coverage in the data to NMT performance quality. The selected occurrences were translated into French by the EU SMT system in 2015 (MT@EC) and 2019 (eTranslation Legacy) and by eTranslation (the EU NMT system) in 2019 and 2021. Firstly, MT output was analysed according to two general criteria: 'non-literality', that is, translation into French without 'chaque', and acceptability from a semantic point of view, that is, MT output without any false meaning, opposite meaning or nonsense. Then we looked at specific challenges, some of which could lead to original solutions, worthy of a professional human translator, such as lexicalisation, change of grammatical category or 'recategorisation', and 'naturalisation', that is, phraseological or syntactic rearrangement that makes the target text more idiomatic. The results show that MT is improving, especially according to the criterion of non-literality. Original solutions are still rare, but they are diversifying in NMT output. Nevertheless, NMT remains imperfect, not least because of the inherent ambiguity of natural languages and the inevitable gaps in the data on which these systems are based. (...)
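The two surface-level notions in the abstract, spotting a contiguous 'cada vez COMP' unit in Portuguese and checking French output for 'non-literality' (no 'chaque'), can be illustrated with a minimal Python sketch. This is not the author's method or tooling: the function names, the regular expressions, and the example sentences are all hypothetical and written only for illustration. It covers only the single-word comparatives listed in the abstract and only contiguous units, so split units (cada vez (...) COMP) and the semantic acceptability criterion are out of scope.

```python
import re

# Single-word comparatives listed in the abstract (COMP).
COMPARATIVES = ("mais", "menos", "maior", "menor", "melhor", "pior")

# Hypothetical pattern: 'cada vez' immediately followed by a comparative.
# 'cada vez que' (iteration) and 'de cada vez' ('dropper') do not match,
# because the word after 'vez' is not a comparative.
_PROGRESSION = re.compile(
    r"\bcada\s+vez\s+(?:" + "|".join(COMPARATIVES) + r")\b",
    re.IGNORECASE,
)

# Non-literality as defined in the abstract: French output without 'chaque'.
_CHAQUE = re.compile(r"\bchaque\b", re.IGNORECASE)


def has_progression_unit(portuguese: str) -> bool:
    """Return True if the sentence contains a contiguous 'cada vez COMP' unit."""
    return _PROGRESSION.search(portuguese) is not None


def is_non_literal(french: str) -> bool:
    """Return True if the French output does not contain the word 'chaque'."""
    return _CHAQUE.search(french) is None


if __name__ == "__main__":
    # Invented example sentences, not taken from the study's corpus.
    print(has_progression_unit("Os preços estão cada vez mais altos."))
    print(has_progression_unit("Cada vez que chove, fico em casa."))
    print(is_non_literal("Les prix sont de plus en plus élevés."))
    print(is_non_literal("Les prix sont chaque fois plus élevés."))
```

A pattern this simple also shows why identification is "not straightforward": anything inserted between 'cada vez' and the comparative defeats it, which is exactly the split-unit case the abstract singles out.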
id RCAP_2cadd149b155d687ef58ce4d72e04b5a
oai_identifier_str oai:repositorio-aberto.up.pt:10216/147338
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Traduction d'unités polylexicales du portugais en français par MT@EC et eTranslation
author Bacquelaine, Françoise
author_facet Bacquelaine, Françoise
author_role author
dc.contributor.author.fl_str_mv Bacquelaine, Françoise
publishDate 2022
dc.date.none.fl_str_mv 2022-08
2022-08-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/10216/147338
url https://hdl.handle.net/10216/147338
dc.language.iso.fl_str_mv fra
language fra
dc.relation.none.fl_str_mv 1112-3974
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799135880451981312