Traduction d'unités polylexicales du portugais en français par MT@EC et eTranslation
Main author: | Bacquelaine, Françoise |
---|---|
Publication date: | 2022 |
Document type: | Article |
Language: | fra |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | https://hdl.handle.net/10216/147338 |
Abstract: | This paper aims to determine the extent to which the shift from statistical machine translation (SMT) to neural machine translation (NMT) improved the performance of European Union machine translation systems between 2015 and 2021 in terms of multi-word unit translation and domain coverage. To do so, we chose to test these systems on machine translation into French of multi-word units expressing quantitative and qualitative progression in Portuguese from Portugal. These units consist of the 2-gram 'cada vez' and a comparative adjective or adverb (cada vez COMP), and their word-for-word translation into French is not idiomatic (*chaque fois COMP). The most frequent translation into French is 'de COMP en COMP'. This implies that these multi-word units must be translated 'en bloc', but their identification is not straightforward. On the one hand, COMP is not fixed and may include one word (mais / plus, menos / moins, maior / plus grand, menor / plus petit, melhor / meilleur - mieux, pior / pire - plus mal) or several words (mais or menos N, ADJ, ADV). On the other hand, the 2-gram 'cada vez' can be part of other multi-word units expressing iteration (de cada vez (que) / (à) chaque fois (que)) or 'dropper' ([a certain quantity] de cada vez / à la fois). This raises the challenge of ambiguity, well known to biotranslators and still often problematic for NMT. Moreover, units expressing quantitative or qualitative progression may raise other translation challenges when they are coordinated (with or without repetition of the 2-gram 'cada vez'), when they are split (cada vez (...) COMP), or when they combine with verbs or nouns to form extended translation units whose translation into French can result in a more concise solution we refer to as 'lexicalisation'. We established a biotranslation model based on a manually aligned French-Portuguese parallel literary corpus and online, searchable French-Portuguese aligned corpora (translation memories). We selected a sample of occurrences of these multi-word units covering several translation challenges. These occurrences were selected from a Portuguese journalistic corpus. They therefore belong to general language, whereas the EU's translation memories cover the domains dealt with by its institutions. This represents an additional challenge, given the critical importance of domain coverage in the data to NMT performance. The selected occurrences were translated into French by the EU SMT system in 2015 (MT@EC) and 2019 (eTranslation Legacy) and by eTranslation (the EU NMT system) in 2019 and 2021. First, MT output was analysed according to two general criteria: 'non-literality', that is, translation into French without 'chaque', and semantic acceptability, that is, MT output without any false meaning, opposite meaning or nonsense. Then we looked at specific challenges, some of which could lead to original solutions worthy of a professional human translator, such as lexicalisation, change of grammatical category or 'recategorisation', and 'naturalisation', that is, phraseological or syntactic rearrangement that makes the target text more idiomatic. The results show that MT is improving, especially according to the criterion of non-literality. Original solutions are still rare, but they are diversifying in NMT output. Nevertheless, NMT remains imperfect, not least because of the inherent ambiguity of natural languages and the inevitable gaps in the data on which these systems are based. (...) |
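
As a rough illustration of two steps the abstract describes, the sketch below flags contiguous 'cada vez COMP' units in a Portuguese sentence and applies the 'non-literality' criterion (no 'chaque' in the French output). It is not the author's procedure: the function names, the regular expressions and the labels are assumptions made for this example, and split or coordinated units are only routed to an "other" bucket for manual review.

```python
import re

# Single-word comparatives listed in the abstract; "mais"/"menos" may also
# introduce a multi-word COMP (mais/menos + N, ADJ, ADV).
COMPARATIVES = ("mais", "menos", "maior", "menor", "melhor", "pior")

# Contiguous progression pattern: 'cada vez' directly followed by a comparative.
PROGRESSION = re.compile(
    r"\bcada vez (?:%s)\b" % "|".join(COMPARATIVES), re.IGNORECASE
)
# 'de cada vez': iteration ('de cada vez que') or 'dropper' uses.
DE_CADA_VEZ = re.compile(r"\bde cada vez\b", re.IGNORECASE)
CADA_VEZ = re.compile(r"\bcada vez\b", re.IGNORECASE)


def classify(sentence: str) -> str:
    """Assign a coarse, hypothetical label to a Portuguese sentence."""
    if PROGRESSION.search(sentence):
        return "progression"
    if DE_CADA_VEZ.search(sentence):
        return "iteration_or_dropper"
    if CADA_VEZ.search(sentence):
        # e.g. split units (cada vez (...) COMP); needs manual review
        return "other"
    return "none"


def is_non_literal(french_output: str) -> bool:
    """Non-literality as defined in the abstract: no 'chaque' in the output."""
    return re.search(r"\bchaque\b", french_output, re.IGNORECASE) is None
```

For example, `classify("Há cada vez mais pessoas na cidade.")` falls in the progression bucket, and `is_non_literal("Il y a de plus en plus de gens.")` holds, whereas a literal rendering with 'chaque fois plus' fails the check. These sentences are invented for illustration and do not come from the paper's corpus. Semantic acceptability, the second criterion, requires human judgement and is not modelled here.
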
id | RCAP_2cadd149b155d687ef58ce4d72e04b5a |
---|---|
oai_identifier_str | oai:repositorio-aberto.up.pt:10216/147338 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str | 7160 |
dc.title.none.fl_str_mv | Traduction d'unités polylexicales du portugais en français par MT@EC et eTranslation |
author | Bacquelaine, Françoise |
author_facet | Bacquelaine, Françoise |
author_role | author |
dc.contributor.author.fl_str_mv | Bacquelaine, Françoise |
publishDate | 2022 |
dc.date.none.fl_str_mv | 2022-08 2022-08-01T00:00:00Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
format | article |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | https://hdl.handle.net/10216/147338 |
url | https://hdl.handle.net/10216/147338 |
dc.language.iso.fl_str_mv | fra |
language | fra |
dc.relation.none.fl_str_mv | 1112-3974 |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
eu_rights_str_mv | openAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799135880451981312 |