Parallel corpora word alignment and applications

Bibliographic details
Main author: Simões, Alberto
Publication date: 2004
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/1822/677
Abstract: Parallel corpora are valuable resources in natural language processing and, in particular, in translation. They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. This document discusses some of the processes in the parallel corpora life cycle, focusing on word alignment of parallel corpora. The need for a robust word aligner arose with the TerminUM project, whose goal is to gather parallel corpora from different sources and to align, analyze and use them to create bilingual resources such as terminology or translation memories for machine translation. The starting point was Twente-Aligner, an open-source word aligner developed by Djoerd Hiemstra. Its results were interesting, but it worked only for small corpora. The work began with the re-engineering of Twente-Aligner, followed by the analysis of the alignment results and the development of several tools based on the extracted probabilistic dictionaries. The re-engineering process was based on formal methods: the algorithms and data structures were formalized, optimized and re-implemented. The timings and alignment results were analyzed. The speed improvement derived from the re-engineering process, together with the scalability gained by aligning in chunks, permitted the alignment of bigger corpora. Bigger corpora raise the quality of the dictionaries, which opens up new problems and new ideas. The probabilistic dictionaries created by the alignment process were used in different tasks. A first pair of tools was developed to search the dictionaries and their relation to the corpora. The probabilistic dictionaries were also used to calculate a measure of how likely two sentences are to be translations of each other. This naive measure was used to prototype tools for aligning word sequences, for extracting multiword terminology from corpora, and for “by example” machine translation.
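The abstract does not give the formula behind the sentence-level translation measure, so the following is only a minimal sketch of one plausible form, not the method used in the dissertation. It assumes a probabilistic dictionary shaped as a mapping from each source word to its candidate translations with probabilities; the names `ProbDict`, `translation_score` and `toy_dictionary`, and the toy probabilities, are hypothetical.

```python
from typing import Dict, List

# Hypothetical shape for a probabilistic dictionary:
# source word -> {candidate target word: translation probability}.
ProbDict = Dict[str, Dict[str, float]]


def translation_score(source: List[str], target: List[str],
                      dictionary: ProbDict) -> float:
    """Average, over the source words, of the best dictionary probability
    linking each source word to any word of the target sentence.

    This is an assumed, naive measure for illustration only.
    """
    if not source or not target:
        return 0.0
    total = 0.0
    for word in source:
        candidates = dictionary.get(word, {})
        # Best-supported translation of this word among the target words.
        total += max(candidates.get(t, 0.0) for t in target)
    return total / len(source)


# Toy data, invented for the example (Portuguese -> English).
toy_dictionary: ProbDict = {
    "casa": {"house": 0.8, "home": 0.2},
    "branca": {"white": 0.9},
}

good = translation_score(["casa", "branca"], ["white", "house"], toy_dictionary)
bad = translation_score(["casa", "branca"], ["green", "tree"], toy_dictionary)
print(good > bad)
```

With the toy data, the sentence pair whose words are supported by the dictionary scores higher than the unrelated pair. A measure of this kind ignores word order and fertility, which is consistent with the abstract calling its own measure naive.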
id RCAP_2b67608e1f9f307e292053b091c3d309
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/677
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.title.none.fl_str_mv Parallel corpora word alignment and applications
author Simões, Alberto
author_facet Simões, Alberto
author_role author
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Simões, Alberto
dc.subject.por.fl_str_mv 681.3
801
82.035
publishDate 2004
dc.date.none.fl_str_mv 2004
2004-01-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1822/677
url http://hdl.handle.net/1822/677
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP