Parallel corpora word alignment and applications

Simões, Alberto

Parallel corpora word alignment and applications

Detalhes bibliográficos
Autor(a) principal:	Simões, Alberto
Data de Publicação:	2004
Tipo de documento:	Dissertação
Idioma:	eng
Título da fonte:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo:	http://hdl.handle.net/1822/677
Resumo:	Parallel corpora are valuable resources on natural language processing and, in special, on the translation area. They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. In this document, we talk about some processes related with the parallel corpora life cycle. We will focus on the parallel corpora word alignment. The necessity for a robust word aligner arrived with the TerminUM project which goal is to gather parallel corpora from different sources, align, analyze and use them to create bilingual resources like terminology or translation memories for machine translation. Aligner, an open-source word aligner developed by Djoerd Hiemstra. Its results were interesting but it worked only for small sized corpora. The work done began with the reengineering of Twente-Aligner, followed by the analysis of the alignment results and the development of several tools based on the extracted probabilistic dictionaries. The re-engineering process was based on formal methods: the algorithms and data structures were formalized, optimized and re-implemented. The timings and alignment results were analysed. The speed improvement derived from the re-engineering process and the scale-up derived of the alignment by chunks, permitted the alignment of bigger corpora. Bigger corpora makes dictionaries quality raise, and this makes new problems and new ideas possible. The probabilistic dictionaries created by the alignment process were used in different tasks. A first pair of tools was developed to search the dictionaries and their relation to the corpora. The probabilistic dictionaries were used to calculate a measure of how two sentences are translations of each other. This naive measure was used to prototype tools for aligning word sequences, to extract multiword terminology from corpora, and a “by example” machine translation software.

Metadados do item

id	RCAP_2b67608e1f9f307e292053b091c3d309
oai_identifier_str	oai:repositorium.sdum.uminho.pt:1822/677
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Parallel corpora word alignment and applications681.380182.035Parallel corpora are valuable resources on natural language processing and, in special, on the translation area. They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. In this document, we talk about some processes related with the parallel corpora life cycle. We will focus on the parallel corpora word alignment. The necessity for a robust word aligner arrived with the TerminUM project which goal is to gather parallel corpora from different sources, align, analyze and use them to create bilingual resources like terminology or translation memories for machine translation. Aligner, an open-source word aligner developed by Djoerd Hiemstra. Its results were interesting but it worked only for small sized corpora. The work done began with the reengineering of Twente-Aligner, followed by the analysis of the alignment results and the development of several tools based on the extracted probabilistic dictionaries. The re-engineering process was based on formal methods: the algorithms and data structures were formalized, optimized and re-implemented. The timings and alignment results were analysed. The speed improvement derived from the re-engineering process and the scale-up derived of the alignment by chunks, permitted the alignment of bigger corpora. Bigger corpora makes dictionaries quality raise, and this makes new problems and new ideas possible. The probabilistic dictionaries created by the alignment process were used in different tasks. A first pair of tools was developed to search the dictionaries and their relation to the corpora. The probabilistic dictionaries were used to calculate a measure of how two sentences are translations of each other. This naive measure was used to prototype tools for aligning word sequences, to extract multiword terminology from corpora, and a “by example” machine translation software.Os corpora paralelos são recursos muito valiosos no processamento da linguagem natural e, em especial, na área da tradução. Podem ser usados não só por tradutores, mas também analisados e processados por computadores para aprender e extrair informação sobre as línguas. Neste documento, falamos sobre alguns dos processos relacionados como ciclo de vida dos corpora paralelos. Iremo-nos focar no alinhamento de corpora paralelo à palavra. A necessidade de um alinhador à palavra robusto apareceu com o projecto TerminUM, que tem como principal objectivo recolher corpora paralelos de diferentes fontes, alinhar e usá-los para criar recursos bilingues como terminologia ou memórias de tradução para tradução automática. O ponto de arranque foi o Twente-Aligner, um alinhador à palavra open-source, desenvolvido por Djoerd Hiemstra. Os seus resultados eram interessantes mas só funcionava para corpora de tamanhos pequenos. O trabalho realizado iniciou com a re-engenharia do Twente-Aligner, seguida pela análise dos resultados do alinhamento e o desenvolvimento de várias ferramentas baseadas nos dicionários probabilísticos extraídos. O processo de re-engenharia foi baseado em métodos formais: os algoritmos e estruturas de dados foram formalizados, optimizados e re-implementados. Os tempos e resultados de alinhamento foram analizados. Os melhoramentos em velocidade derivados do processo de re-engenharia e a escalabilidade derivada do alinhamento por fatias, permitiu o alinhamento de corpora maiores. Corpora maiores fazem aumentar a qualidade dos dicionários, o que torna novos problemas e ideias possíveis. Os dicionários probabilísticos criados pelo processo de alinhamento foram usados em tarefas diferentes. Um primeiro par de ferramentas foi desenvolvido para procurar nos dicionários e a sua relação com os corpora. Os dicionários probabilísticos foram usados para calcular uma medida de quão duas frases são tradução uma da outra. Esta medida foi usada para prototipar ferramentas para o alinhamento de sequências de palavras, extrair terminologia multipalavra dos corpora, e uma aplicação automática de tradução "por exemplo".Universidade do MinhoSimões, Alberto20042004-01-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/1822/677enginfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-07-21T11:55:13Zoai:repositorium.sdum.uminho.pt:1822/677Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T18:44:44.592560Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv	Parallel corpora word alignment and applications
title	Parallel corpora word alignment and applications
spellingShingle	Parallel corpora word alignment and applications Simões, Alberto 681.3 801 82.035
title_short	Parallel corpora word alignment and applications
title_full	Parallel corpora word alignment and applications
title_fullStr	Parallel corpora word alignment and applications
title_full_unstemmed	Parallel corpora word alignment and applications
title_sort	Parallel corpora word alignment and applications
author	Simões, Alberto
author_facet	Simões, Alberto
author_role	author
dc.contributor.none.fl_str_mv	Universidade do Minho
dc.contributor.author.fl_str_mv	Simões, Alberto
dc.subject.por.fl_str_mv	681.3 801 82.035
topic	681.3 801 82.035
description	Parallel corpora are valuable resources on natural language processing and, in special, on the translation area. They can be used not only by translators, but also analyzed and processed by computers to learn and extract information about the languages. In this document, we talk about some processes related with the parallel corpora life cycle. We will focus on the parallel corpora word alignment. The necessity for a robust word aligner arrived with the TerminUM project which goal is to gather parallel corpora from different sources, align, analyze and use them to create bilingual resources like terminology or translation memories for machine translation. Aligner, an open-source word aligner developed by Djoerd Hiemstra. Its results were interesting but it worked only for small sized corpora. The work done began with the reengineering of Twente-Aligner, followed by the analysis of the alignment results and the development of several tools based on the extracted probabilistic dictionaries. The re-engineering process was based on formal methods: the algorithms and data structures were formalized, optimized and re-implemented. The timings and alignment results were analysed. The speed improvement derived from the re-engineering process and the scale-up derived of the alignment by chunks, permitted the alignment of bigger corpora. Bigger corpora makes dictionaries quality raise, and this makes new problems and new ideas possible. The probabilistic dictionaries created by the alignment process were used in different tasks. A first pair of tools was developed to search the dictionaries and their relation to the corpora. The probabilistic dictionaries were used to calculate a measure of how two sentences are translations of each other. This naive measure was used to prototype tools for aligning word sequences, to extract multiword terminology from corpora, and a “by example” machine translation software.
publishDate	2004
dc.date.none.fl_str_mv	2004 2004-01-01T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1822/677
url	http://hdl.handle.net/1822/677
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799132197474533376

Parallel corpora word alignment and applications

Registros relacionados