Descoberta automática de expressões multipalavras a partir de textos paralelos

Bibliographic details
Main author: Vargas, Natalie Lourenço
Publication date: 2018
Document type: Master's thesis
Language: Portuguese (por)
Source title: Repositório Institucional da UFSCAR
Full text: https://repositorio.ufscar.br/handle/ufscar/10836
Abstract: Multiword expressions (MWEs) are a current challenge for the field of Natural Language Processing, and different automatic methods have been proposed to discover and treat them. In this work we propose two new methods for bilingual MWE discovery in parallel texts, implemented as the Bilingual Discovery MWE Toolkit (BiDiMWEToolkit). The proposed methods build on similar ideas from related work and use bilingual word embeddings to find the best translations for the automatically discovered MWEs. In the first method, source and target MWEs are extracted separately using predefined morphosyntactic patterns and are then paired based on bilingual word embeddings. In the second method, only source MWEs are extracted, and the best translations are defined using bilingual word embeddings. From the experiments presented, we conclude that both methods are capable of performing bilingual discovery, but the second method proved more complete than the first: (1) it can generate translations without target MWEs, so no prior knowledge of the target language is needed, and (2) it can generate one-word translations, covering the cases in which an MWE is not translated as an expression.
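The two pairing strategies summarized in the abstract can be illustrated with a minimal sketch. This is not the BiDiMWEToolkit implementation: the toy vectors, the phrase-averaging step, the use of cosine similarity, and the function names `phrase_vector`, `pair_candidates` and `best_single_word` are all assumptions made for illustration.

```python
import math

# Toy vectors in one shared bilingual space (hypothetical values, illustration only).
PT = {
    "rede": [1.0, 0.2, 0.0],
    "neural": [0.2, 1.0, 0.0],
    "banco": [0.0, 0.1, 1.0],
    "de": [0.1, 0.1, 0.1],
    "dados": [0.0, 0.3, 0.9],
}
EN = {
    "neural": [0.2, 1.0, 0.0],
    "network": [1.0, 0.2, 0.0],
    "database": [0.0, 0.2, 1.0],
    "data": [0.0, 0.4, 0.8],
    "bank": [0.3, 0.0, 0.9],
}


def phrase_vector(words, table):
    """Represent a phrase as the average of its word vectors (an assumption)."""
    vectors = [table[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_candidates(source_cands, target_cands):
    """First strategy: pair each source MWE with its most similar target MWE."""
    pairs = {}
    for src in source_cands:
        src_vec = phrase_vector(src, PT)
        pairs[src] = max(
            target_cands,
            key=lambda tgt: cosine(src_vec, phrase_vector(tgt, EN)),
        )
    return pairs


def best_single_word(source_cand):
    """Second strategy: rank target words directly, so one-word translations are possible."""
    src_vec = phrase_vector(source_cand, PT)
    return max(EN, key=lambda word: cosine(src_vec, EN[word]))


SOURCE_CANDIDATES = [("rede", "neural"), ("banco", "de", "dados")]
TARGET_CANDIDATES = [("neural", "network"), ("data", "bank")]

print(pair_candidates(SOURCE_CANDIDATES, TARGET_CANDIDATES))
print(best_single_word(("banco", "de", "dados")))
```

With these toy vectors, the first strategy can only choose among the extracted target candidates, while the second can return a single word such as "database" for "banco de dados", which is the advantage the abstract attributes to the second method.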
id SCAR_11440e76892d0f2a8239702658e31b8a
oai_identifier_str oai:repositorio.ufscar.br:ufscar/10836
network_acronym_str SCAR
network_name_str Repositório Institucional da UFSCAR
repository_id_str 4322
spelling Vargas, Natalie Lourenço; Caseli, Helena de Medeiros
Funding: Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), CNPq: 132890/2016-0
Language pair used in the experiments: Portuguese-English
Files: dissertacao-natalie.pdf (application/pdf, size 876803); license.txt (text/plain; charset=utf-8, size 1957); dissertacao-natalie.pdf.txt (extracted text, text/plain, size 207628); dissertacao-natalie.pdf.jpg (thumbnail, image/jpeg, size 8273)
License: non-exclusive distribution license granted to the Universidade Federal de São Carlos (Base64-encoded license text omitted)
Last modified: 2023-09-18T18:31:19
OAI endpoint: https://repositorio.ufscar.br/oai/request (opendoar:4322)
dc.title.por.fl_str_mv Descoberta automática de expressões multipalavras a partir de textos paralelos
title Descoberta automática de expressões multipalavras a partir de textos paralelos
author Vargas, Natalie Lourenço
author_facet Vargas, Natalie Lourenço
author_role author
dc.contributor.authorlattes.por.fl_str_mv http://lattes.cnpq.br/7480933015410339
dc.contributor.author.fl_str_mv Vargas, Natalie Lourenço
dc.contributor.advisor1.fl_str_mv Caseli, Helena de Medeiros
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/6608582057810385
dc.contributor.authorID.fl_str_mv 1bcf41e1-a100-47d5-bfd8-a22aa5f40765
contributor_str_mv Caseli, Helena de Medeiros
dc.subject.por.fl_str_mv Expressões multipalavras
EM
Descoberta bilíngue
Descoberta monolíngue
Córpus paralelo
topic Expressões multipalavras
EM
Descoberta bilíngue
Descoberta monolíngue
Córpus paralelo
Multiword expressions
MWE
Bilingual discovery
Monolingual discovery
Parallel corpus
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::METODOLOGIA E TECNICAS DA COMPUTACAO
dc.subject.eng.fl_str_mv Multiword expressions
MWE
Bilingual discovery
Monolingual discovery
Parallel corpus
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO::METODOLOGIA E TECNICAS DA COMPUTACAO
description Multiword expressions (MWEs) are a current challenge for the field of Natural Language Processing, and different automatic methods have been proposed to discover and treat them. In this work we propose two new methods for bilingual MWE discovery in parallel texts, implemented as the Bilingual Discovery MWE Toolkit (BiDiMWEToolkit). The proposed methods build on similar ideas from related work and use bilingual word embeddings to find the best translations for the automatically discovered MWEs. In the first method, source and target MWEs are extracted separately using predefined morphosyntactic patterns and are then paired based on bilingual word embeddings. In the second method, only source MWEs are extracted, and the best translations are defined using bilingual word embeddings. From the experiments presented, we conclude that both methods are capable of performing bilingual discovery, but the second method proved more complete than the first: (1) it can generate translations without target MWEs, so no prior knowledge of the target language is needed, and (2) it can generate one-word translations, covering the cases in which an MWE is not translated as an expression.
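The extraction step that the description attributes to predefined morphosyntactic patterns can be sketched as matching sequences of part-of-speech tags over a tagged sentence. The tag names, the two patterns, and the function name `extract_candidates` below are hypothetical; the record does not list the patterns actually used in the dissertation.

```python
from typing import List, Sequence, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical patterns, written with Universal POS tag names for illustration.
PATTERNS: List[Tuple[str, ...]] = [
    ("NOUN", "ADJ"),          # e.g. "rede neural"
    ("NOUN", "ADP", "NOUN"),  # e.g. "banco de dados"
]


def extract_candidates(
    tagged: Sequence[Token],
    patterns: Sequence[Tuple[str, ...]] = PATTERNS,
) -> List[Tuple[str, ...]]:
    """Return every word n-gram whose tag sequence equals one of the patterns."""
    candidates = []
    for start in range(len(tagged)):
        for pattern in patterns:
            window = tagged[start:start + len(pattern)]
            if len(window) != len(pattern):
                continue  # not enough tokens left for this pattern
            if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
                candidates.append(tuple(word for word, _ in window))
    return candidates


SENTENCE: List[Token] = [
    ("a", "DET"), ("rede", "NOUN"), ("neural", "ADJ"), ("usa", "VERB"),
    ("um", "DET"), ("banco", "NOUN"), ("de", "ADP"), ("dados", "NOUN"),
]

print(extract_candidates(SENTENCE))
```

In the first method this kind of extraction would run on both sides of the parallel text, whereas in the second it would run on the source side only.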
publishDate 2018
dc.date.issued.fl_str_mv 2018-10-11
dc.date.accessioned.fl_str_mv 2019-01-14T18:49:29Z
dc.date.available.fl_str_mv 2019-01-14T18:49:29Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv VARGAS, Natalie Lourenço. Descoberta automática de expressões multipalavras a partir de textos paralelos. 2018. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2018. Disponível em: https://repositorio.ufscar.br/handle/ufscar/10836.
dc.identifier.uri.fl_str_mv https://repositorio.ufscar.br/handle/ufscar/10836
identifier_str_mv VARGAS, Natalie Lourenço. Descoberta automática de expressões multipalavras a partir de textos paralelos. 2018. Dissertação (Mestrado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2018. Disponível em: https://repositorio.ufscar.br/handle/ufscar/10836.
url https://repositorio.ufscar.br/handle/ufscar/10836
dc.language.iso.fl_str_mv por
language por
dc.relation.confidence.fl_str_mv 600
600
dc.relation.authority.fl_str_mv e36d4e63-960d-4f5c-9c93-f8b7f5f93d65
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus São Carlos
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.publisher.initials.fl_str_mv UFSCar
publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus São Carlos
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
instname_str Universidade Federal de São Carlos (UFSCAR)
instacron_str UFSCAR
institution UFSCAR
reponame_str Repositório Institucional da UFSCAR
collection Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv https://repositorio.ufscar.br/bitstream/ufscar/10836/1/dissertacao-natalie.pdf
https://repositorio.ufscar.br/bitstream/ufscar/10836/3/license.txt
https://repositorio.ufscar.br/bitstream/ufscar/10836/4/dissertacao-natalie.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/10836/5/dissertacao-natalie.pdf.jpg
bitstream.checksum.fl_str_mv 3cc1b0767078abaf81abdb1dfc66acdd
ae0398b6f8b235e40ad82cba6c50031d
8d8f3465a10ae9b3ddc924a0f371453e
311aa85e5934e9229f7f0f49ad3a8e0d
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
_version_ 1802136351070486528