A generic and open framework for multiword expressions treatment : from acquisition to applications

Bibliographic details
Main author: Ramisch, Carlos Eduardo
Publication date: 2012
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da UFRGS
Full text: http://hdl.handle.net/10183/65777
Abstract: The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language-independent and integrated, and includes a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.
id URGS_1d9634e4c23708e4430cc9266f7b8bf9
oai_identifier_str oai:www.lume.ufrgs.br:10183/65777
network_acronym_str URGS
network_name_str Biblioteca Digital de Teses e Dissertações da UFRGS
repository_id_str 1853
spelling Ramisch, Carlos Eduardo
Villavicencio, Aline
Boitet, Christian
Universidade Federal do Rio Grande do Sul
Instituto de Informática
Programa de Pós-Graduação em Computação
Porto Alegre, BR-RS
2012
doctorate
ORIGINAL 000870122.pdf Full text (English) application/pdf 2392081
TEXT 000870122.pdf.txt Extracted Text text/plain 720049
THUMBNAIL 000870122.pdf.jpg Generated Thumbnail image/jpeg 1041
2021-05-07 05:05:44.633832
opendoar:1853
dc.title.pt_BR.fl_str_mv A generic and open framework for multiword expressions treatment : from acquisition to applications
title A generic and open framework for multiword expressions treatment : from acquisition to applications
author Ramisch, Carlos Eduardo
author_facet Ramisch, Carlos Eduardo
author_role author
dc.contributor.author.fl_str_mv Ramisch, Carlos Eduardo
dc.contributor.advisor1.fl_str_mv Villavicencio, Aline
dc.contributor.advisor-co1.fl_str_mv Boitet, Christian
contributor_str_mv Villavicencio, Aline
Boitet, Christian
dc.subject.por.fl_str_mv Linguagem natural
Linguística computacional
topic Linguagem natural
Linguística computacional
Natural language processing
Computational linguistics
Multiword expressions
Lexical acquisition
Machine translation
Lexicography
Corpus linguistics
dc.subject.eng.fl_str_mv Natural language processing
Computational linguistics
Multiword expressions
Lexical acquisition
Machine translation
Lexicography
Corpus linguistics
publishDate 2012
dc.date.issued.fl_str_mv 2012
dc.date.accessioned.fl_str_mv 2013-01-31T01:41:19Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10183/65777
dc.identifier.nrb.pt_BR.fl_str_mv 000870122
url http://hdl.handle.net/10183/65777
identifier_str_mv 000870122
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFRGS
instname:Universidade Federal do Rio Grande do Sul (UFRGS)
instacron:UFRGS
instname_str Universidade Federal do Rio Grande do Sul (UFRGS)
instacron_str UFRGS
institution UFRGS
reponame_str Biblioteca Digital de Teses e Dissertações da UFRGS
collection Biblioteca Digital de Teses e Dissertações da UFRGS
bitstream.url.fl_str_mv http://www.lume.ufrgs.br/bitstream/10183/65777/1/000870122.pdf
http://www.lume.ufrgs.br/bitstream/10183/65777/2/000870122.pdf.txt
http://www.lume.ufrgs.br/bitstream/10183/65777/3/000870122.pdf.jpg
bitstream.checksum.fl_str_mv c8ac7fa921d60dd64bbf4875bbdb468d
492b874a2be4150a03743c57e41b5582
02693be9486ff19750e602f8e93f9109
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
repository.mail.fl_str_mv lume@ufrgs.br||lume@ufrgs.br
_version_ 1810085249123614720