A generic and open framework for multiword expressions treatment : from acquisition to applications
Autor(a) principal: | |
---|---|
Data de Publicação: | 2012 |
Tipo de documento: | Tese |
Idioma: | eng |
Título da fonte: | Biblioteca Digital de Teses e Dissertações da UFRGS |
Texto Completo: | http://hdl.handle.net/10183/65777 |
Resumo: | The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work. |
id |
URGS_1d9634e4c23708e4430cc9266f7b8bf9 |
---|---|
oai_identifier_str |
oai:www.lume.ufrgs.br:10183/65777 |
network_acronym_str |
URGS |
network_name_str |
Biblioteca Digital de Teses e Dissertações da UFRGS |
repository_id_str |
1853 |
spelling |
Ramisch, Carlos EduardoVillavicencio, AlineBoitet, Christian2013-01-31T01:41:19Z2012http://hdl.handle.net/10183/65777000870122The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.application/pdfengLinguagem naturalLinguística computacionalNatural language processingComputational linguisticsMultiword expressionsLexical acquisitionMachine translationLexicographyCorpus linguisticsA generic and open framework for multiword expressions treatment : from acquisition to applicationsinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisUniversidade Federal do Rio Grande do SulInstituto de InformáticaPrograma de Pós-Graduação em ComputaçãoPorto Alegre, BR-RS2012doutoradoinfo:eu-repo/semantics/openAccessreponame:Biblioteca Digital de Teses e Dissertações da UFRGSinstname:Universidade Federal do Rio Grande do Sul (UFRGS)instacron:UFRGSORIGINAL000870122.pdf000870122.pdfTexto completo (inglês)application/pdf2392081http://www.lume.ufrgs.br/bitstream/10183/65777/1/000870122.pdfc8ac7fa921d60dd64bbf4875bbdb468dMD51TEXT000870122.pdf.txt000870122.pdf.txtExtracted Texttext/plain720049http://www.lume.ufrgs.br/bitstream/10183/65777/2/000870122.pdf.txt492b874a2be4150a03743c57e41b5582MD52THUMBNAIL000870122.pdf.jpg000870122.pdf.jpgGenerated Thumbnailimage/jpeg1041http://www.lume.ufrgs.br/bitstream/10183/65777/3/000870122.pdf.jpg02693be9486ff19750e602f8e93f9109MD5310183/657772021-05-07 05:05:44.633832oai:www.lume.ufrgs.br:10183/65777Biblioteca Digital de Teses e Dissertaçõeshttps://lume.ufrgs.br/handle/10183/2PUBhttps://lume.ufrgs.br/oai/requestlume@ufrgs.br||lume@ufrgs.bropendoar:18532021-05-07T08:05:44Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)false |
dc.title.pt_BR.fl_str_mv |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
title |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
spellingShingle |
A generic and open framework for multiword expressions treatment : from acquisition to applications Ramisch, Carlos Eduardo Linguagem natural Linguística computacional Natural language processing Computational linguistics Multiword expressions Lexical acquisition Machine translation Lexicography Corpus linguistics |
title_short |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
title_full |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
title_fullStr |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
title_full_unstemmed |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
title_sort |
A generic and open framework for multiword expressions treatment : from acquisition to applications |
author |
Ramisch, Carlos Eduardo |
author_facet |
Ramisch, Carlos Eduardo |
author_role |
author |
dc.contributor.author.fl_str_mv |
Ramisch, Carlos Eduardo |
dc.contributor.advisor1.fl_str_mv |
Villavicencio, Aline |
dc.contributor.advisor-co1.fl_str_mv |
Boitet, Christian |
contributor_str_mv |
Villavicencio, Aline Boitet, Christian |
dc.subject.por.fl_str_mv |
Linguagem natural Linguística computacional |
topic |
Linguagem natural Linguística computacional Natural language processing Computational linguistics Multiword expressions Lexical acquisition Machine translation Lexicography Corpus linguistics |
dc.subject.eng.fl_str_mv |
Natural language processing Computational linguistics Multiword expressions Lexical acquisition Machine translation Lexicography Corpus linguistics |
description |
The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in the recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation about the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work. |
publishDate |
2012 |
dc.date.issued.fl_str_mv |
2012 |
dc.date.accessioned.fl_str_mv |
2013-01-31T01:41:19Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10183/65777 |
dc.identifier.nrb.pt_BR.fl_str_mv |
000870122 |
url |
http://hdl.handle.net/10183/65777 |
identifier_str_mv |
000870122 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da UFRGS instname:Universidade Federal do Rio Grande do Sul (UFRGS) instacron:UFRGS |
instname_str |
Universidade Federal do Rio Grande do Sul (UFRGS) |
instacron_str |
UFRGS |
institution |
UFRGS |
reponame_str |
Biblioteca Digital de Teses e Dissertações da UFRGS |
collection |
Biblioteca Digital de Teses e Dissertações da UFRGS |
bitstream.url.fl_str_mv |
http://www.lume.ufrgs.br/bitstream/10183/65777/1/000870122.pdf http://www.lume.ufrgs.br/bitstream/10183/65777/2/000870122.pdf.txt http://www.lume.ufrgs.br/bitstream/10183/65777/3/000870122.pdf.jpg |
bitstream.checksum.fl_str_mv |
c8ac7fa921d60dd64bbf4875bbdb468d 492b874a2be4150a03743c57e41b5582 02693be9486ff19750e602f8e93f9109 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS) |
repository.mail.fl_str_mv |
lume@ufrgs.br||lume@ufrgs.br |
_version_ |
1810085249123614720 |