A flexible compositional approach to word sense disambiguation

Bibliographic details
Main author: Alex de Paula Barros
Publication date: 2018
Document type: Master's thesis
Language: por
Source title: Repositório Institucional da UFMG
Full text: http://hdl.handle.net/1843/SLSC-BBKGTM
Abstract: Word sense disambiguation is the task of identifying which sense of a word is used in a sentence when the word has multiple meanings. Supervised machine learning methods, in which a classifier is trained for each distinct word on a corpus of manually sense-annotated examples, have been the most successful algorithms to date. One possible drawback is their lack of flexibility, since they require annotated examples for every word in the vocabulary. In contrast, knowledge-based methods do not require a classifier for each distinct word and are often built over lexico-semantic resources such as ontologies, thesauri, or machine-readable dictionaries. In this work, we propose a flexible compositional algorithm based on context-gloss comparisons: using a semantic distance measure, it compares the local context of a word, represented by its neighboring words, with the glosses of the possible senses the word can assume. The algorithm has three components, each based on a different information source: (i) sense frequency, obtained by counting the number of times a word occurs with each meaning in an annotated corpus; (ii) extended gloss, obtained by expanding a word's dictionary definition using related words in an ontology (e.g., car and automobile); and (iii) sense usage examples, obtained from inventories that provide sentences with usage examples for some senses. Our compositional approach is flexible in the sense that it does not depend on annotated examples and works well even when some or all of the three aforementioned knowledge sources are unavailable. We evaluated the performance of our algorithm for all possible combinations of the three components, simulating different scenarios of knowledge-source availability. The algorithm achieves an F1 score of 67.5 when all components are available, which compares favorably with a state-of-the-art knowledge-based system that achieves an F1 score of 66.4.
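Illustrative sketch (not taken from the thesis): the context-gloss comparison with three optional components described in the abstract could look roughly like the Python below. Everything in it is an assumption made for illustration: the Sense record, the toy sense inventory for "bank", the Jaccard word overlap used as a stand-in for the unspecified semantic distance measure, and the additive rule for combining the components. The thesis may define all of these differently.

```python
from dataclasses import dataclass, field


@dataclass
class Sense:
    """Hypothetical record for one candidate sense of a word."""
    name: str
    gloss: str
    related_glosses: list[str] = field(default_factory=list)  # extended gloss
    examples: list[str] = field(default_factory=list)         # usage examples
    frequency: int = 0                                        # annotated-corpus count


def tokens(text: str) -> set[str]:
    """Lowercased word set with surrounding punctuation stripped."""
    return {w.strip(".,;:!?()").lower() for w in text.split()} - {""}


def overlap(a: set[str], b: set[str]) -> float:
    """Jaccard similarity; a simple stand-in for a semantic distance measure."""
    return len(a & b) / len(a | b) if a and b else 0.0


def score(ctx, sense, total_freq, use_frequency, use_extended, use_examples):
    """Additive combination of the available components (an assumption)."""
    gloss = tokens(sense.gloss)
    if use_extended:
        for related in sense.related_glosses:
            gloss |= tokens(related)
    value = overlap(ctx, gloss)
    if use_examples and sense.examples:
        value += max(overlap(ctx, tokens(e)) for e in sense.examples)
    if use_frequency and total_freq > 0:
        value += sense.frequency / total_freq
    return value


def disambiguate(context, senses, use_frequency=True, use_extended=True,
                 use_examples=True):
    """Return the name of the highest-scoring sense for the given context."""
    ctx = tokens(context)
    total_freq = sum(s.frequency for s in senses)
    best = max(senses, key=lambda s: score(ctx, s, total_freq, use_frequency,
                                           use_extended, use_examples))
    return best.name


# Toy inventory for the word "bank"; glosses and counts are made up.
TOY_SENSES = [
    Sense("bank_river", "sloping land beside a body of water",
          ["the land along the edge of a river or lake"],
          ["They pulled the canoe up on the bank."], frequency=45),
    Sense("bank_finance",
          "a financial institution that accepts deposits and lends money",
          ["an institution that holds money for customers"],
          ["He cashed a check at the bank."], frequency=55),
]

if __name__ == "__main__":
    sentence = "She sat on the bank of the river and watched the water."
    print(disambiguate(sentence, TOY_SENSES))
```

Each use_* flag switches one knowledge source off, which mirrors the evaluation over all combinations of components mentioned in the abstract; with every flag off, the sketch falls back to a plain context-gloss overlap.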
id UFMG_8ac78be0d9806641f90bc3eb4b7411ed
oai_identifier_str oai:repositorio.ufmg.br:1843/SLSC-BBKGTM
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
dc.title.none.fl_str_mv A flexible compositional approach to word sense disambiguation
author Alex de Paula Barros
author_facet Alex de Paula Barros
author_role author
dc.contributor.none.fl_str_mv Nivio Ziviani
Adriano Alonso Veloso
Flavio Vinicius Diniz de Figueiredo
Renato Antonio Celso Ferreira
Wladmir Cardoso Brandão
dc.contributor.author.fl_str_mv Alex de Paula Barros
dc.subject.por.fl_str_mv Natural Language Processing
Word Sense Disambiguation
Information retrieval
Computing
Natural language processing (Computing)
publishDate 2018
dc.date.none.fl_str_mv 2018-07-27
2019-08-10T12:28:57Z
2019-08-10T12:28:57Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/SLSC-BBKGTM
url http://hdl.handle.net/1843/SLSC-BBKGTM
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
UFMG
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
UFMG
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br
_version_ 1816829648184541184