Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings

Bibliographic details
Main author: Domingues, Gabriel Couto
Publication date: 2023
Document type: Bachelor's thesis
Language: eng
Source title: Repositório Institucional da UFRGS
Full text: http://hdl.handle.net/10183/267624
Abstract: Explicit knowledge models are artifacts that represent domain knowledge in an explicit way and can be used in different ways, including structuring data, supporting information retrieval and reasoning. The identification and classification of semantic relationships between concepts is a critical task in the development of knowledge models. This work investigates the use of machine learning approaches and pre-trained static word embeddings to classify semantic relationships between concepts, evaluating different techniques to deal with the challenges imposed by data imbalance in this context. We proposed a methodology for building datasets for the task of semantic relationship classification from word embeddings using WordNet as a semantic reference. By applying the proposed methodology, we generated two different datasets, with two variations, for the target task. Finally, we evaluated a set of general approaches for dealing with data imbalance in classification tasks. Our results indicated that while some strategies like SMOTE showed promise in specific metrics, the baseline model consistently achieved superior performance in terms of F1 score.
id UFRGS-2_da212795b0814357f858428c88c68677
oai_identifier_str oai:www.lume.ufrgs.br:10183/267624
network_acronym_str UFRGS-2
network_name_str Repositório Institucional da UFRGS
repository_id_str
spelling Domingues, Gabriel Couto | Carbonera, Joel Luis | Lopes Junior, Alcides Gonçalves | 2023-11-25T03:26:22Z | 2023 | http://hdl.handle.net/10183/267624 | 001187681 | application/pdf | eng | Aprendizado de máquina | Redes neurais | Semântica computacional | Word Embeddings | Supervised Learning | Ontologies | Knowledge Graphs | WordNet | Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/bachelorThesis | Universidade Federal do Rio Grande do Sul | Instituto de Informática | Porto Alegre, BR-RS | 2023 | Ciência da Computação: Ênfase em Ciência da Computação: Bacharelado | undergraduate | info:eu-repo/semantics/openAccess | reponame:Repositório Institucional da UFRGS | instname:Universidade Federal do Rio Grande do Sul (UFRGS) | instacron:UFRGS | TEXT | 001187681.pdf.txt | 001187681.pdf.txt | Extracted Text | text/plain | 114979 | http://www.lume.ufrgs.br/bitstream/10183/267624/2/001187681.pdf.txt | b05776512cdb57bb038e01af1f4f86fe | MD5 | 2 | ORIGINAL | 001187681.pdf | Full text (English) | application/pdf | 7370989 | http://www.lume.ufrgs.br/bitstream/10183/267624/1/001187681.pdf | 0507fca2c20a7f360583d95bff0985b6 | MD5 | 1 | 10183/267624 | 2023-11-26 04:25:54.705086 | oai:www.lume.ufrgs.br:10183/267624 | Repositório de Publicações | PUB | https://lume.ufrgs.br/oai/request | opendoar: | 2023-11-26T06:25:54 | Repositório Institucional da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS) | false
dc.title.pt_BR.fl_str_mv Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
title Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
spellingShingle Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
Domingues, Gabriel Couto
Aprendizado de máquina
Redes neurais
Semântica computacional
Word Embeddings
Supervised Learning
Ontologies
Knowledge Graphs
WordNet
title_short Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
title_full Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
title_fullStr Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
title_full_unstemmed Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
title_sort Evaluating data imbalance approaches for classifying semantic relations using machine learning and word embeddings
author Domingues, Gabriel Couto
author_facet Domingues, Gabriel Couto
author_role author
dc.contributor.author.fl_str_mv Domingues, Gabriel Couto
dc.contributor.advisor1.fl_str_mv Carbonera, Joel Luis
dc.contributor.advisor-co1.fl_str_mv Lopes Junior, Alcides Gonçalves
contributor_str_mv Carbonera, Joel Luis
Lopes Junior, Alcides Gonçalves
dc.subject.por.fl_str_mv Aprendizado de máquina
Redes neurais
Semântica computacional
topic Aprendizado de máquina
Redes neurais
Semântica computacional
Word Embeddings
Supervised Learning
Ontologies
Knowledge Graphs
WordNet
dc.subject.eng.fl_str_mv Word Embeddings
Supervised Learning
Ontologies
Knowledge Graphs
WordNet
description Explicit knowledge models are artifacts that represent domain knowledge in an explicit way and can be used in different ways, including structuring data, supporting information retrieval and reasoning. The identification and classification of semantic relationships between concepts is a critical task in the development of knowledge models. This work investigates the use of machine learning approaches and pre-trained static word embeddings to classify semantic relationships between concepts, evaluating different techniques to deal with the challenges imposed by data imbalance in this context. We proposed a methodology for building datasets for the task of semantic relationship classification from word embeddings using WordNet as a semantic reference. By applying the proposed methodology, we generated two different datasets, with two variations, for the target task. Finally, we evaluated a set of general approaches for dealing with data imbalance in classification tasks. Our results indicated that while some strategies like SMOTE showed promise in specific metrics, the baseline model consistently achieved superior performance in terms of F1 score.
publishDate 2023
dc.date.accessioned.fl_str_mv 2023-11-25T03:26:22Z
dc.date.issued.fl_str_mv 2023
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/bachelorThesis
format bachelorThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10183/267624
dc.identifier.nrb.pt_BR.fl_str_mv 001187681
url http://hdl.handle.net/10183/267624
identifier_str_mv 001187681
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFRGS
instname:Universidade Federal do Rio Grande do Sul (UFRGS)
instacron:UFRGS
instname_str Universidade Federal do Rio Grande do Sul (UFRGS)
instacron_str UFRGS
institution UFRGS
reponame_str Repositório Institucional da UFRGS
collection Repositório Institucional da UFRGS
bitstream.url.fl_str_mv http://www.lume.ufrgs.br/bitstream/10183/267624/2/001187681.pdf.txt
http://www.lume.ufrgs.br/bitstream/10183/267624/1/001187681.pdf
bitstream.checksum.fl_str_mv b05776512cdb57bb038e01af1f4f86fe
0507fca2c20a7f360583d95bff0985b6
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
repository.mail.fl_str_mv
_version_ 1815447353042141184