Embedding Propagation over Heterogeneous Information Networks

Carmo, Paulo Ricardo Viviurka do


Bibliographic details
Main author:	Carmo, Paulo Ricardo Viviurka do
Publication date:	2022
Document type:	Master's dissertation
Language:	eng
Source title:	Biblioteca Digital de Teses e Dissertações da USP
Full text:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-11012023-172819/
Abstract:	To be used in machine learning tasks, text data must be cleaned and transformed into a structured representation. Recently, neural embeddings have been used to encode text data in low-dimensional latent spaces. For example, pre-trained neural language models such as BERT can represent words, sentences, or documents as fixed-dimension embedding vectors. Another way to model text data is to use heterogeneous information networks, a structure that models multi-typed data while respecting its relations and characteristics. Heterogeneous information networks pose their own challenges for off-the-shelf machine learning methods. Network embedding methods extract an embedding vector for each node in an information network. However, these methods usually use only the network topology and, sometimes, metadata about the relationships. Embedding propagation methods allow features previously generated by pre-trained methods to be propagated to all network nodes. Information networks in which some nodes contain textual information can use features from pre-trained neural language models for propagation. This master's dissertation presents an embedding propagation method for heterogeneous information networks with some textual nodes. The proposed method combines pre-trained neural language models with the topology of heterogeneous information networks through a regularization function to generate embeddings for non-textual nodes.
Three papers reporting use-case experiments that evaluate and validate the proposed method are presented, one of which extends the experiments of an earlier paper: (1) Embedding Propagation over Heterogeneous Event Networks presents the results of the proposed method for event analysis, where it achieved the best performance by at least 3% MRR@k in all scenarios; (2) TRENCHANT: TRENd prediCtion on Heterogeneous informAtion NeTworks extends Commodities trend link prediction on heterogeneous information networks and evaluates the proposed method against network embeddings in the task of predicting commodity price trends, where it achieved the best performance in some scenarios, with its best result an 8% better F1 when predicting weekly soybean price trends; and (3) NatUKE: Benchmark for Natural Product Knowledge Extraction from Academic Literature evaluates network embedding methods for unsupervised knowledge extraction, where the proposed method achieved the best performance in most scenarios, most notably 43% more Hits@1 than the baselines when extracting the type of isolation process used to obtain a molecule from a certain species. The presented papers show, across three different use cases and experiments, that the proposed method achieves the research goals of propagating the initial embeddings from some textual nodes to the remaining nodes in a heterogeneous information network and of allowing dynamic insertion of new nodes in the embedding propagation process.
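The abstract describes propagation through a regularization function but does not give its form. As a minimal sketch only, and not the dissertation's actual formulation, the Python below assumes a common graph-regularization objective: neighbouring nodes are pulled toward each other, and textual nodes are additionally pulled toward their pre-trained embedding with a hypothetical weight mu. The function name, the parameters, and the toy network are all illustrative assumptions.

```python
import numpy as np

def propagate_embeddings(adjacency, initial, has_text, mu=1.0, iterations=100):
    """Jacobi iteration for an ASSUMED objective (not taken from the dissertation):

        1/2 * sum_ij A_ij * ||f_i - f_j||^2  +  mu * sum_{i textual} ||f_i - y_i||^2

    adjacency: symmetric (n, n) edge-weight matrix A
    initial:   (n, d) matrix y; rows of non-textual nodes are ignored
    has_text:  length-n boolean mask marking nodes that carry text
    """
    adjacency = np.asarray(adjacency, dtype=float)
    initial = np.asarray(initial, dtype=float)
    has_text = np.asarray(has_text, dtype=bool)

    degree = adjacency.sum(axis=1)
    anchor = np.where(has_text, mu, 0.0)   # pull toward y only for textual nodes
    denom = degree + anchor
    denom[denom == 0] = 1.0                # isolated non-textual nodes stay at zero

    # Start textual nodes at their pre-trained vectors, all others at zero.
    f = np.where(has_text[:, None], initial, 0.0)
    for _ in range(iterations):
        # Each node moves to the weighted mean of its neighbours (and of y_i if textual).
        f = (adjacency @ f + anchor[:, None] * initial) / denom[:, None]
    return f

# Toy network (hypothetical): two textual event nodes linked through one non-textual entity.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
y = np.array([[1.0, 0.0],    # stand-in for a language-model embedding of event 0
              [0.0, 0.0],    # entity node has no text, so this row is ignored
              [0.0, 1.0]])   # stand-in for a language-model embedding of event 2
embeddings = propagate_embeddings(A, y, has_text=[True, False, True])
```

In this toy case the non-textual node ends up midway between its two textual neighbours, which is the behaviour the abstract attributes to propagation: nodes without text receive an embedding in the same space as the textual ones.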

Item metadata

id	USP_4d27322d658306906429951451909d1a
oai_identifier_str	oai:teses.usp.br:tde-11012023-172819
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
dc.title.none.fl_str_mv	Embedding Propagation over Heterogeneous Information Networks; Propagação de Embeddings em Redes Heterogêneas de Informação
title	Embedding Propagation over Heterogeneous Information Networks
author	Carmo, Paulo Ricardo Viviurka do
author_facet	Carmo, Paulo Ricardo Viviurka do
author_role	author
dc.contributor.none.fl_str_mv	Marcacini, Ricardo Marcondes
dc.contributor.author.fl_str_mv	Carmo, Paulo Ricardo Viviurka do
dc.subject.por.fl_str_mv	Embedding propagation; Heterogeneous information network; Network embedding; Propagação de embeddings; Redes heterogêneas
topic	Embedding propagation; Heterogeneous information network; Network embedding; Propagação de embeddings; Redes heterogêneas
publishDate	2022
dc.date.none.fl_str_mv	2022-10-07
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-11012023-172819/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-11012023-172819/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Release the content for public access. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Release the content for public access.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br; atendimento@aguia.usp.br
_version_	1815256852050477056
