A wikification prediction model based on the combination of latent, dyadic and monadic features
Autor(a) principal: | |
---|---|
Data de Publicação: | 2016 |
Tipo de documento: | Tese |
Idioma: | eng |
Título da fonte: | Biblioteca Digital de Teses e Dissertações da USP |
Texto Completo: | http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29112016-164654/ |
Resumo: | Most of the reference information, nowadays, is found in repositories of documents semantically linked, created in a collaborative fashion and freely available in the web. Among the many problems faced by content providers in these repositories, one of the most important is Wikification, that is, the placement of links in the articles. These links have to support user navigation and should provide a deeper semantic interpretation of the content. Wikification is a hard task since the continuous growth of such repositories makes it increasingly demanding for editors. As consequence, they have their focus shifted from content creation, which should be their main objective. This has motivated the design of automatic Wikification tools which, traditionally, address two distinct problems: (a) how to identify which words (or phrases) in an article should be selected as anchors and (b) how to determine to which article the link, associated with the anchor, should point. Most of the methods in literature that address these problems are based on machine learning approaches which attempt to capture, through statistical features, characteristics of the concepts and its associations. Although these strategies handle the repository as a graph of concepts, normally they take limited advantage of the topological structure of this graph, as they describe it by means of human-engineered link statistical features. Despite the effectiveness of these machine learning methods, better models should take full advantage of the information topology if they describe it by means of data-oriented approaches such as matrix factorization. This indeed has been successfully done in other domains, such as movie recommendation. In this work, we fill this gap, proposing a wikification prediction model that combines the strengths of traditional predictors based on statistical features with a latent component which models the concept graph topology by means of matrix factorization. By comparing our model with a state-of-the-art wikification method, using a sample of Wikipedia articles, we obtained a gain up to 13% in F1 metric. We also provide a comprehensive analysis of the model performance showing the importance of the latent predictor component and the attributes derived from the associations between the concepts. The study still includes the analysis of the impact of ambiguous concepts, which allows us to conclude the model is resilient to ambiguity, even though does not include any explicitly disambiguation phase. We finally study the impact of selecting training samples from specific content quality classes, an information that is available in some respositories, such as Wikipedia. We empirically shown that the quality of the training samples impact on precision and overlinking, when comparing training performed using random quality samples versus high quality samples. |
id |
USP_898304ef5a7aee1ddb2968f378967b2d |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-29112016-164654 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
spelling |
A wikification prediction model based on the combination of latent, dyadic and monadic featuresUm modelo de previsão para Wikification baseado na combinação de atributos latentes, diádicos e monádicosAprendizado de máquinaFatoração matricialLink predictionMachine learningMatrix factorizationPrevisão de linksWikificaçãoWikificationWikipediaWikipédiaMost of the reference information, nowadays, is found in repositories of documents semantically linked, created in a collaborative fashion and freely available in the web. Among the many problems faced by content providers in these repositories, one of the most important is Wikification, that is, the placement of links in the articles. These links have to support user navigation and should provide a deeper semantic interpretation of the content. Wikification is a hard task since the continuous growth of such repositories makes it increasingly demanding for editors. As consequence, they have their focus shifted from content creation, which should be their main objective. This has motivated the design of automatic Wikification tools which, traditionally, address two distinct problems: (a) how to identify which words (or phrases) in an article should be selected as anchors and (b) how to determine to which article the link, associated with the anchor, should point. Most of the methods in literature that address these problems are based on machine learning approaches which attempt to capture, through statistical features, characteristics of the concepts and its associations. Although these strategies handle the repository as a graph of concepts, normally they take limited advantage of the topological structure of this graph, as they describe it by means of human-engineered link statistical features. Despite the effectiveness of these machine learning methods, better models should take full advantage of the information topology if they describe it by means of data-oriented approaches such as matrix factorization. This indeed has been successfully done in other domains, such as movie recommendation. In this work, we fill this gap, proposing a wikification prediction model that combines the strengths of traditional predictors based on statistical features with a latent component which models the concept graph topology by means of matrix factorization. By comparing our model with a state-of-the-art wikification method, using a sample of Wikipedia articles, we obtained a gain up to 13% in F1 metric. We also provide a comprehensive analysis of the model performance showing the importance of the latent predictor component and the attributes derived from the associations between the concepts. The study still includes the analysis of the impact of ambiguous concepts, which allows us to conclude the model is resilient to ambiguity, even though does not include any explicitly disambiguation phase. We finally study the impact of selecting training samples from specific content quality classes, an information that is available in some respositories, such as Wikipedia. We empirically shown that the quality of the training samples impact on precision and overlinking, when comparing training performed using random quality samples versus high quality samples.Atualmente, informações de referência são disponibilizadas através de repositórios de documentos semanticamente ligados, criados de forma colaborativa e com acesso livre na Web. Entre os muitos problemas enfrentados pelos provedores de conteúdo desses repositórios, destaca-se a Wikification, isto é, a inclusão de links nos artigos desses repositórios. Esses links possibilitam a navegação pelos artigos e permitem ao usuário um aprofundamento semântico do conteúdo. A Wikification é uma tarefa complexa, uma vez que o crescimento contínuo de tais repositórios resulta em um esforço cada vez maior dos editores. Como consequência, eles têm seu foco desviado da criação de conteúdo, que deveria ser o seu principal objetivo. Isso tem motivado o desenvolvimento de ferramentas de Wikification automática que, tradicionalmente, abordam dois problemas distintos: (a) como identificar que palavras (ou frases) em um artigo deveriam ser selecionados como texto de âncora e (b) como determinar para que artigos o link, associado ao texto de âncora, deveria apontar. A maioria dos métodos na literatura que abordam esses problemas usam aprendizado de máquina. Eles tentam capturar, através de atributos estatísticos, características dos conceitos e seus links. Embora essas estratégias tratam o repositório como um grafo de conceitos, normalmente elas pouco exploram a estrutura topológica do grafo, uma vez que se limitam a descrevê-lo por meio de atributos estatísticos dos links, projetados por especialistas humanos. Embora tais métodos sejam eficazes, novos modelos poderiam tirar mais proveito da topologia se a descrevessem por meio de abordagens orientados a dados, tais como a fatoração matricial. De fato, essa abordagem tem sido aplicada com sucesso em outros domínios como recomendação de filmes. Neste trabalho, propomos um modelo de previsão para Wikification que combina a força dos previsores tradicionais baseados em atributos estatísticos, projetados por seres humanos, com um componente de previsão latente, que modela a topologia do grafo de conceitos usando fatoração matricial. Ao comparar nosso modelo com o estado-da-arte em Wikification, usando uma amostra de artigos Wikipédia, observamos um ganho de até 13% em F1. Além disso, fornecemos uma análise detalhada do desempenho do modelo enfatizando a importância do componente de previsão latente e dos atributos derivados dos links entre os conceitos. Também analisamos o impacto de conceitos ambíguos, o que permite concluir que nosso modelo se porta bem mesmo diante de ambiguidade, apesar de não tratar explicitamente este problema. Ainda realizamos um estudo sobre o impacto da seleção das amostras de treino conforme a qualidade dos seus conteúdos, uma informação disponível em alguns repositórios, tais como a Wikipédia. Nós observamos que o treino com documentos de alta qualidade melhora a precisão do método, minimizando o uso de links desnecessários.Biblioteca Digitais de Teses e Dissertações da USPPimentel, Maria da Graça CamposFerreira, Raoni Simões2016-04-25info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisapplication/pdfhttp://www.teses.usp.br/teses/disponiveis/55/55134/tde-29112016-164654/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2017-09-04T21:05:35Zoai:teses.usp.br:tde-29112016-164654Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212017-09-04T21:05:35Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false |
dc.title.none.fl_str_mv |
A wikification prediction model based on the combination of latent, dyadic and monadic features Um modelo de previsão para Wikification baseado na combinação de atributos latentes, diádicos e monádicos |
title |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
spellingShingle |
A wikification prediction model based on the combination of latent, dyadic and monadic features Ferreira, Raoni Simões Aprendizado de máquina Fatoração matricial Link prediction Machine learning Matrix factorization Previsão de links Wikificação Wikification Wikipedia Wikipédia |
title_short |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
title_full |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
title_fullStr |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
title_full_unstemmed |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
title_sort |
A wikification prediction model based on the combination of latent, dyadic and monadic features |
author |
Ferreira, Raoni Simões |
author_facet |
Ferreira, Raoni Simões |
author_role |
author |
dc.contributor.none.fl_str_mv |
Pimentel, Maria da Graça Campos |
dc.contributor.author.fl_str_mv |
Ferreira, Raoni Simões |
dc.subject.por.fl_str_mv |
Aprendizado de máquina Fatoração matricial Link prediction Machine learning Matrix factorization Previsão de links Wikificação Wikification Wikipedia Wikipédia |
topic |
Aprendizado de máquina Fatoração matricial Link prediction Machine learning Matrix factorization Previsão de links Wikificação Wikification Wikipedia Wikipédia |
description |
Most of the reference information, nowadays, is found in repositories of documents semantically linked, created in a collaborative fashion and freely available in the web. Among the many problems faced by content providers in these repositories, one of the most important is Wikification, that is, the placement of links in the articles. These links have to support user navigation and should provide a deeper semantic interpretation of the content. Wikification is a hard task since the continuous growth of such repositories makes it increasingly demanding for editors. As consequence, they have their focus shifted from content creation, which should be their main objective. This has motivated the design of automatic Wikification tools which, traditionally, address two distinct problems: (a) how to identify which words (or phrases) in an article should be selected as anchors and (b) how to determine to which article the link, associated with the anchor, should point. Most of the methods in literature that address these problems are based on machine learning approaches which attempt to capture, through statistical features, characteristics of the concepts and its associations. Although these strategies handle the repository as a graph of concepts, normally they take limited advantage of the topological structure of this graph, as they describe it by means of human-engineered link statistical features. Despite the effectiveness of these machine learning methods, better models should take full advantage of the information topology if they describe it by means of data-oriented approaches such as matrix factorization. This indeed has been successfully done in other domains, such as movie recommendation. In this work, we fill this gap, proposing a wikification prediction model that combines the strengths of traditional predictors based on statistical features with a latent component which models the concept graph topology by means of matrix factorization. By comparing our model with a state-of-the-art wikification method, using a sample of Wikipedia articles, we obtained a gain up to 13% in F1 metric. We also provide a comprehensive analysis of the model performance showing the importance of the latent predictor component and the attributes derived from the associations between the concepts. The study still includes the analysis of the impact of ambiguous concepts, which allows us to conclude the model is resilient to ambiguity, even though does not include any explicitly disambiguation phase. We finally study the impact of selecting training samples from specific content quality classes, an information that is available in some respositories, such as Wikipedia. We empirically shown that the quality of the training samples impact on precision and overlinking, when comparing training performed using random quality samples versus high quality samples. |
publishDate |
2016 |
dc.date.none.fl_str_mv |
2016-04-25 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29112016-164654/ |
url |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29112016-164654/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Liberar o conteúdo para acesso público. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
_version_ |
1815257490988728320 |