Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections
Autor(a) principal: | |
---|---|
Data de Publicação: | 2013 |
Tipo de documento: | Tese |
Idioma: | eng |
Título da fonte: | Biblioteca Digital de Teses e Dissertações da USP |
Texto Completo: | http://www.teses.usp.br/teses/disponiveis/55/55134/tde-06052014-103312/ |
Resumo: | Topic hierarchies are efficient ways of organizing document collections. These structures help users to manage the knowledge contained in textual data. These hierarchies are usually obtained through unsupervised hierarchical clustering algorithms. By not considering the context of the user in the formation of the hierarchical groups, unsupervised topic hierarchies may not attend the user\'s expectations in some cases. One possible solution for this problem is to employ semi-supervised clustering algorithms. These algorithms incorporate the user\'s knowledge through the usage of constraints to the clustering process. However, in the context of semi-supervised hierarchical clustering, the works in the literature do not efficient explore the selection of cases (instances or cluster) to add constraints, neither the interaction of the user with the clustering process. In this sense, in this work we introduce two semi-supervised hierarchical clustering algorithms: HCAC (Hierarchical Confidence-based Active Clustering) and HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). These algorithms employ an active learning approach based in the confidence of cluster merges. When a low confidence merge is detected, the user is invited to decide, from a pool of candidate pairs of clusters, the best cluster merge in that point. In this work, we employ HCAC and HCAC-LC in the extraction of topic hierarchies through the SMITH framework, which is also proposed in this thesis. This framework provides a series of well defined activities that allow the user\'s interaction in the generation of topic hierarchies. The active learning approach used in the HCAC-based algorithms, the kind of queries employed in these algorithms, as well as the SMITH framework for the generation of semi-supervised topic hierarchies are innovations to the state of the art proposed in this thesis. Our experimental results indicate that HCAC and HCAC-LC outperform other semi-supervised hierarchical clustering algorithms in diverse scenarios. The results also indicate that semi-supervised topic hierarchies obtained through the SMITH framework are more intuitive and easier to navigate than unsupervised topic hierarchies |
id |
USP_40de1121b390324c815dcdeeeafa184f |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-06052014-103312 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
spelling |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collectionsAgrupamento hierárquico semissupervisionado ativo baseado em confiança e sua aplicação para extração de hierarquias de tópicos a partir de coleções de documentosActive learningAgrupamento semissupervisionadoAprendizado ativoHierarquias de tópicosSemi-supervised clusteringTopic hierarchiesTopic hierarchies are efficient ways of organizing document collections. These structures help users to manage the knowledge contained in textual data. These hierarchies are usually obtained through unsupervised hierarchical clustering algorithms. By not considering the context of the user in the formation of the hierarchical groups, unsupervised topic hierarchies may not attend the user\'s expectations in some cases. One possible solution for this problem is to employ semi-supervised clustering algorithms. These algorithms incorporate the user\'s knowledge through the usage of constraints to the clustering process. However, in the context of semi-supervised hierarchical clustering, the works in the literature do not efficient explore the selection of cases (instances or cluster) to add constraints, neither the interaction of the user with the clustering process. In this sense, in this work we introduce two semi-supervised hierarchical clustering algorithms: HCAC (Hierarchical Confidence-based Active Clustering) and HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). These algorithms employ an active learning approach based in the confidence of cluster merges. When a low confidence merge is detected, the user is invited to decide, from a pool of candidate pairs of clusters, the best cluster merge in that point. In this work, we employ HCAC and HCAC-LC in the extraction of topic hierarchies through the SMITH framework, which is also proposed in this thesis. This framework provides a series of well defined activities that allow the user\'s interaction in the generation of topic hierarchies. The active learning approach used in the HCAC-based algorithms, the kind of queries employed in these algorithms, as well as the SMITH framework for the generation of semi-supervised topic hierarchies are innovations to the state of the art proposed in this thesis. Our experimental results indicate that HCAC and HCAC-LC outperform other semi-supervised hierarchical clustering algorithms in diverse scenarios. The results also indicate that semi-supervised topic hierarchies obtained through the SMITH framework are more intuitive and easier to navigate than unsupervised topic hierarchiesHierarquias de tópicos são formas eficientes de organização de coleções de documentos, auxiliando usuários a gerir o conhecimento materializado nessas publicações textuais. Tais hierarquias são usualmente construídas por meio de algoritmos de agrupamento hierárquico não supervisionado. Entretanto, por não considerarem o contexto do usuário na formação dos grupos, hierarquias de tópicos não supervisionadas nem sempre conseguem atender as suas expectativas. Uma solução para este problema e o emprego de algoritmos de agrupamento semissupervisionado, os quais incorporam o conhecimento de domínio do usuário por meio de restrições. Entretanto, para o contexto de agrupamento hierárquico semissupervisionado, não são eficientemente explorados na literatura métodos de seleção de casos (instâncias ou grupos) para receber restrições, bem como não há formas eficientes de interação do usuário com o processo de agrupamento hierárquico. Dessa maneira, neste trabalho, dois algoritmos de agrupamento hierárquico semissupervisionado são propostos: HCAC (Hierarchical Confidence-based Active Clustering) e HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). Estes algoritmos empregam uma abordagem de aprendizado ativo baseado na confiança de uma junção de clusters. Quando uma junção de baixa confiança e detectada, o usuário e convidado a decidir, em um conjunto de pares de grupos candidatos, a melhor junção naquele ponto. Estes algoritmos são aqui utilizados na extração de hierarquias de tópicos por meio do framework SMITH, também proposto nesse trabalho. Este framework fornece uma série de atividades bem definidas que possibilitam a interação do usuário para a obtenção de hierarquias de tópicos. A abordagem de aprendizado ativo utilizado nos algoritmos HCAC e HCAC-LC, o tipo de restrição utilizada nestes algoritmos, bem como o framework SMITH para obtenção de hierarquias de tópicos semissupervisionadas são inovações ao estado da arte propostos neste trabalho. Os resultados obtidos indicam que os algoritmos HCAC e HCAC-LC superam o desempenho de outros algoritmos hierárquicos semissupervisionados em diversos cenários. Os resultados também indicam que hierarquias de tópico semissupervisionadas obtidas por meio do framework SMITH são mais intuitivas e fáceis de navegar do que aquelas não supervisionadasBiblioteca Digitais de Teses e Dissertações da USPJorge, Alípio Mário GuedesRezende, Solange OliveiraNogueira, Bruno Magalhães2013-12-16info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisapplication/pdfhttp://www.teses.usp.br/teses/disponiveis/55/55134/tde-06052014-103312/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2016-07-28T16:11:49Zoai:teses.usp.br:tde-06052014-103312Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212016-07-28T16:11:49Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false |
dc.title.none.fl_str_mv |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections Agrupamento hierárquico semissupervisionado ativo baseado em confiança e sua aplicação para extração de hierarquias de tópicos a partir de coleções de documentos |
title |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
spellingShingle |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections Nogueira, Bruno Magalhães Active learning Agrupamento semissupervisionado Aprendizado ativo Hierarquias de tópicos Semi-supervised clustering Topic hierarchies |
title_short |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
title_full |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
title_fullStr |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
title_full_unstemmed |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
title_sort |
Hierarchical semi-supervised confidence-based active clustering and its application to the extraction of topic hierarchies from document collections |
author |
Nogueira, Bruno Magalhães |
author_facet |
Nogueira, Bruno Magalhães |
author_role |
author |
dc.contributor.none.fl_str_mv |
Jorge, Alípio Mário Guedes Rezende, Solange Oliveira |
dc.contributor.author.fl_str_mv |
Nogueira, Bruno Magalhães |
dc.subject.por.fl_str_mv |
Active learning Agrupamento semissupervisionado Aprendizado ativo Hierarquias de tópicos Semi-supervised clustering Topic hierarchies |
topic |
Active learning Agrupamento semissupervisionado Aprendizado ativo Hierarquias de tópicos Semi-supervised clustering Topic hierarchies |
description |
Topic hierarchies are efficient ways of organizing document collections. These structures help users to manage the knowledge contained in textual data. These hierarchies are usually obtained through unsupervised hierarchical clustering algorithms. By not considering the context of the user in the formation of the hierarchical groups, unsupervised topic hierarchies may not attend the user\'s expectations in some cases. One possible solution for this problem is to employ semi-supervised clustering algorithms. These algorithms incorporate the user\'s knowledge through the usage of constraints to the clustering process. However, in the context of semi-supervised hierarchical clustering, the works in the literature do not efficient explore the selection of cases (instances or cluster) to add constraints, neither the interaction of the user with the clustering process. In this sense, in this work we introduce two semi-supervised hierarchical clustering algorithms: HCAC (Hierarchical Confidence-based Active Clustering) and HCAC-LC (Hierarchical Confidence-based Active Clustering with Limited Constraints). These algorithms employ an active learning approach based in the confidence of cluster merges. When a low confidence merge is detected, the user is invited to decide, from a pool of candidate pairs of clusters, the best cluster merge in that point. In this work, we employ HCAC and HCAC-LC in the extraction of topic hierarchies through the SMITH framework, which is also proposed in this thesis. This framework provides a series of well defined activities that allow the user\'s interaction in the generation of topic hierarchies. The active learning approach used in the HCAC-based algorithms, the kind of queries employed in these algorithms, as well as the SMITH framework for the generation of semi-supervised topic hierarchies are innovations to the state of the art proposed in this thesis. Our experimental results indicate that HCAC and HCAC-LC outperform other semi-supervised hierarchical clustering algorithms in diverse scenarios. The results also indicate that semi-supervised topic hierarchies obtained through the SMITH framework are more intuitive and easier to navigate than unsupervised topic hierarchies |
publishDate |
2013 |
dc.date.none.fl_str_mv |
2013-12-16 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-06052014-103312/ |
url |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-06052014-103312/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Liberar o conteúdo para acesso público. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
_version_ |
1815256713907929088 |