Coarse-refinement dilemma: on generalization bounds for data clustering
Autor(a) principal: | |
---|---|
Data de Publicação: | 2020 |
Tipo de documento: | Tese |
Idioma: | eng |
Título da fonte: | Biblioteca Digital de Teses e Dissertações da USP |
Texto Completo: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-18112020-130229/ |
Resumo: | Machine Learning (ML) is typically organized into two main paradigms: (i) the Supervised Machine Learning (SML) to identify patterns from pre-labeled data, in which a loss function is used to adapt the corresponding model; and, (ii) the Unsupervised Machine Learning (UML) to organize data points in the absence of labels, taking similarity relations among elements into account. SML relies on well-consolidated theoretical frameworks, such as the Statistical Learning Theory (SLT) and the Algorithmic Stability (AS) to define assumptions, properties and convergence guarantees, allowing the comparison of different methods and, consequently, their improvements. Complementary, UML has been supported by investigations on Data Clustering (DC) and Hierarchical Clustering (HC) in order to define properties and improve their characterizations. Specifically, Kleinberg stated richness, scale-invariance and partition consistency as the necessary properties to define the DC problem, proving they do not hold simultaneously, while Ackerman, Ben-David and Loker explored other properties such as locality, and Carlsson and Mémoli developed stability and consistency frameworks for HC from metric spaces. To bring an additional contribution to UML, we considered topological spaces to design more general theoretical results given: (i) the invariance on topological spaces, more precisely isomorphism of homology groups, guarantees the properties of scale-invariance, partition consistency and locality; and (ii) this same invariance is inherited along less general spaces, such as the metric, thus allowing a more abstract clustering representation. Taking such invariance into account, we demonstrated that over-refined topologies endowed by DC and HC models lead to non-consistency in terms of their associated homology groups and, on the other hand, over-coarsed topologies devise consistent but unrepresentative homology groups, a phenomenon that we referred to as the Coarse-Refinement Dilemma (CRD). We then formulated DC and HC problems by employing Carlsson and Zomorodians bidimensional persistent homology, with the first dimension corresponding to the HC levels and the second to the inclusion of new data, thus allowing a probabilistic study based on martingales process and subsequent formalization of generalization bounds. From such results, we contributed with the related work by: (i) defining lower and upper bounds for Carlsson and Mémolis metric consistency; (ii) showing that Kleinbergs richness axiom must be relaxed otherwise over-refined or over-coarsed clusterings could be obtained; and, finally, (iii) defining unexpected changes in consistent topologies using what we named as Topological Concept Drift (TCD). An extensive set of experiments was performed to analyze the CRD and the TCD, including a brief study of a real-world scenario involving text documents. Results corroborated the usefulness in representing DC and HC problems using topological spaces, in detecting topology changes and the existence of CRD. |
id |
USP_9cf5098d6fc36b123f055aa587f46e18 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-18112020-130229 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
spelling |
Coarse-refinement dilemma: on generalization bounds for data clusteringDilema do sobre-refinamento: limites de generalização para agrupamento de dadosAgrupamento de dadosAgrupamento hierarquicoAlgorithmic stabilityData clusteringEstabilidade algorítmicaHierarchical clusteringHomologia persistenteMudança de conceito topológicoPersistent homologyTopological concept driftMachine Learning (ML) is typically organized into two main paradigms: (i) the Supervised Machine Learning (SML) to identify patterns from pre-labeled data, in which a loss function is used to adapt the corresponding model; and, (ii) the Unsupervised Machine Learning (UML) to organize data points in the absence of labels, taking similarity relations among elements into account. SML relies on well-consolidated theoretical frameworks, such as the Statistical Learning Theory (SLT) and the Algorithmic Stability (AS) to define assumptions, properties and convergence guarantees, allowing the comparison of different methods and, consequently, their improvements. Complementary, UML has been supported by investigations on Data Clustering (DC) and Hierarchical Clustering (HC) in order to define properties and improve their characterizations. Specifically, Kleinberg stated richness, scale-invariance and partition consistency as the necessary properties to define the DC problem, proving they do not hold simultaneously, while Ackerman, Ben-David and Loker explored other properties such as locality, and Carlsson and Mémoli developed stability and consistency frameworks for HC from metric spaces. To bring an additional contribution to UML, we considered topological spaces to design more general theoretical results given: (i) the invariance on topological spaces, more precisely isomorphism of homology groups, guarantees the properties of scale-invariance, partition consistency and locality; and (ii) this same invariance is inherited along less general spaces, such as the metric, thus allowing a more abstract clustering representation. Taking such invariance into account, we demonstrated that over-refined topologies endowed by DC and HC models lead to non-consistency in terms of their associated homology groups and, on the other hand, over-coarsed topologies devise consistent but unrepresentative homology groups, a phenomenon that we referred to as the Coarse-Refinement Dilemma (CRD). We then formulated DC and HC problems by employing Carlsson and Zomorodians bidimensional persistent homology, with the first dimension corresponding to the HC levels and the second to the inclusion of new data, thus allowing a probabilistic study based on martingales process and subsequent formalization of generalization bounds. From such results, we contributed with the related work by: (i) defining lower and upper bounds for Carlsson and Mémolis metric consistency; (ii) showing that Kleinbergs richness axiom must be relaxed otherwise over-refined or over-coarsed clusterings could be obtained; and, finally, (iii) defining unexpected changes in consistent topologies using what we named as Topological Concept Drift (TCD). An extensive set of experiments was performed to analyze the CRD and the TCD, including a brief study of a real-world scenario involving text documents. Results corroborated the usefulness in representing DC and HC problems using topological spaces, in detecting topology changes and the existence of CRD.O Aprendizado de Máquina é tipicamente organizado em dois principais paradigmas: (i) o Aprendizado de Máquina Supervisionado (do inglês Supervised Machine Learning SML) que identifica padrões de dados pré-rotulados, empregando uma função de perda para adaptar o modelo estatístico correspondente; e, (ii) o Aprendizado de Máquina Não-supervisionado (do inglês Unsupervised Machine Learning UML) que organiza dados não rotulados por meio de relações de similaridade. O SML é suportado por arcabouços teóricos bem consolidados tais como a Teoria do Aprendizado Estatístico e a Estabilidade Algorítmica as quais definem suas principais suposições, propriedades e garantias de convergência, permitindo a comparação de diferentes métodos de SML e, consequentemente, seus aperfeiçoamentos. Complementarmente, UML conta com investigações sobre Agrupamento de Dados (do inglês Data Clustering DC) e Agrupamento Hierárquico de Dados (do inglês Hierarchical Clustering HC) definindo propriedades e caracterizando DC e HC de maneira mais adequada. Especificamente, Kleinberg estabeleceu riqueza, invariância à escala e consistência de partição como propriedades necessárias para a definição do problema de DC, provando a impossibilidade de satisfazê-las simultaneamente. Ackerman, Ben-David e Loker exploraram outras propriedades tais como a localidade, e Carlsson e Mémoli relataram resultados sobre a estabilidade e consistência para HC baseando-se na distância entre espaços métricos. A fim de complementar tais trabalhos, considera-se nesta tese de doutorado espaços topológicos para desenvolver um arcabouço teórico mais geral dado que: (i) a invariância de espaços topológicos, isomorfismos de grupos homológicos mais precisamente, garante as propriedades de invariância à escala, consistência de partição e localidade; e (ii) essa mesma invariância é herdada ao longo de espaços menos gerais, tal como o métrico, permitindo uma representação mais abstrata de agrupamento. Nesse contexto, foi demonstrado que topologias sobre-refinadas adotadas por modelos de DC e HC levam à não-consistência dos grupos homológicos associados. Por outro lado, topologias muito grosseiras levam à consistência, porém produzem grupos homológicos sem representatividade; nomeamos tal dilema como Coarse-Refinement Dilemma (CRD). Os problemas de DC e HC foram então formulados empregando a homologia persistente bidimensional de Carlsson e Zomorodian, sendo a primeira dimensão correspondente às hierarquias do HC e a segunda às inclusões de novos dados, o que permitiu o estudo probabilístico por meio de processos de martingales e subsequente formalização de um limite de generalização para DC e HC. Por meio desse resultado, complementamos os trabalhos relacionados: (i) definindo limitantes inferiores e superiores para a consistência métrica de Carlsson e Mémoli; (ii) mostrando que o axioma da riqueza de Kleinberg deve ser relaxado, caso contrário agrupamentos sobre-refinados e sobre-grosseiros serão obtidos; e, finalmente, (iii) definindo mudanças inesperadas em topologias consistentes como Mudanças Topológicas de Conceito (do inglês Topological Concept Drifts TCD). Um extenso conjunto de experimentos foi executado para analisar o CRD e o TCD, incluindo um breve estudo de um cenário real envolvendo documentos textuais. Resultados corroboraram com a empregabilidade de espaços topológicos para a representação de problemas de DC e HC, sendo possível detectar mudanças topológicas e a existência do CRD.Biblioteca Digitais de Teses e Dissertações da USPMello, Rodrigo Fernandes deVaz, Yule2020-10-01info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/doctoralThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/55/55134/tde-18112020-130229/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2020-11-18T18:11:02Zoai:teses.usp.br:tde-18112020-130229Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.bropendoar:27212020-11-18T18:11:02Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false |
dc.title.none.fl_str_mv |
Coarse-refinement dilemma: on generalization bounds for data clustering Dilema do sobre-refinamento: limites de generalização para agrupamento de dados |
title |
Coarse-refinement dilemma: on generalization bounds for data clustering |
spellingShingle |
Coarse-refinement dilemma: on generalization bounds for data clustering Vaz, Yule Agrupamento de dados Agrupamento hierarquico Algorithmic stability Data clustering Estabilidade algorítmica Hierarchical clustering Homologia persistente Mudança de conceito topológico Persistent homology Topological concept drift |
title_short |
Coarse-refinement dilemma: on generalization bounds for data clustering |
title_full |
Coarse-refinement dilemma: on generalization bounds for data clustering |
title_fullStr |
Coarse-refinement dilemma: on generalization bounds for data clustering |
title_full_unstemmed |
Coarse-refinement dilemma: on generalization bounds for data clustering |
title_sort |
Coarse-refinement dilemma: on generalization bounds for data clustering |
author |
Vaz, Yule |
author_facet |
Vaz, Yule |
author_role |
author |
dc.contributor.none.fl_str_mv |
Mello, Rodrigo Fernandes de |
dc.contributor.author.fl_str_mv |
Vaz, Yule |
dc.subject.por.fl_str_mv |
Agrupamento de dados Agrupamento hierarquico Algorithmic stability Data clustering Estabilidade algorítmica Hierarchical clustering Homologia persistente Mudança de conceito topológico Persistent homology Topological concept drift |
topic |
Agrupamento de dados Agrupamento hierarquico Algorithmic stability Data clustering Estabilidade algorítmica Hierarchical clustering Homologia persistente Mudança de conceito topológico Persistent homology Topological concept drift |
description |
Machine Learning (ML) is typically organized into two main paradigms: (i) the Supervised Machine Learning (SML) to identify patterns from pre-labeled data, in which a loss function is used to adapt the corresponding model; and, (ii) the Unsupervised Machine Learning (UML) to organize data points in the absence of labels, taking similarity relations among elements into account. SML relies on well-consolidated theoretical frameworks, such as the Statistical Learning Theory (SLT) and the Algorithmic Stability (AS) to define assumptions, properties and convergence guarantees, allowing the comparison of different methods and, consequently, their improvements. Complementary, UML has been supported by investigations on Data Clustering (DC) and Hierarchical Clustering (HC) in order to define properties and improve their characterizations. Specifically, Kleinberg stated richness, scale-invariance and partition consistency as the necessary properties to define the DC problem, proving they do not hold simultaneously, while Ackerman, Ben-David and Loker explored other properties such as locality, and Carlsson and Mémoli developed stability and consistency frameworks for HC from metric spaces. To bring an additional contribution to UML, we considered topological spaces to design more general theoretical results given: (i) the invariance on topological spaces, more precisely isomorphism of homology groups, guarantees the properties of scale-invariance, partition consistency and locality; and (ii) this same invariance is inherited along less general spaces, such as the metric, thus allowing a more abstract clustering representation. Taking such invariance into account, we demonstrated that over-refined topologies endowed by DC and HC models lead to non-consistency in terms of their associated homology groups and, on the other hand, over-coarsed topologies devise consistent but unrepresentative homology groups, a phenomenon that we referred to as the Coarse-Refinement Dilemma (CRD). We then formulated DC and HC problems by employing Carlsson and Zomorodians bidimensional persistent homology, with the first dimension corresponding to the HC levels and the second to the inclusion of new data, thus allowing a probabilistic study based on martingales process and subsequent formalization of generalization bounds. From such results, we contributed with the related work by: (i) defining lower and upper bounds for Carlsson and Mémolis metric consistency; (ii) showing that Kleinbergs richness axiom must be relaxed otherwise over-refined or over-coarsed clusterings could be obtained; and, finally, (iii) defining unexpected changes in consistent topologies using what we named as Topological Concept Drift (TCD). An extensive set of experiments was performed to analyze the CRD and the TCD, including a brief study of a real-world scenario involving text documents. Results corroborated the usefulness in representing DC and HC problems using topological spaces, in detecting topology changes and the existence of CRD. |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-10-01 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-18112020-130229/ |
url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-18112020-130229/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Liberar o conteúdo para acesso público. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digitais de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
_version_ |
1815257118119297024 |