A data-driven systematic, consistent and non-exhaustive approach to Model Selection

Bibliographic details
Main author: Diego Ribeiro Marcondes
Publication date: 2022
Document type: Doctoral thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://doi.org/10.11606/T.45.2022.tde-09082022-154351
Abstract: Modern science consists of conceiving a set of hypotheses to explain observable phenomena, confronting them with reality, and keeping as possible explanations only the hypotheses which have not yet been falsified. Such a set of hypotheses is called a model, hence an important step of the scientific method is to select a model. Under a Statistical Learning framework, this consists of selecting a model among candidates based on quantitative evidence, and then learning hypotheses on it by the minimization of an empirical risk function. The reason to select a model, rather than taking the union of the candidates as the hypothesis space, is the liability to overfitting, from which arises a complexity-bias trade-off. If we choose a highly complex model, it may contain hypotheses which explain the underlying process very well, but also hypotheses which merely explain the empirical data very well, and it is not clear how to separate the two, so we overfit the data. If we choose a simpler model, the hypotheses which best fit the empirical data may be the same ones that best explain the process, yet they may not explain it very well, since better hypotheses may exist outside the model, so there is a bias when learning on this model. Therefore, properly choosing the model is an important part of the solution of a learning problem, and it is performed via Model Selection. This thesis proposes a data-driven, systematic, consistent and non-exhaustive approach to Model Selection. The main feature of the approach is the collection of candidate models, which we call a Learning Space, and which, when seen as a set partially ordered by inclusion, may have a rich structure that enhances the quality of learning. The approach is data-driven since the only components chosen a priori are the Learning Space and the risk function; everything else is determined by the data.
It is systematic since it is constituted by a formal two-step procedure: select a model from the Learning Space and then learn hypotheses on it. From a statistical point of view, there is a target model among the candidates, namely the one with the lowest bias and complexity, and the approach is consistent since, as the sample size increases, the selected model converges to the target with probability one, and the estimation errors related to the learning of hypotheses on it converge in probability to zero. We establish U-curve properties of Learning Spaces which imply the existence of U-curve algorithms that can optimally estimate the target model without an exhaustive search, and which can also be implemented efficiently to obtain suboptimal solutions. The main implication of the approach is the existence of instances in which a lack of data may be mitigated by high computational power, a property which may underlie modern learning methods that demand high-performance computing. We illustrate the approach on simulated and real data: to learn on the important Partition Lattice Learning Space, to forecast binary sequences under a Markov Chain framework, to learn multilayer W-operators, and to filter binary images via the learning of interval Boolean functions.
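As a hedged illustration (not code from the thesis), the two steps above and the U-curve idea can be sketched on the simplest possible Learning Space, a chain of nested polynomial models ordered by inclusion: learn on each visited model by empirical risk minimization, walk up the chain, and stop at the first risk increase, which under a U-curve property certifies the minimizer without exhausting the chain. All data and function names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Step 2 (learning on a fixed model), sketched with least squares ---
# Hypothetical data: a cubic signal observed with noise.
x = np.linspace(-1, 1, 40)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)
x_val = np.linspace(-1, 1, 200)
y_val = x_val**3 - x_val          # noiseless stand-in for the process

def estimated_risk(degree):
    """Learn by empirical risk minimization (least squares) on the model
    of polynomials of degree <= `degree`, then estimate its risk."""
    coef = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coef, x_val) - y_val) ** 2))

# A too-simple model is biased: the degree-3 model explains the cubic
# process better than the degree-1 model.
print(estimated_risk(3) < estimated_risk(1))

# --- Step 1 (model selection), sketched on a chain of models ---
def u_curve_select(chain, risk):
    """Walk up a chain of nested models and stop at the first risk
    increase; under a U-curve property this reaches the target model
    without evaluating the whole chain."""
    best, best_risk = chain[0], risk(chain[0])
    for model in chain[1:]:
        r = risk(model)
        if r > best_risk:          # risk turned upward: stop the search
            break
        best, best_risk = model, r
    return best

# Toy U-shaped risk profile along a chain indexed by complexity 0..5;
# model 5 is never evaluated.
toy_risk = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.2, 4: 0.4, 5: 0.7}.__getitem__
print(u_curve_select(list(range(6)), toy_risk))  # → 3
```

The thesis works on general Learning Spaces (e.g. the partition lattice), where the search is over a partially ordered set rather than a single chain; the chain case above only conveys why a U-shaped risk profile lets the search stop early.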
id USP_e7ca91747706bed8fe793557448cfa0d
oai_identifier_str oai:teses.usp.br:tde-09082022-154351
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/doctoralThesis
A data-driven systematic, consistent and non-exhaustive approach to Model Selection
Uma abordagem sistemática, consistente e não-exaustiva para Seleção de Modelos baseada em dados
2022-07-14
Claudia Monteiro Peixoto; Junior Barrera; Ulisses de Mendonça Braga Neto; Claudio Landim; Marcelo da Silva Reis; Diego Ribeiro Marcondes
Universidade de São Paulo; Matemática Aplicada; USP; BR
Algoritmos U-curve; Aprendizado estatístico; Aprendizado PAC; Busca de arquiteturas de redes neurais; Cross validation; Model Selection; Neural architecture search; PAC learning; Partition lattice; Reticulado das partições; Seleção de Modelos; Statistical learning; Teoria VC; U-curve algorithms; Validação cruzada; VC theory; W-operadores; W-operators
https://doi.org/10.11606/T.45.2022.tde-09082022-154351 info:eu-repo/semantics/openAccess eng
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
2023-12-21T18:33:04Z oai:teses.usp.br:tde-09082022-154351
Biblioteca Digital de Teses e Dissertações http://www.teses.usp.br/ PUB http://www.teses.usp.br/cgi-bin/mtd2br.pl
virginia@if.usp.br || atendimento@aguia.usp.br || virginia@if.usp.br
opendoar:2721 2023-12-22T12:23:35.860092
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) false
dc.title.en.fl_str_mv A data-driven systematic, consistent and non-exhaustive approach to Model Selection
dc.title.alternative.pt.fl_str_mv Uma abordagem sistemática, consistente e não-exaustiva para Seleção de Modelos baseada em dados
title A data-driven systematic, consistent and non-exhaustive approach to Model Selection
spellingShingle A data-driven systematic, consistent and non-exhaustive approach to Model Selection
Diego Ribeiro Marcondes
title_short A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_full A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_fullStr A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_full_unstemmed A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_sort A data-driven systematic, consistent and non-exhaustive approach to Model Selection
author Diego Ribeiro Marcondes
author_facet Diego Ribeiro Marcondes
author_role author
dc.contributor.advisor1.fl_str_mv Claudia Monteiro Peixoto
dc.contributor.referee1.fl_str_mv Junior Barrera
dc.contributor.referee2.fl_str_mv Ulisses de Mendonça Braga Neto
dc.contributor.referee3.fl_str_mv Claudio Landim
dc.contributor.referee4.fl_str_mv Marcelo da Silva Reis
dc.contributor.author.fl_str_mv Diego Ribeiro Marcondes
contributor_str_mv Claudia Monteiro Peixoto
Junior Barrera
Ulisses de Mendonça Braga Neto
Claudio Landim
Marcelo da Silva Reis
description Modern science consists of conceiving a set of hypotheses to explain observable phenomena, confronting them with reality, and keeping as possible explanations only the hypotheses which have not yet been falsified. Such a set of hypotheses is called a model, hence an important step of the scientific method is to select a model. Under a Statistical Learning framework, this consists of selecting a model among candidates based on quantitative evidence, and then learning hypotheses on it by the minimization of an empirical risk function. The reason to select a model, rather than taking the union of the candidates as the hypothesis space, is the liability to overfitting, from which arises a complexity-bias trade-off. If we choose a highly complex model, it may contain hypotheses which explain the underlying process very well, but also hypotheses which merely explain the empirical data very well, and it is not clear how to separate the two, so we overfit the data. If we choose a simpler model, the hypotheses which best fit the empirical data may be the same ones that best explain the process, yet they may not explain it very well, since better hypotheses may exist outside the model, so there is a bias when learning on this model. Therefore, properly choosing the model is an important part of the solution of a learning problem, and it is performed via Model Selection. This thesis proposes a data-driven, systematic, consistent and non-exhaustive approach to Model Selection. The main feature of the approach is the collection of candidate models, which we call a Learning Space, and which, when seen as a set partially ordered by inclusion, may have a rich structure that enhances the quality of learning. The approach is data-driven since the only components chosen a priori are the Learning Space and the risk function; everything else is determined by the data.
It is systematic since it is constituted by a formal two-step procedure: select a model from the Learning Space and then learn hypotheses on it. From a statistical point of view, there is a target model among the candidates, namely the one with the lowest bias and complexity, and the approach is consistent since, as the sample size increases, the selected model converges to the target with probability one, and the estimation errors related to the learning of hypotheses on it converge in probability to zero. We establish U-curve properties of Learning Spaces which imply the existence of U-curve algorithms that can optimally estimate the target model without an exhaustive search, and which can also be implemented efficiently to obtain suboptimal solutions. The main implication of the approach is the existence of instances in which a lack of data may be mitigated by high computational power, a property which may underlie modern learning methods that demand high-performance computing. We illustrate the approach on simulated and real data: to learn on the important Partition Lattice Learning Space, to forecast binary sequences under a Markov Chain framework, to learn multilayer W-operators, and to filter binary images via the learning of interval Boolean functions.
publishDate 2022
dc.date.issued.fl_str_mv 2022-07-14
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.11606/T.45.2022.tde-09082022-154351
url https://doi.org/10.11606/T.45.2022.tde-09082022-154351
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade de São Paulo
dc.publisher.program.fl_str_mv Matemática Aplicada
dc.publisher.initials.fl_str_mv USP
dc.publisher.country.fl_str_mv BR
publisher.none.fl_str_mv Universidade de São Paulo
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1794502600414986240