A data-driven systematic, consistent and non-exhaustive approach to Model Selection

Bibliographic details
Main author: Diego Ribeiro Marcondes
Publication date: 2022
Document type: Doctoral thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://doi.org/10.11606/T.45.2022.tde-09082022-154351
Abstract: Modern science consists of conceiving a set of hypotheses to explain observable phenomena, confronting them with reality, and keeping as possible explanations only the hypotheses which have not yet been falsified. Such a set of hypotheses is called a model, hence an important step of the scientific method is to select a model. Under a Statistical Learning framework, this consists of selecting a model among candidates based on quantitative evidence, and then learning hypotheses on it by the minimization of an empirical risk function. The reason to select a model, rather than taking the union of the candidates as the hypothesis space, is the liability to overfitting, from which arises a complexity-bias trade-off. If we choose a highly complex model, it may contain hypotheses which explain the underlying process very well, but also hypotheses which merely explain the empirical data very well, and it is not clear how to separate the two, so we overfit the data. If we choose a simpler model, the hypotheses which best fit the empirical data may be the same ones that best explain the process, yet they may not explain it very well, since better hypotheses may exist outside the model, so there is a bias when learning on this model. Therefore, properly choosing the model is an important part of the solution of a learning problem, and it is performed via Model Selection. This thesis proposes a data-driven, systematic, consistent and non-exhaustive approach to Model Selection. The main feature of the approach is the collection of candidate models, which we call a Learning Space, and which, when seen as a set partially ordered by inclusion, may have a rich structure that enhances the quality of learning. The approach is data-driven since the only components chosen a priori are the Learning Space and the risk function; everything else is determined by the data.
It is systematic since it is constituted by a formal two-step procedure: select a model from the Learning Space and then learn hypotheses on it. From a statistical point of view, there is a target model among the candidates, namely the one with the lowest bias and complexity, and the approach is consistent since, as the sample size increases, the selected model converges to the target with probability one, and the estimation errors related to the learning of hypotheses on it converge in probability to zero. We establish U-curve properties of Learning Spaces which imply the existence of U-curve algorithms that can optimally estimate the target model without an exhaustive search, and which can also be implemented efficiently to obtain suboptimal solutions. The main implication of the approach is the existence of instances in which a lack of data may be mitigated by high computational power, a property which may underlie modern learning methods that demand high-performance computing. We illustrate the approach on simulated and real data: to learn on the important Partition Lattice Learning Space, to forecast binary sequences under a Markov Chain framework, to learn multilayer W-operators, and to filter binary images via the learning of interval Boolean functions.
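As a hedged illustration (not code from the thesis), the two steps above and the U-curve idea can be sketched on the simplest possible Learning Space, a chain of nested polynomial models ordered by inclusion: learn on each visited model by empirical risk minimization, walk up the chain, and stop at the first risk increase, which under a U-curve property certifies the minimizer without exhausting the chain. All data and function names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Step 2 (learning on a fixed model), sketched with least squares ---
# Hypothetical data: a cubic signal observed with noise.
x = np.linspace(-1, 1, 40)
y = x**3 - x + rng.normal(scale=0.1, size=x.size)
x_val = np.linspace(-1, 1, 200)
y_val = x_val**3 - x_val          # noiseless stand-in for the process

def estimated_risk(degree):
    """Learn by empirical risk minimization (least squares) on the model
    of polynomials of degree <= `degree`, then estimate its risk."""
    coef = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coef, x_val) - y_val) ** 2))

# A too-simple model is biased: the degree-3 model explains the cubic
# process better than the degree-1 model.
print(estimated_risk(3) < estimated_risk(1))

# --- Step 1 (model selection), sketched on a chain of models ---
def u_curve_select(chain, risk):
    """Walk up a chain of nested models and stop at the first risk
    increase; under a U-curve property this reaches the target model
    without evaluating the whole chain."""
    best, best_risk = chain[0], risk(chain[0])
    for model in chain[1:]:
        r = risk(model)
        if r > best_risk:          # risk turned upward: stop the search
            break
        best, best_risk = model, r
    return best

# Toy U-shaped risk profile along a chain indexed by complexity 0..5;
# model 5 is never evaluated.
toy_risk = {0: 0.9, 1: 0.5, 2: 0.3, 3: 0.2, 4: 0.4, 5: 0.7}.__getitem__
print(u_curve_select(list(range(6)), toy_risk))  # → 3
```

The thesis works on general Learning Spaces (e.g. the partition lattice), where the search is over a partially ordered set rather than a single chain; the chain case above only conveys why a U-shaped risk profile lets the search stop early.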
id USP_e7ca91747706bed8fe793557448cfa0d
oai_identifier_str oai:teses.usp.br:tde-09082022-154351
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling info:eu-repo/semantics/publishedVersion info:eu-repo/semantics/doctoralThesis
A data-driven systematic, consistent and non-exhaustive approach to Model Selection
Uma abordagem sistemática, consistente e não-exaustiva para Seleção de Modelos baseada em dados
2022-07-14
Claudia Monteiro Peixoto; Junior Barrera; Ulisses de Mendonça Braga Neto; Claudio Landim; Marcelo da Silva Reis; Diego Ribeiro Marcondes
Universidade de São Paulo; Matemática Aplicada; USP; BR
Algoritmos U-curve; Aprendizado estatístico; Aprendizado PAC; Busca de arquiteturas de redes neurais; Cross validation; Model Selection; Neural architecture search; PAC learning; Partition lattice; Reticulado das partições; Seleção de Modelos; Statistical learning; Teoria VC; U-curve algorithms; Validação cruzada; VC theory; W-operadores; W-operators
https://doi.org/10.11606/T.45.2022.tde-09082022-154351 info:eu-repo/semantics/openAccess eng
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
2023-12-21T18:33:04Z oai:teses.usp.br:tde-09082022-154351
Biblioteca Digital de Teses e Dissertações http://www.teses.usp.br/ PUB http://www.teses.usp.br/cgi-bin/mtd2br.pl
virginia@if.usp.br || atendimento@aguia.usp.br || virginia@if.usp.br
opendoar:2721 2023-12-22T12:23:35.860092
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) false
dc.title.en.fl_str_mv A data-driven systematic, consistent and non-exhaustive approach to Model Selection
dc.title.alternative.pt.fl_str_mv Uma abordagem sistemática, consistente e não-exaustiva para Seleção de Modelos baseada em dados
title A data-driven systematic, consistent and non-exhaustive approach to Model Selection
spellingShingle A data-driven systematic, consistent and non-exhaustive approach to Model Selection
Diego Ribeiro Marcondes
title_short A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_full A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_fullStr A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_full_unstemmed A data-driven systematic, consistent and non-exhaustive approach to Model Selection
title_sort A data-driven systematic, consistent and non-exhaustive approach to Model Selection
author Diego Ribeiro Marcondes
author_facet Diego Ribeiro Marcondes
author_role author
dc.contributor.advisor1.fl_str_mv Claudia Monteiro Peixoto
dc.contributor.referee1.fl_str_mv Junior Barrera
dc.contributor.referee2.fl_str_mv Ulisses de Mendonça Braga Neto
dc.contributor.referee3.fl_str_mv Claudio Landim
dc.contributor.referee4.fl_str_mv Marcelo da Silva Reis
dc.contributor.author.fl_str_mv Diego Ribeiro Marcondes
contributor_str_mv Claudia Monteiro Peixoto
Junior Barrera
Ulisses de Mendonça Braga Neto
Claudio Landim
Marcelo da Silva Reis
description Modern science consists of conceiving a set of hypotheses to explain observable phenomena, confronting them with reality, and keeping as possible explanations only the hypotheses which have not yet been falsified. Such a set of hypotheses is called a model, hence an important step of the scientific method is to select a model. Under a Statistical Learning framework, this consists of selecting a model among candidates based on quantitative evidence, and then learning hypotheses on it by the minimization of an empirical risk function. The reason to select a model, rather than taking the union of the candidates as the hypothesis space, is the liability to overfitting, from which arises a complexity-bias trade-off. If we choose a highly complex model, it may contain hypotheses which explain the underlying process very well, but also hypotheses which merely explain the empirical data very well, and it is not clear how to separate the two, so we overfit the data. If we choose a simpler model, the hypotheses which best fit the empirical data may be the same ones that best explain the process, yet they may not explain it very well, since better hypotheses may exist outside the model, so there is a bias when learning on this model. Therefore, properly choosing the model is an important part of the solution of a learning problem, and it is performed via Model Selection. This thesis proposes a data-driven, systematic, consistent and non-exhaustive approach to Model Selection. The main feature of the approach is the collection of candidate models, which we call a Learning Space, and which, when seen as a set partially ordered by inclusion, may have a rich structure that enhances the quality of learning. The approach is data-driven since the only components chosen a priori are the Learning Space and the risk function; everything else is determined by the data.
It is systematic since it is constituted by a formal two-step procedure: select a model from the Learning Space and then learn hypotheses on it. From a statistical point of view, there is a target model among the candidates, namely the one with the lowest bias and complexity, and the approach is consistent since, as the sample size increases, the selected model converges to the target with probability one, and the estimation errors related to the learning of hypotheses on it converge in probability to zero. We establish U-curve properties of Learning Spaces which imply the existence of U-curve algorithms that can optimally estimate the target model without an exhaustive search, and which can also be implemented efficiently to obtain suboptimal solutions. The main implication of the approach is the existence of instances in which a lack of data may be mitigated by high computational power, a property which may underlie modern learning methods that demand high-performance computing. We illustrate the approach on simulated and real data: to learn on the important Partition Lattice Learning Space, to forecast binary sequences under a Markov Chain framework, to learn multilayer W-operators, and to filter binary images via the learning of interval Boolean functions.
publishDate 2022
dc.date.issued.fl_str_mv 2022-07-14
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.11606/T.45.2022.tde-09082022-154351
url https://doi.org/10.11606/T.45.2022.tde-09082022-154351
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade de São Paulo
dc.publisher.program.fl_str_mv Matemática Aplicada
dc.publisher.initials.fl_str_mv USP
dc.publisher.country.fl_str_mv BR
publisher.none.fl_str_mv Universidade de São Paulo
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1794502600414986240