Analytical variation in the generalization of deep feed-forward neural networks

Bibliographic details
Main author: Neves, Carlos Guatimosim
Advisor: Vicente, Renato
Publication date: 2021-01-26
Document type: Master's thesis (published version)
Language: English
Subjects: Analytical learning; Artificial intelligence; Deep learning; Generalization error; Generalization gap; Generalization theory; Hardy-Krause variation; Machine learning; Neural networks; Regularization; Statistical learning
Source title: Biblioteca Digital de Teses e Dissertações da USP
Institution: Universidade de São Paulo (USP)
Rights: Open access
Format: application/pdf
Full text: https://www.teses.usp.br/teses/disponiveis/45/45132/tde-19042021-202404/
Abstract: The essence of Machine Learning modelling may be summarized as follows: to unravel the implicit pattern in the data while having access to only a finite number of samples. The theory that studies this process is rich, and two quantities are of particular importance: the performance errors inside and outside the sample. The error inside the sample is called the training error, and is calculated over the set used to optimize the model's parameters. The outside error is the average error over all samples, and may be understood as the true error. Although the final objective is to construct a model with a low true error, we only have access to the training error, which is an empirical estimate. Thus, in order to deduce the general pattern in the data, it is necessary that the two be similar. The distance between these errors is called the generalization gap, and much of the theory is dedicated to studying its properties and upper bounds, so that we may understand under which circumstances it is controlled. The gap is a measure of the model's ability to properly induce the global pattern, and is a major topic in all applied instances of Machine Learning.

The classical view of statistics correlates the generalization property with the model's capacity to fit patterns in the data. The reasoning is that, being capable of fitting many configurations, the model is prone to reading noise in the sample, and thus to performing poorly in general. However, the definition of a model's complexity is loose, and while it usually translates to the number of parameters, there is one hypothesis space that seemingly escapes this intuition: that of Neural Networks. Using Neural Networks with many layers (Deep Learning) is proving to be the best modelling paradigm for many benchmark problems, and many of the advances in industry are due to their success. This seemingly goes against what classical Statistical Learning theory states about generalization and complexity, since deep networks are capable of fitting many patterns; indeed, there have been experiments showing that networks may fit even random labels. This apparent paradox is an open question in the field and the main topic of this work.

After an introduction and an overview of the classical understanding of generalization, we introduce the work of [20], which is central to our contributions. In it, a new approach named Analytical Learning is proposed, aiming to complement the classical one and hopefully bring some insight into the apparent contradiction that emerged from Deep Learning. Instead of analyzing probabilistic bounds, that paper studies the generalization gap in a context where the predictor and the dataset are fixed. By doing so, the pessimistic cases are avoided, and a tighter bound may be achieved. Additionally, it provides a more realistic scenario, since in practice the data is usually given. The main result of [20] bounds the gap by a term related to the data and another related to the Hardy-Krause variation of the loss function. Our main contribution revolves around tracing similarities between this variation term and the stability concept studied in the classical approach to generalization, drawing parallels with what may be understood in the Analytical case as information. The main idea is that the variation of the loss function decreases if the partial derivatives of the predictor, taken with respect to the instance space, are close to the oracle's.
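To fix notation, here is a minimal sketch of the two errors and of the gap between them:

```latex
\widehat{R}_S(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(f(x_i), y_i\bigr),
\qquad
R(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\bigl[\ell(f(x), y)\bigr],
\qquad
\mathrm{gap}(f, S) = \bigl|R(f) - \widehat{R}_S(f)\bigr|.
```

The prototype of a bound of the kind proved in [20] is the classical Koksma-Hlawka inequality, which, for a fixed integrand g on the unit cube, controls exactly this sort of average-versus-integral difference by a data term (the star discrepancy of the sample) times the Hardy-Krause variation of the integrand (the exact statement in [20], in particular its data term, may differ):

```latex
\Bigl|\frac{1}{n}\sum_{i=1}^{n} g(u_i) - \int_{[0,1]^d} g(u)\,\mathrm{d}u\Bigr|
\;\le\; D_n^{*}(u_1,\dots,u_n)\, V_{\mathrm{HK}}(g).
```

With g = ℓ ∘ f playing the role of the loss surface (after mapping the instance space to the cube), a predictor whose partial derivatives track the oracle's keeps the Hardy-Krause variation, and hence the gap, small.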
The derivatives, in this sense, may be understood as how much information the function is reading, since they measure the impact of a given dimension on a local prediction. Thus, if the predictor reads information similarly to the oracle, a low gap is guaranteed. With this, we argue that the partial derivatives of the predictor are the main measure of regularization in the analytical sense. One advantage of this view is its simplicity: by rewriting the SGD (Stochastic Gradient Descent) optimization step in function space, we obtain an easy way to investigate the evolution of the model's complexity during training.

Furthermore, we use this interpretation to build on two relatively recent papers that tackle the generalization paradox in deep learning, [28] and [37]. For the former we give an extensive analysis, while for the latter we take a briefer, qualitative approach, showing how our interpretation relates to their result. In [28], the complexity of networks is studied through the lens of Fourier theory. It is shown there that the space of ReLU (Rectified Linear Unit) networks exhibits a strong spectral decay: during optimization, the increments in the k-th harmonic caused by the weight updates fall off at least as fast as 1/k^2. This means that the magnitudes of high frequencies in this space are naturally damped during training, suggesting an inherent regularization property. However, at no point does [28] mention the generalization gap, so it is not clear whether the spectral decay is enough to guarantee a good estimate of the true error. Motivated by this, we show a bound using the Hardy-Krause variation of splines which decreases with the degree, justifying the special properties of the ReLU activation function.

In [37], the main theorem shows that if the architecture of the network follows a funnel pattern (the number of neurons decreases as we go deeper), then increasing the number of layers actually reduces the generalization gap, thus supporting the deep learning approach. This happens because the funnel-like architecture forces a nontrivial kernel in the linear transformations, which translates into a loss of information. As the number of layers increases, the information shared between the final layer and the dataset decreases, making the prediction less data-dependent and thus regularized. This result relates closely to our interpretation of information in the analytical sense: having a nontrivial kernel means that in some cases the prediction is constant with respect to perturbations along certain dimensions. The overall variation (in the sense of derivatives) will therefore be smaller, which according to Analytical Learning translates to a smaller generalization gap.
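Three short illustrations of the points above; all are sketches under our own assumptions, not reproductions of the cited works. First, the function-space view of the SGD step: to first order in the learning rate η, updating the parameters θ on the example (x_i, y_i) moves the predictor at every input x through an inner product of derivative vectors (a standard expansion, not necessarily the dissertation's exact formulation):

```latex
f_{t+1}(x) \;\approx\; f_t(x) - \eta\,\ell'\bigl(f_t(x_i), y_i\bigr)\,
\bigl\langle \nabla_\theta f_t(x),\, \nabla_\theta f_t(x_i) \bigr\rangle,
```

so tracking these derivative inner products during training gives a direct handle on the model's complexity in the analytical sense.

Second, the 1/k^2 damping claimed for ReLU networks can be checked numerically in one dimension. The toy script below builds a random one-hidden-layer ReLU network (a hypothetical model, not the experiment of [28]), removes the linear ramp so that the periodic extension is continuous, and fits the decay exponent of its Fourier magnitudes; a continuous piecewise-linear function should give a slope near -2:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-hidden-layer ReLU network on [0, 1] (hypothetical model,
# not the architecture or experiment of [28]).
W = rng.normal(size=256)   # input weights
b = rng.normal(size=256)   # biases
v = rng.normal(size=256)   # output weights

def relu_net(x):
    """Evaluate the network at each point of the 1-D array x."""
    h = np.maximum(0.0, np.outer(x, W) + b)   # (n, 256) hidden activations
    return h @ v / np.sqrt(v.size)

n = 2 ** 12
x = np.linspace(0.0, 1.0, n, endpoint=False)
f = relu_net(x)

# Subtract the linear ramp so the periodic extension stays continuous;
# otherwise the endpoint jump alone would force O(1/k) coefficients.
f1 = relu_net(np.array([1.0]))[0]
f = f - (f[0] + (f1 - f[0]) * x)

# Magnitudes of the first 200 nonzero harmonics.
k = np.arange(1, 201)
mag = np.abs(np.fft.rfft(f))[1:201] / n

# Log-log slope of |c_k| against k: a continuous piecewise-linear
# function should give roughly -2, matching the damping described above.
slope = np.polyfit(np.log(k), np.log(mag + 1e-12), 1)[0]
print(f"empirical spectral decay exponent: {slope:.2f}")
```

Third, the kernel argument of [37] has a direct derivative reading. In the minimal sketch below (again our own illustration), a funnel layer mapping five inputs to three neurons necessarily has a nontrivial null space; moving the input along a null-space direction leaves the layer's output, and hence any prediction built on top of it, unchanged, i.e. the directional derivative along it is zero:

```python
import numpy as np

rng = np.random.default_rng(1)

A = rng.normal(size=(3, 5))   # funnel layer: 5 inputs -> 3 neurons
_, _, Vt = np.linalg.svd(A)
z = Vt[-1]                    # unit vector in the null space of A

x0 = rng.normal(size=5)
for t in (0.1, 1.0, 10.0):
    # Perturbing along z never changes the layer output: the prediction
    # is blind to this direction, so its partial derivatives along it vanish.
    print(t, np.allclose(A @ (x0 + t * z), A @ x0))
```

In a funnel network every layer composes with such a map, so whole directions of the instance space drop out of the prediction, shrinking the derivative-based variation that, by the bound sketched earlier, controls the analytical generalization gap.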