Characterization of automated machine learning fitness landscapes

Cristiano Guimarães Pimenta

Bibliographic details
Main author:	Cristiano Guimarães Pimenta
Publication date:	2023
Document type:	Dissertation
Language:	eng
Source title:	Repositório Institucional da UFMG
Full text:	http://hdl.handle.net/1843/62093 https://orcid.org/0000-0003-2809-8663
Abstract:	Automated Machine Learning (AutoML) aims to automatically select and configure complete machine learning pipelines without requiring deep user expertise. AutoML methods search a space of possible solutions for the best pipeline for a given learning problem. However, little is known about the characteristics of such spaces and how they relate to the performance of search methods. One way of exploring them is Fitness Landscape Analysis (FLA), a technique commonly used to describe the landscape of combinatorial optimization problems. This work adapts classic FLA measures, such as Neutrality, Fitness Distance Correlation (FDC) and Correlation Length, to the complex fitness landscapes generated by AutoML search spaces, which include discrete, continuous, categorical and conditional variables, independently of the method used to explore the search space. It also evaluates how the characteristics of the landscape affect the performance of two AutoML methods based on Bayesian optimization: Tree-structured Parzen Estimator (TPE) and Sequential Model-based Algorithm Configuration (SMAC). To use FLA in the context of AutoML, we propose a tree-based representation for machine learning pipelines that captures their semantics, a neighborhood definition based on a mutation operator, and a semantic distance metric between pipelines. Neutrality analyses suggest that larger landscapes tend to have more areas of equal or nearly equal fitness values, a feature that can improve the ability of TPE to explore the search space and find good solutions. Larger search spaces tend to be more rugged, as indicated by the Correlation Length measure, and are often more challenging for the optimizers. FDC proved to be a weak measure of problem difficulty. Furthermore, using local optima to calculate FDC can lead to very different results from using the global optimum, which is usually infeasible to compute for AutoML problems. SMAC’s performance, on the other hand, seems less affected than TPE’s by changes in the characteristics of the landscape.

Item metadata

id	UFMG_69100c6ab1a475acee640fd8fa8afb12
oai_identifier_str	oai:repositorio.ufmg.br:1843/62093
network_acronym_str	UFMG
network_name_str	Repositório Institucional da UFMG
repository_id_str
spelling	Sponsors: FAPEMIG - Fundação de Amparo à Pesquisa do Estado de Minas Gerais; CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior. Files: Dissertation_Cristiano_G_Pimenta.pdf (application/pdf, 12136185 bytes); license.txt (text/plain; charset=utf-8, 2118 bytes). Record last updated: 2023-12-19 17:53:29.174. OAI base URL: https://repositorio.ufmg.br/oai
dc.title.pt_BR.fl_str_mv	Characterization of automated machine learning fitness landscapes
author	Cristiano Guimarães Pimenta
author_facet	Cristiano Guimarães Pimenta
author_role	author
dc.contributor.advisor1.fl_str_mv	Gisele Lobo Pappa
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/5936682335701497
dc.contributor.advisor-co1.fl_str_mv	Alex Guimarães Cardoso de Sá
dc.contributor.referee1.fl_str_mv	Renato Vimieiro
dc.contributor.referee2.fl_str_mv	Ricardo Bastos Cavalcante Prudêncio
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/8713326153602094
dc.contributor.author.fl_str_mv	Cristiano Guimarães Pimenta
contributor_str_mv	Gisele Lobo Pappa; Alex Guimarães Cardoso de Sá; Renato Vimieiro; Ricardo Bastos Cavalcante Prudêncio
dc.subject.por.fl_str_mv	Fitness landscape analysis; Automated machine learning; Search spaces; Optimization
topic	Fitness landscape analysis; Automated machine learning; Search spaces; Optimization; Computing – Theses; Machine learning – Theses; Combinatorial optimization – Theses; Fitness landscape – Theses
dc.subject.other.pt_BR.fl_str_mv	Computing – Theses; Machine learning – Theses; Combinatorial optimization – Theses; Fitness landscape – Theses
publishDate	2023
dc.date.accessioned.fl_str_mv	2023-12-19T20:53:28Z
dc.date.available.fl_str_mv	2023-12-19T20:53:28Z
dc.date.issued.fl_str_mv	2023-06-21
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1843/62093
dc.identifier.orcid.pt_BR.fl_str_mv	https://orcid.org/0000-0003-2809-8663
url	http://hdl.handle.net/1843/62093 https://orcid.org/0000-0003-2809-8663
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	UFMG
dc.publisher.country.fl_str_mv	Brasil
dc.publisher.department.fl_str_mv	ICX - DEPARTAMENTO DE CIÊNCIA DA COMPUTAÇÃO
publisher.none.fl_str_mv	Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG
instname_str	Universidade Federal de Minas Gerais (UFMG)
instacron_str	UFMG
institution	UFMG
reponame_str	Repositório Institucional da UFMG
collection	Repositório Institucional da UFMG
bitstream.url.fl_str_mv	https://repositorio.ufmg.br/bitstream/1843/62093/1/Dissertation_Cristiano_G_Pimenta.pdf https://repositorio.ufmg.br/bitstream/1843/62093/2/license.txt
bitstream.checksum.fl_str_mv	0e8a2a5b5b2b9bf9bcf714da4090fb67 cda590c95a0b51b4d15f60c9642ca272
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
