Imbalanced classification tasks: measuring data complexity and recommending techniques

Bibliographic details
Main author: Barella, Victor Hugo
Publication date: 2021
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-26042021-140437/
Abstract: Machine learning classification algorithms tend to perform poorly on datasets with class imbalance. Class imbalance is not a problem per se, but it has adverse effects when combined with other data characteristics, such as class overlap and noise. This study aims to measure data characteristics in imbalanced datasets and to recommend techniques for dealing with class imbalance through a meta-learning system. Popular data complexity measures were decomposed per class to better assess the characteristics of imbalanced datasets. They were applied to controlled artificial datasets and to real datasets. These measures were correlated with the predictive performance of several classification models. The measures were also evaluated before and after applying popular pre-processing techniques for imbalanced datasets. Moreover, a meta-learning system was implemented using popular meta-features along with the data complexity measures developed in this research. The results showed that decomposing the data complexity measures per class improved their ability to measure complexity in imbalanced datasets. Furthermore, according to the experimental results, they were the most important meta-features in the meta-learning system. Based on these results, data science practitioners should consider measuring the data complexity of imbalanced datasets, whether to interpret the data characteristics, select techniques, or develop new techniques.
id USP_1ce25dfb785e34c9ba15f3e92136cabf
oai_identifier_str oai:teses.usp.br:tde-26042021-140437
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Imbalanced classification tasks: measuring data complexity and recommending techniques
Tarefas de classificação desbalanceadas: medindo complexidade de dados e recomendando técnicas
author Barella, Victor Hugo
author_facet Barella, Victor Hugo
author_role author
dc.contributor.none.fl_str_mv Carvalho, André Carlos Ponce de Leon Ferreira de
dc.contributor.author.fl_str_mv Barella, Victor Hugo
dc.subject.por.fl_str_mv Aprendizado de máquina
Dados desbalanceados
Data complexity
Imbalanced datasets
Machine learning
Meta-aprendizado
Meta-learning
Meta-atributos
Meta-features
publishDate 2021
dc.date.none.fl_str_mv 2021-02-22
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-26042021-140437/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-26042021-140437/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815257357140099072