Handling imbalanced datasets through Optimum-Path Forest

Bibliographic details
Main author: Passos, Leandro Aparecido [UNESP]
Publication date: 2022
Other authors: Jodas, Danilo S. [UNESP], Ribeiro, Luiz C.F. [UNESP], Akio, Marco [UNESP], de Souza, Andre Nunes [UNESP], Papa, João Paulo [UNESP]
Document type: Article
Language: eng
Source title: Repositório Institucional da UNESP
Full text: http://dx.doi.org/10.1016/j.knosys.2022.108445
http://hdl.handle.net/11449/234201
Abstract: Over the last decade, machine learning-based approaches have become capable of performing a wide range of complex tasks, sometimes better than humans and in a fraction of the time. This advance is partially due to the exponential growth in the amount of available data, which makes it possible to extract trustworthy real-world information from it. However, such data is generally imbalanced, since some phenomena are more likely than others. This imbalance considerably affects a machine learning model's performance, since the model becomes biased toward the more frequent data it receives. Among the many machine learning methods available, one graph-based approach, the Optimum-Path Forest (OPF), has attracted considerable attention due to its outstanding performance in many applications. In this paper, we propose three OPF-based strategies to deal with the imbalance problem: O2PF and OPF-US, which are novel approaches for oversampling and undersampling, respectively, and a hybrid strategy combining both. The paper also introduces a set of variants of these strategies. Results compared against several state-of-the-art techniques over public and private datasets confirm the robustness of the proposed approaches.
id UNSP_971d4ee5da266237ecf919b44386d0fb
oai_identifier_str oai:repositorio.unesp.br:11449/234201
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
Funding agencies: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)
Affiliations: Department of Computing, São Paulo State University, Av. Eng. Luiz Edmundo Carrijo Coube, 14-01; Department of Electrical Engineering, São Paulo State University, Av. Eng. Luiz Edmundo Carrijo Coube, 14-01
Grant numbers: FAPESP #2013/07375-0, #2014/12236-1, #2017/02286-0, #2018/21934-5, #2019/07665-4, #2019/18287-0, #2020/12101-0; CNPq #307066/2017-7, #427968/2018-6
OAI request URL: http://repositorio.unesp.br/oai/request
OAI record datestamp: 2024-04-23T16:11:00Z
dc.subject.por.fl_str_mv Imbalanced data
Optimum-Path Forest
Oversampling
Undersampling
dc.date.none.fl_str_mv 2022-05-01T13:57:34Z
2022-05-01T13:57:34Z
2022-04-22
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://dx.doi.org/10.1016/j.knosys.2022.108445
Knowledge-Based Systems, v. 242.
0950-7051
http://hdl.handle.net/11449/234201
10.1016/j.knosys.2022.108445
2-s2.0-85125266467
dc.relation.none.fl_str_mv Knowledge-Based Systems
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.source.none.fl_str_mv Scopus
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP