New sampling algorithms for enhancing classifier performance on imbalanced data problems

MORAIS, Romero Fernando Almeida Barata de

Bibliographic details
Main author:	MORAIS, Romero Fernando Almeida Barata de
Publication date:	2018
Document type:	Dissertation
Language:	eng
Source title:	Repositório Institucional da UFPE
dARK ID:	ark:/64986/001300000pt9k
Full text:	https://repositorio.ufpe.br/handle/123456789/31428
Abstract:	Classification problems in which the distribution of examples among the classes is imbalanced arise frequently in real-world domains. Commonly, these domains comprise critical problems where accurate predictions for all classes are necessary, such as credit card fraud detection, churn prediction, disease diagnosis, and detection of intrusive network traffic. The problem with imbalanced data sets is that standard classifiers often have low accuracy on the underrepresented classes. Data sampling is the most popular approach to dealing with imbalanced data sets and works by either decreasing the size of the majority classes (under-sampling) or increasing the size of the minority classes (over-sampling). In this dissertation we propose two new data sampling algorithms: RRUS and k-INOS. RRUS is an under-sampling algorithm that aims to select the subset of majority-class examples that best represents the majority class by preserving its density distribution. k-INOS is a general strategy for making over-sampling algorithms more robust to noisy examples in the minority class. Both algorithms were extensively tested on 50 imbalanced data sets with 6 diverse classifiers, and performance was evaluated according to 7 metrics. In particular, RRUS was compared to 3 other under-sampling algorithms and was significantly better than KMUS and SBC most of the time, and significantly better than RUS many times, for most classifiers and performance metrics. k-INOS, as a wrapper for over-sampling algorithms, was tested on 7 over-sampling algorithms and significantly increased Accuracy, Precision, and Specificity most of the time, and F1 many times. In addition, the hyperparameters of k-INOS were studied and appropriate values for their use were suggested. Finally, rules extracted from these experiments with k-INOS revealed that the N3 complexity metric (the leave-one-out cross-validation error rate of the 1-NN classifier) is often an indicator of whether k-INOS is likely to attain performance improvements.
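
The abstract defines the N3 complexity metric as the leave-one-out error rate of the 1-NN classifier. The sketch below is one possible way to compute that quantity; it is not taken from the dissertation. The function name `n3_loocv_1nn_error`, the use of Euclidean distance, and the toy data are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def n3_loocv_1nn_error(points, labels):
    """Leave-one-out error rate of a 1-NN classifier (illustrative sketch).

    Each example is held out in turn and classified with the label of its
    nearest remaining example. Euclidean distance is an assumption here.
    """
    if len(points) != len(labels) or len(points) < 2:
        raise ValueError("need at least two labelled examples")
    errors = 0
    for i, (p, y) in enumerate(zip(points, labels)):
        # Index of the nearest neighbour among all examples except i.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        if labels[nearest] != y:
            errors += 1
    return errors / len(points)


# Made-up toy data: four majority examples (label 0), two clean minority
# examples (label 1), and one noisy minority example placed next to the
# majority cluster.
X_clean = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
y_clean = [0, 0, 0, 0, 1, 1]
X_noisy = X_clean + [(-0.3, -0.3)]
y_noisy = y_clean + [1]

print(n3_loocv_1nn_error(X_clean, y_clean))
print(n3_loocv_1nn_error(X_noisy, y_noisy))
```

On this toy data the single noisy minority example raises the error rate, because both it and its nearest majority neighbour are misclassified when held out.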

Item metadata

id	UFPE_d065e494d2f38cad4534e94c86f87737
oai_identifier_str	oai:repositorio.ufpe.br:123456789/31428
network_acronym_str	UFPE
network_name_str	Repositório Institucional da UFPE
repository_id_str	2221
dc.title.pt_BR.fl_str_mv	New sampling algorithms for enhancing classifier performance on imbalanced data problems
author	MORAIS, Romero Fernando Almeida Barata de
author_facet	MORAIS, Romero Fernando Almeida Barata de
author_role	author
dc.contributor.authorLattes.pt_BR.fl_str_mv	http://lattes.cnpq.br/2407501857144501
dc.contributor.advisorLattes.pt_BR.fl_str_mv	http://lattes.cnpq.br/5943634209341438
dc.contributor.author.fl_str_mv	MORAIS, Romero Fernando Almeida Barata de
dc.contributor.advisor1.fl_str_mv	VASCONCELOS, Germano Crispim
contributor_str_mv	VASCONCELOS, Germano Crispim
dc.subject.por.fl_str_mv	Inteligência computacional Aprendizagem de máquina
topic	Inteligência computacional Aprendizagem de máquina
publishDate	2018
dc.date.issued.fl_str_mv	2018-02-06
dc.date.accessioned.fl_str_mv	2019-07-11T18:57:52Z
dc.date.available.fl_str_mv	2019-07-11T18:57:52Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://repositorio.ufpe.br/handle/123456789/31428
dc.identifier.dark.fl_str_mv	ark:/64986/001300000pt9k
url	https://repositorio.ufpe.br/handle/123456789/31428
identifier_str_mv	ark:/64986/001300000pt9k
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de Pernambuco
dc.publisher.program.fl_str_mv	Programa de Pos Graduacao em Ciencia da Computacao
dc.publisher.initials.fl_str_mv	UFPE
dc.publisher.country.fl_str_mv	Brasil
publisher.none.fl_str_mv	Universidade Federal de Pernambuco
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFPE instname:Universidade Federal de Pernambuco (UFPE) instacron:UFPE
instname_str	Universidade Federal de Pernambuco (UFPE)
instacron_str	UFPE
institution	UFPE
reponame_str	Repositório Institucional da UFPE
collection	Repositório Institucional da UFPE
bitstream.url.fl_str_mv	https://repositorio.ufpe.br/bitstream/123456789/31428/6/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf.jpg https://repositorio.ufpe.br/bitstream/123456789/31428/1/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf https://repositorio.ufpe.br/bitstream/123456789/31428/3/license.txt https://repositorio.ufpe.br/bitstream/123456789/31428/4/license_rdf https://repositorio.ufpe.br/bitstream/123456789/31428/5/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf.txt
bitstream.checksum.fl_str_mv	67983e28d659890d5dd8cfdaa2cc5d36 ca21dc261b03f861d16cf90fe4149199 4b8a02c7f2818eaf00dcf2260dd5eb08 e39d27027a6cc9cb039ad269a5db8e34 e695f1d96d401835350bd8ddbe3abdf4
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE)
repository.mail.fl_str_mv	attena@ufpe.br
_version_	1815172879927476224
