New sampling algorithms for enhancing classifier performance on imbalanced data problems
Lead author: | MORAIS, Romero Fernando Almeida Barata de |
---|---|
Publication date: | 2018 |
Document type: | Dissertation (master's thesis) |
Language: | eng |
Source title: | Repositório Institucional da UFPE |
dARK ID: | ark:/64986/001300000pt9k |
Full text: | https://repositorio.ufpe.br/handle/123456789/31428 |
Abstract: | Classification problems in which the distribution of examples among the classes is imbalanced arise frequently in real-world domains. These domains often involve critical problems where accurate predictions for all classes are necessary, such as credit card fraud detection, churn prediction, disease diagnosis, and detection of intrusive network traffic. The problem with imbalanced data sets is that standard classifiers often have low accuracy on the underrepresented classes. Data sampling is the most popular approach to dealing with imbalanced data sets and works by either decreasing the size of the majority classes (under-sampling) or increasing the size of the minority classes (over-sampling). In this dissertation we propose two new data sampling algorithms: RRUS and k-INOS. RRUS is an under-sampling algorithm that aims to select the subset of majority-class examples that best represents the majority class by preserving its density distribution. k-INOS is a general strategy for making over-sampling algorithms more robust to noisy examples in the minority class. Both algorithms were extensively tested on 50 imbalanced data sets with 6 diverse classifiers, and performance was evaluated according to 7 metrics. In particular, RRUS was compared with 3 other under-sampling algorithms and was significantly better than KMUS and SBC most of the time, and significantly better than RUS many times, for most classifiers and performance metrics. k-INOS, as a wrapper for over-sampling algorithms, was tested on 7 over-sampling algorithms and significantly increased Accuracy, Precision, and Specificity most of the time, and F1 many times. In addition, the hyperparameters of k-INOS were studied and appropriate values for their use were suggested. Finally, rules extracted from the main experiments with k-INOS revealed that the N3 complexity metric (the leave-one-out cross-validation error rate of the 1-NN classifier) is often an indicator of whether k-INOS is likely to improve performance. |
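Two ideas named in the abstract are simple enough to sketch in a few lines: under-sampling of the majority class and the N3 complexity metric. The Python sketch below is illustrative only and is not the dissertation's code. It does not implement RRUS or k-INOS, whose details are not given in this record. It assumes that RUS denotes plain random under-sampling and that N3 uses Euclidean distance; the function names `random_undersample` and `n3` and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def random_undersample(X, y, seed=0):
    """Plain random under-sampling (assumed baseline, not RRUS).

    Randomly keeps as many examples of every class as the smallest
    class contains, so all classes end up the same size.
    """
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    target = min(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        for features in rng.sample(rows, target):
            X_out.append(features)
            y_out.append(label)
    return X_out, y_out


def n3(X, y):
    """N3: leave-one-out error rate of the 1-NN classifier.

    Each example is classified by its nearest other example
    (Euclidean distance is an assumption of this sketch).
    """
    errors = 0
    for i, (xi, yi) in enumerate(zip(X, y)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        if y[nearest] != yi:
            errors += 1
    return errors / len(X)


# Toy data: six majority examples (class 0), two clean minority examples
# (class 1), and one noisy minority example inside the majority region.
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (0.2, 0.8),
     (5.0, 5.0), (5.0, 6.0), (0.4, 0.4)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = random_undersample(X, y, seed=0)
print("class counts after under-sampling:", Counter(y_bal))
print("N3 on the original data:", n3(X, y))
```

A high N3 value means that many examples have a nearest neighbour of a different class, which is one rough sign of class overlap or noise. The abstract reports only that N3 is often an indicator of whether k-INOS helps; it does not state thresholds, and none are implied here.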
id |
UFPE_d065e494d2f38cad4534e94c86f87737 |
---|---|
oai_identifier_str |
oai:repositorio.ufpe.br:123456789/31428 |
network_acronym_str |
UFPE |
network_name_str |
Repositório Institucional da UFPE |
repository_id_str |
2221 |
dc.title.pt_BR.fl_str_mv |
New sampling algorithms for enhancing classifier performance on imbalanced data problems |
author |
MORAIS, Romero Fernando Almeida Barata de |
author_facet |
MORAIS, Romero Fernando Almeida Barata de |
author_role |
author |
dc.contributor.authorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/2407501857144501 |
dc.contributor.advisorLattes.pt_BR.fl_str_mv |
http://lattes.cnpq.br/5943634209341438 |
dc.contributor.author.fl_str_mv |
MORAIS, Romero Fernando Almeida Barata de |
dc.contributor.advisor1.fl_str_mv |
VASCONCELOS, Germano Crispim |
contributor_str_mv |
VASCONCELOS, Germano Crispim |
dc.subject.por.fl_str_mv |
Computational intelligence; Machine learning |
topic |
Computational intelligence; Machine learning |
publishDate |
2018 |
dc.date.issued.fl_str_mv |
2018-02-06 |
dc.date.accessioned.fl_str_mv |
2019-07-11T18:57:52Z |
dc.date.available.fl_str_mv |
2019-07-11T18:57:52Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufpe.br/handle/123456789/31428 |
dc.identifier.dark.fl_str_mv |
ark:/64986/001300000pt9k |
url |
https://repositorio.ufpe.br/handle/123456789/31428 |
identifier_str_mv |
ark:/64986/001300000pt9k |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.publisher.program.fl_str_mv |
Programa de Pos Graduacao em Ciencia da Computacao |
dc.publisher.initials.fl_str_mv |
UFPE |
dc.publisher.country.fl_str_mv |
Brazil |
publisher.none.fl_str_mv |
Universidade Federal de Pernambuco |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFPE instname:Universidade Federal de Pernambuco (UFPE) instacron:UFPE |
instname_str |
Universidade Federal de Pernambuco (UFPE) |
instacron_str |
UFPE |
institution |
UFPE |
reponame_str |
Repositório Institucional da UFPE |
collection |
Repositório Institucional da UFPE |
bitstream.url.fl_str_mv |
https://repositorio.ufpe.br/bitstream/123456789/31428/6/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf.jpg https://repositorio.ufpe.br/bitstream/123456789/31428/1/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf https://repositorio.ufpe.br/bitstream/123456789/31428/3/license.txt https://repositorio.ufpe.br/bitstream/123456789/31428/4/license_rdf https://repositorio.ufpe.br/bitstream/123456789/31428/5/DISSERTA%c3%87%c3%83O%20Romero%20Fernando%20Almeida%20Barata%20de%20Morais.pdf.txt |
bitstream.checksum.fl_str_mv |
67983e28d659890d5dd8cfdaa2cc5d36 ca21dc261b03f861d16cf90fe4149199 4b8a02c7f2818eaf00dcf2260dd5eb08 e39d27027a6cc9cb039ad269a5db8e34 e695f1d96d401835350bd8ddbe3abdf4 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFPE - Universidade Federal de Pernambuco (UFPE) |
repository.mail.fl_str_mv |
attena@ufpe.br |
_version_ |
1815172879927476224 |