Improving dynamic selection prediction in imbalanced credit scoring problems (Melhorando a predição de seleção dinâmica em problemas de Credit Scoring desbalanceado)
Main author: | Melo Junior, Leopoldo Soares de |
---|---|
Publication date: | 2020 |
Document type: | Thesis |
Language: | eng |
Source title: | Repositório Institucional da Universidade Federal do Ceará (UFC) |
Full text: | http://www.repositorio.ufc.br/handle/riufc/58918 |
Abstract: | Lenders, such as banks and credit card companies, use credit scoring models to evaluate the potential risk posed by lending money to consumers and, therefore, to mitigate losses due to bad credit. Thus, the profitability of banks depends heavily on the models used to decide on customers' loans. State-of-the-art credit scoring models use machine learning and statistical methods. One of the major problems in this field is that lenders often deal with imbalanced datasets that usually contain many paid loans but very few unpaid ones (called defaults). Recently, dynamic selection methods combined with preprocessing techniques have been evaluated to improve classification models on imbalanced datasets, presenting advantages over static machine learning methods. In a dynamic selection technique, samples in the neighborhood of each query sample are used to compute the base classifiers' local competence. These techniques then select only the locally competent classifiers for each query sample. Most dynamic selection techniques use the k-NN algorithm to define the local region. In this thesis, we modify dynamic selection techniques to improve prediction performance on imbalanced credit scoring datasets. First, we evaluate the performance of static techniques under several imbalance levels. Next, we apply dynamic selection techniques to the best ensembles from the previous experiment with a new definition of the local region, the Reduced Minority k-Nearest Neighbors (RMkNN). The intuition behind RMkNN is to overcome the bias of k-NN, which tends to select mainly majority-class samples when defining local regions in imbalanced datasets. Afterwards, we explore improvements by modifying the performance measure used to compute the local competence of base classifiers. The intuition is to replace accuracy with a measure better suited to imbalanced datasets. This metric is FA2, the combination of F-measure with the square of accuracy. We find that these modifications improve prediction performance on imbalanced credit scoring datasets. Finally, we combine the RMkNN and FA2 techniques to evaluate the total prediction improvement on the credit scoring problem. We conduct a comprehensive evaluation of the proposed technique against state-of-the-art competitors on six real-world public datasets and one private dataset. Experiments show that RMkNN and FA2 improve classification performance on the evaluated datasets by up to 18% across seven performance measures. |
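The dynamic selection loop summarized in the abstract (find the query's neighbors, score each base classifier on them, keep the locally best one) can be sketched roughly as follows. This is a minimal illustration using plain k-NN and a pluggable competence measure; it is not the thesis's RMkNN or FA2, whose exact definitions are not given in this record. All names (`select_and_predict`, `DSEL_X`, `POOL`, the toy thresholds) are hypothetical and chosen only for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[float, ...]
Classifier = Callable[[Sample], int]  # 1 = default (minority), 0 = paid
Competence = Callable[[Sequence[int], Sequence[int]], float]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of region samples labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 on the minority class (label 1); 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def region_of_competence(query: Sample, dsel_X: Sequence[Sample], k: int) -> List[int]:
    """Indices of the k labeled samples closest to the query (plain Euclidean k-NN)."""
    order = sorted(range(len(dsel_X)), key=lambda i: math.dist(query, dsel_X[i]))
    return order[:k]


def select_and_predict(
    query: Sample,
    pool: Sequence[Classifier],
    dsel_X: Sequence[Sample],
    dsel_y: Sequence[int],
    k: int = 3,
    competence: Competence = accuracy,
) -> Tuple[int, int, List[float]]:
    """Score every base classifier on the query's neighborhood and use the best one."""
    region = region_of_competence(query, dsel_X, k)
    y_true = [dsel_y[i] for i in region]
    scores = [competence(y_true, [clf(dsel_X[i]) for i in region]) for clf in pool]
    best = max(range(len(pool)), key=scores.__getitem__)  # first index wins ties
    return pool[best](query), best, scores


# Toy one-feature data (hypothetical): labeled samples used to estimate competence.
DSEL_X: List[Sample] = [(1.0,), (2.0,), (3.0,), (7.0,), (8.0,), (9.0,)]
DSEL_Y: List[int] = [0, 1, 1, 0, 0, 1]

# Toy pool (hypothetical): two threshold rules and an "always paid" classifier.
POOL: List[Classifier] = [
    lambda x: int(x[0] > 1.5),
    lambda x: int(x[0] > 8.5),
    lambda x: 0,
]

for q in [(2.5,), (7.5,)]:
    for measure in (accuracy, f_measure):
        label, chosen, scores = select_and_predict(q, POOL, DSEL_X, DSEL_Y, competence=measure)
        print(q, measure.__name__, "-> label", label, "classifier", chosen, scores)
```

In the neighborhood of the second query, the "always paid" classifier outranks the first threshold rule under accuracy but drops to zero under the F-measure, which is the kind of majority-class bias that motivates swapping the competence measure on imbalanced data.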
id |
UFC-7_38cb6217f82ad8fe35f4259eb4244c75 |
---|---|
oai_identifier_str |
oai:repositorio.ufc.br:riufc/58918 |
network_acronym_str |
UFC-7 |
network_name_str |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
repository_id_str |
|
dc.title.none.fl_str_mv |
Melhorando a predição de seleção dinâmica em problemas de Credit Scoring desbalanceado / Improving dynamic selection prediction in imbalanced credit scoring problems |
title |
Melhorando a predição de seleção dinâmica em problemas de Credit Scoring desbalanceado |
author |
Melo Junior, Leopoldo Soares de |
author_facet |
Melo Junior, Leopoldo Soares de |
author_role |
author |
dc.contributor.none.fl_str_mv |
Macêdo, José Antonio Fernandes de |
dc.contributor.author.fl_str_mv |
Melo Junior, Leopoldo Soares de |
dc.subject.por.fl_str_mv |
Credit scoring; Imbalanced learning; Dynamic selection classification |
topic |
Credit scoring; Imbalanced learning; Dynamic selection classification |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020 2021-06-11T12:51:51Z 2021-06-11T12:51:51Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
MELO JUNIOR, Leopoldo Soares de. Improving dynamic selection prediction in imbalanced credit scoring problems. 2020. 105 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2020. http://www.repositorio.ufc.br/handle/riufc/58918 |
identifier_str_mv |
MELO JUNIOR, Leopoldo Soares de. Improving dynamic selection prediction in imbalanced credit scoring problems. 2020. 105 f. Tese (Doutorado em Ciência da Computação) - Universidade Federal do Ceará, Fortaleza, 2020. |
url |
http://www.repositorio.ufc.br/handle/riufc/58918 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da Universidade Federal do Ceará (UFC) instname:Universidade Federal do Ceará (UFC) instacron:UFC |
instname_str |
Universidade Federal do Ceará (UFC) |
instacron_str |
UFC |
institution |
UFC |
reponame_str |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
collection |
Repositório Institucional da Universidade Federal do Ceará (UFC) |
repository.name.fl_str_mv |
Repositório Institucional da Universidade Federal do Ceará (UFC) - Universidade Federal do Ceará (UFC) |
repository.mail.fl_str_mv |
bu@ufc.br || repositorio@ufc.br |
_version_ |
1813028727156637696 |