Markov Blanket discovery without causal sufficiency: application in credit data
Main author: | Jeronymo, Pedro Virgilio Basílio |
---|---|
Publication date: | 2021 |
Document type: | Master's thesis |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | https://www.teses.usp.br/teses/disponiveis/18/18153/tde-19012022-113726/ |
Abstract: | Faster feature selection algorithms become a necessity as Big Data dictates the zeitgeist. An important class of feature selectors is Markov Blanket (MB) learning algorithms. They are Causal Discovery algorithms that learn the local causal structure of a target variable. A common assumption in their theoretical basis, yet one often violated in practice, is causal sufficiency. The M3B algorithm was proposed as the first to learn the MB directly without demanding causal sufficiency. The main drawback of M3B is that it is time inefficient, being intractable for high-dimensional inputs. Seeking a faster method, we derive the Fast Markov Blanket Discovery Algorithm (FMMB). Empirical results comparing FMMB to M3B on the structural learning task show that FMMB outperforms M3B in terms of time efficiency while preserving structural accuracy, given a large enough sample size. Moreover, we introduce a new technique to aggregate bootstrapped MB structures, which first extracts a consensus MB and then constructs the aggregated structure as the union of the most probable paths between each feature in the MB and the target. Comparisons with the state of the art show that the proposed aggregation has a smaller loss of information. The analysis was conducted using credit-related data, with special focus on Peer-to-Peer lending platforms. Our results validate the credit scoring models used by these platforms as effective in identifying bad borrowers, although they still have room for improvement. Finally, we propose an ensemble of Bayesian Network Classifiers trained using the Cross-Entropy method. The ensemble performs better in credit scoring than Logistic Regression and Random Forests on the selected datasets. |
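
The aggregation step described in the abstract (a consensus MB, followed by the union of the most probable paths to the target) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The function name `aggregate_bootstrapped_mbs`, the 0.5 inclusion threshold, the use of plain undirected edges (edge marks are ignored), and the choice to score a path by the product of bootstrap edge frequencies are all assumptions made for illustration.

```python
import heapq
import math
from collections import Counter


def aggregate_bootstrapped_mbs(structures, target, threshold=0.5):
    """Illustrative aggregation of bootstrapped MB structures (hypothetical helper).

    structures: list of edge sets, one per bootstrap replicate; each edge is
                a frozenset({u, v}) over string-named variables (assumption).
    Returns (consensus_mb, aggregated_edges).
    """
    n = len(structures)
    node_freq = Counter()
    edge_freq = Counter()
    for edges in structures:
        nodes = set().union(*edges)          # variables present in this replicate
        node_freq.update(nodes - {target})
        edge_freq.update(edges)

    # Assumed consensus rule: keep variables seen in at least `threshold` of replicates.
    consensus = {v for v, c in node_freq.items() if c / n >= threshold}

    # Edge cost = -log(frequency), so the cheapest path is the one whose
    # product of edge frequencies is largest (assumed notion of "most probable").
    adjacency = {}
    for edge, c in edge_freq.items():
        u, v = tuple(edge)
        cost = -math.log(c / n)
        adjacency.setdefault(u, []).append((v, cost, edge))
        adjacency.setdefault(v, []).append((u, cost, edge))

    # Dijkstra from the target over the pooled bootstrap edges.
    dist = {target: 0.0}
    back = {}                                 # node -> (previous node, edge used)
    heap = [(0.0, target)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue                          # stale heap entry
        for v, cost, edge in adjacency.get(u, []):
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                back[v] = (u, edge)
                heapq.heappush(heap, (nd, v))

    # Union of the best path from each consensus feature back to the target.
    aggregated = set()
    for feature in consensus:
        node = feature
        while node != target and node in back:
            node, edge = back[node]
            aggregated.add(edge)
    return consensus, aggregated
```

Under these assumptions, a feature that is linked to the target directly in only a few replicates but through a frequently recovered intermediate variable is attached via the intermediate, because that path has the higher product of edge frequencies.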
id |
USP_6a79bdd09c27503f01e397f6597af1c6 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-19012022-113726 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
spelling |
Markov Blanket discovery without causal sufficiency: application in credit data / Descoberta de Markov Blankets sem suficiência causal: aplicação em dados de crédito
Markov Blanket; Bayesian networks; Causal discovery; Credit
Faster feature selection algorithms become a necessity as Big Data dictates the zeitgeist. An important class of feature selectors is Markov Blanket (MB) learning algorithms. They are Causal Discovery algorithms that learn the local causal structure of a target variable. A common assumption in their theoretical basis, yet one often violated in practice, is causal sufficiency. The M3B algorithm was proposed as the first to learn the MB directly without demanding causal sufficiency. The main drawback of M3B is that it is time inefficient, being intractable for high-dimensional inputs. Seeking a faster method, we derive the Fast Markov Blanket Discovery Algorithm (FMMB). Empirical results comparing FMMB to M3B on the structural learning task show that FMMB outperforms M3B in terms of time efficiency while preserving structural accuracy, given a large enough sample size. Moreover, we introduce a new technique to aggregate bootstrapped MB structures, which first extracts a consensus MB and then constructs the aggregated structure as the union of the most probable paths between each feature in the MB and the target. Comparisons with the state of the art show that the proposed aggregation has a smaller loss of information. The analysis was conducted using credit-related data, with special focus on Peer-to-Peer lending platforms. Our results validate the credit scoring models used by these platforms as effective in identifying bad borrowers, although they still have room for improvement. Finally, we propose an ensemble of Bayesian Network Classifiers trained using the Cross-Entropy method. The ensemble performs better in credit scoring than Logistic Regression and Random Forests on the selected datasets.
Faster feature selection becomes a necessity as Big Data dictates the zeitgeist. An important class of feature selectors is Markov Blanket (MB) discovery algorithms. They are causal discovery algorithms that learn the local causal structure of a target variable. A common assumption in their theoretical basis, frequently violated in practice, is causal sufficiency: the belief that all common causes of the measured variables that make up the dataset are also in the dataset. Recently, the M3B algorithm was proposed. It is the first to learn the MB directly without demanding causal sufficiency. The main drawback of M3B is its time inefficiency, being intractable for very large inputs. Here, we derive the Fast Markov Blanket Discovery Algorithm (FMMB). Empirical results comparing FMMB with M3B on structure learning show that FMMB performs better in terms of time while preserving the accuracy of the causal structure, given a large enough sample size. Moreover, we introduce a new technique to aggregate MB structures obtained from bootstrap, which first extracts a consensus on what the MB is and then constructs the aggregated structure as the union of the most probable paths between the target and the features that make up the MB. Comparisons with the state of the art show that the proposed aggregation loses less information. The analyses were conducted using credit data, with special attention to peer-to-peer lending platforms. Our results validate the credit models used by these platforms as effective in identifying bad borrowers. Finally, we propose an ensemble of Bayesian Network Classifiers trained using the Cross-Entropy Method. The ensemble performed better in credit scoring than Logistic Regression and Random Forests on the selected datasets.
Biblioteca Digital de Teses e Dissertações da USP; Maciel, Carlos Dias; Jeronymo, Pedro Virgilio Basílio; 2021-12-15; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; https://www.teses.usp.br/teses/disponiveis/18/18153/tde-19012022-113726/; reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP; Release the content for public access.; info:eu-repo/semantics/openAccess; eng; 2022-02-07T15:28:02Z; oai:teses.usp.br:tde-19012022-113726; Biblioteca Digital de Teses e Dissertações; http://www.teses.usp.br/; PUB; http://www.teses.usp.br/cgi-bin/mtd2br.pl; virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br; opendoar:2721; 2022-02-07T15:28:02; Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP); false |
dc.title.none.fl_str_mv |
Markov Blanket discovery without causal sufficiency: application in credit data / Descoberta de Markov Blankets sem suficiência causal: aplicação em dados de crédito |
title |
Markov Blanket discovery without causal sufficiency: application in credit data |
spellingShingle |
Markov Blanket discovery without causal sufficiency: application in credit data; Jeronymo, Pedro Virgilio Basílio; Markov Blanket; Bayesian networks; Causal discovery; Credit |
title_short |
Markov Blanket discovery without causal sufficiency: application in credit data |
title_full |
Markov Blanket discovery without causal sufficiency: application in credit data |
title_fullStr |
Markov Blanket discovery without causal sufficiency: application in credit data |
title_full_unstemmed |
Markov Blanket discovery without causal sufficiency: application in credit data |
title_sort |
Markov Blanket discovery without causal sufficiency: application in credit data |
author |
Jeronymo, Pedro Virgilio Basílio |
author_facet |
Jeronymo, Pedro Virgilio Basílio |
author_role |
author |
dc.contributor.none.fl_str_mv |
Maciel, Carlos Dias |
dc.contributor.author.fl_str_mv |
Jeronymo, Pedro Virgilio Basílio |
dc.subject.por.fl_str_mv |
Markov Blanket; Bayesian networks; Causal discovery; Credit |
topic |
Markov Blanket; Bayesian networks; Causal discovery; Credit |
description |
Faster feature selection algorithms become a necessity as Big Data dictates the zeitgeist. An important class of feature selectors is Markov Blanket (MB) learning algorithms. They are Causal Discovery algorithms that learn the local causal structure of a target variable. A common assumption in their theoretical basis, yet one often violated in practice, is causal sufficiency. The M3B algorithm was proposed as the first to learn the MB directly without demanding causal sufficiency. The main drawback of M3B is that it is time inefficient, being intractable for high-dimensional inputs. Seeking a faster method, we derive the Fast Markov Blanket Discovery Algorithm (FMMB). Empirical results comparing FMMB to M3B on the structural learning task show that FMMB outperforms M3B in terms of time efficiency while preserving structural accuracy, given a large enough sample size. Moreover, we introduce a new technique to aggregate bootstrapped MB structures, which first extracts a consensus MB and then constructs the aggregated structure as the union of the most probable paths between each feature in the MB and the target. Comparisons with the state of the art show that the proposed aggregation has a smaller loss of information. The analysis was conducted using credit-related data, with special focus on Peer-to-Peer lending platforms. Our results validate the credit scoring models used by these platforms as effective in identifying bad borrowers, although they still have room for improvement. Finally, we propose an ensemble of Bayesian Network Classifiers trained using the Cross-Entropy method. The ensemble performs better in credit scoring than Logistic Regression and Random Forests on the selected datasets. |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-12-15 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/18/18153/tde-19012022-113726/ |
url |
https://www.teses.usp.br/teses/disponiveis/18/18153/tde-19012022-113726/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
|
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.coverage.none.fl_str_mv |
|
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br |
_version_ |
1815257124131831808 |