Label noise injection methods for model robustness assessment in fraud detection datasets

Santos, Sofia Jerónimo dos

Bibliographic details
Main author:	Santos, Sofia Jerónimo dos
Publication date:	2021
Document type:	Dissertation
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10362/112794
Abstract:	Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics

Item metadata

id	RCAP_5ed3a696c3da951d4585005d5962552a
oai_identifier_str	oai:run.unl.pt:10362/112794
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Label noise injection methods for model robustness assessment in fraud detection datasets. Label noise; Fraud detection; Random Forest; LightGBM; Model robustness; Hyperparameter importance; Rótulos Incorretos; Deteção de Fraude; Robustez; Importância dos Hiper-parâmetros. Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
Label noise is a common issue in real-life applications of machine learning for fraud detection, which can lead to sub-optimal decisions during the model building phase and, ultimately, to poor model performance. A key factor in the impact of noisy data on the performance of a model is the algorithm used to train it and that algorithm's robustness to label noise. In this work, we studied the robustness of the models generated by two different supervised tree-based algorithms, Random Forest and LightGBM, to different types of random and not-at-random artificial label noise injection techniques, at different percentages of noise, and using different datasets to both train and evaluate them. We also observed the impact of label noise on the evaluation of a model's performance. Finally, we analyzed the importance of the different hyperparameters of both algorithms for their performance. We show that both algorithms are robust to random label noise at different noise percentages; however, they fail to separate the classes in the presence of noise not at random. We also show that, for random label noise, the correlation between the model performance on the noisy validation set and on the test set decreases as the noise percentage increases, whereas for noise not at random there is no obvious correlation between the two sets. Finally, we conclude which hyperparameters are the most relevant for the performance of Random Forest models in the presence of random label noise, and that in most cases none of the studied hyperparameters for LightGBM seems to be more relevant than the others for model performance.
Portuguese abstract (translated): A common problem in applying machine learning techniques to fraud detection is the incorrect labeling of instances, which can lead to sub-optimal decisions during the model building phase and thus to poor model performance. A key factor in the impact that incorrect labeling has on a model's performance is the algorithm used to build it and how robust that algorithm is. In this work, we studied the robustness of models generated by two different supervised learning algorithms based on decision trees, Random Forest and LightGBM, to different types of noise injection methods, some random and others deterministic. We evaluated the results by adding different percentages of perturbation to the training and validation data, and analyzed the impact of noise both on training and on the evaluation of model performance. Finally, we analyzed the importance of the different hyperparameters for increasing the level of model performance. Our results show that both algorithms are robust to different percentages of incorrect labels when these are introduced at random; however, the algorithms cannot distinguish between fraud and non-fraud cases when deterministic methods are used. We also show that, for incorrect labels introduced at random, the correlation between a model's performance on the noisy validation data and its performance on the noise-free test data decreases as the percentage of incorrect labels increases. However, for deterministic methods of inserting incorrect labels, no correlation is found between the datasets. We conclude which hyperparameters are most relevant for the performance of Random Forest models under random insertion of incorrect labels and that, for LightGBM, in most cases none of the studied hyperparameters appears to stand out with respect to model performance.
Castelli, Mauro; Silva, Maria Inês Pastor Pereira da; Ferreira, João Guilherme Simões Bravo; RUN; Santos, Sofia Jerónimo dos; 2021-03-01T15:45:36Z; 2021-01-11; 2021-01-11T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/112794; TID:202654672; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T04:56:09Z; oai:run.unl.pt:10362/112794; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:42:13.068686; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false
dc.title.none.fl_str_mv	Label noise injection methods for model robustness assessment in fraud detection datasets
title	Label noise injection methods for model robustness assessment in fraud detection datasets
spellingShingle	Label noise injection methods for model robustness assessment in fraud detection datasets; Santos, Sofia Jerónimo dos; Label noise; Fraud detection; Random Forest; LightGBM; Model robustness; Hyperparameter importance; Rótulos Incorretos; Deteção de Fraude; Robustez; Importância dos Hiper-parâmetros
title_short	Label noise injection methods for model robustness assessment in fraud detection datasets
title_full	Label noise injection methods for model robustness assessment in fraud detection datasets
title_fullStr	Label noise injection methods for model robustness assessment in fraud detection datasets
title_full_unstemmed	Label noise injection methods for model robustness assessment in fraud detection datasets
title_sort	Label noise injection methods for model robustness assessment in fraud detection datasets
author	Santos, Sofia Jerónimo dos
author_facet	Santos, Sofia Jerónimo dos
author_role	author
dc.contributor.none.fl_str_mv	Castelli, Mauro; Silva, Maria Inês Pastor Pereira da; Ferreira, João Guilherme Simões Bravo; RUN
dc.contributor.author.fl_str_mv	Santos, Sofia Jerónimo dos
dc.subject.por.fl_str_mv	Label noise; Fraud detection; Random Forest; LightGBM; Model robustness; Hyperparameter importance; Rótulos Incorretos; Deteção de Fraude; Robustez; Importância dos Hiper-parâmetros
topic	Label noise; Fraud detection; Random Forest; LightGBM; Model robustness; Hyperparameter importance; Rótulos Incorretos; Deteção de Fraude; Robustez; Importância dos Hiper-parâmetros
description	Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
publishDate	2021
dc.date.none.fl_str_mv	2021-03-01T15:45:36Z 2021-01-11 2021-01-11T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/112794 TID:202654672
url	http://hdl.handle.net/10362/112794
identifier_str_mv	TID:202654672
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799138034263785472
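
The abstract in this record describes injecting random label noise at different percentages into binary fraud labels. As a rough illustration of that idea only, the following minimal Python sketch flips a chosen fraction of labels selected uniformly at random. It is not taken from the dissertation: the function name `inject_random_label_noise`, the `noise_rate` and `seed` parameters, and the toy label vector are all assumptions made for this example, and the thesis's not-at-random techniques are not covered here.

```python
import random


def inject_random_label_noise(labels, noise_rate, seed=0):
    """Return a copy of binary `labels` with a fraction `noise_rate` flipped.

    Hypothetical helper for illustration; positions to flip are drawn
    uniformly at random without replacement.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)  # local generator so results are reproducible
    n_flip = round(noise_rate * len(labels))
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    # Flip 0 -> 1 and 1 -> 0 at the selected positions only.
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


# Assumed toy data: 1 = fraud, 0 = legitimate (imbalanced, as is typical for fraud).
clean = [0] * 95 + [1] * 5
noisy = inject_random_label_noise(clean, noise_rate=0.10)
n_changed = sum(a != b for a, b in zip(clean, noisy))
print(f"{n_changed} of {len(clean)} labels flipped")
```

In a robustness study of the kind the abstract outlines, a function like this would presumably be applied to the training and validation labels at several noise rates while the test labels stay clean, so that performance on noisy and clean sets can be compared.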
