When Two are Better Than One: Synthesizing Heavily Unbalanced Data

Bibliographic details
Main author: Ferreira, Francisco
Publication date: 2021
Other authors: Lourenço, Nuno; Cabral, Bruno; Fernandes, João Paulo
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10316/101173
https://doi.org/10.1109/ACCESS.2021.3126656
Abstract: Nowadays, data is king and, if treated and used properly, it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems that improve their services. However, organizations need to comply fully with both ethical and regulatory obligations; for example, privacy must be strictly respected when using or sharing data, thus protecting the interests of both users and organizations. Fraud Detection systems are examples of such systems, in which Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is the scarcity of fraudulent instances, which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transaction data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem: one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data.
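The abstract's central idea, one generator per class, can be illustrated with a minimal sketch. This is not the authors' implementation: `GaussianStandIn`, `fit_per_class`, and `synthesize` are hypothetical names, and the stand-in generator fits an independent Gaussian to each column instead of training a GAN. Only the routing is meant to match the abstract: split the rows by label, fit each generator on its own class, then sample each class separately.

```python
import random
from statistics import mean, stdev


class GaussianStandIn:
    """Placeholder generator (assumption): one independent Gaussian per column.

    This is NOT a GAN. It only stands in for one so the routing logic runs.
    """

    def fit(self, rows):
        columns = list(zip(*rows))
        # stdev needs at least two points; fall back to 0.0 for a single row.
        self.params = [(mean(c), stdev(c) if len(c) > 1 else 0.0) for c in columns]
        return self

    def sample(self, n, rng):
        return [[rng.gauss(mu, sigma) for mu, sigma in self.params] for _ in range(n)]


def fit_per_class(rows, labels, make_generator=GaussianStandIn):
    """Split the rows by class label and fit one generator per class."""
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    return {label: make_generator().fit(subset) for label, subset in by_class.items()}


def synthesize(generators, counts, seed=0):
    """Draw counts[label] synthetic rows from each class-specific generator."""
    rng = random.Random(seed)
    rows, labels = [], []
    for label, n in counts.items():
        rows.extend(generators[label].sample(n, rng))
        labels.extend([label] * n)
    return rows, labels


if __name__ == "__main__":
    # Toy data (made up): label 1 marks the rare "fraudulent" class.
    real_rows = [[10.0, 1.0], [12.0, 3.0], [11.0, 2.0], [500.0, 9.0], [520.0, 7.0]]
    real_labels = [0, 0, 0, 1, 1]
    generators = fit_per_class(real_rows, real_labels)
    synthetic_rows, synthetic_labels = synthesize(generators, {0: 6, 1: 4})
    print(len(synthetic_rows), synthetic_labels.count(1))
```

Because each generator only ever sees rows from its own class, the rare class is not drowned out by the majority class during fitting, and the caller chooses the class ratio of the synthetic set through `counts`.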
id RCAP_7eb2efed424c1e58e843cf2775895e16
oai_identifier_str oai:estudogeral.uc.pt:10316/101173
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.subject.por.fl_str_mv Fraud detection
generative adversarial networks
privacy
machine learning
synthetic data generation
tabular data
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10316/101173
https://doi.org/10.1109/ACCESS.2021.3126656
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 2169-3536
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
_version_ 1799134079058182144