When Two are Better Than One: Synthesizing Heavily Unbalanced Data
Main author: | Ferreira, Francisco |
---|---|
Publication date: | 2021 |
Other authors: | Lourenço, Nuno; Cabral, Bruno; Fernandes, João Paulo |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10316/101173 https://doi.org/10.1109/ACCESS.2021.3126656 |
Abstract: | Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data. |
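
The abstract's two ideas, one generator per class and an F1 comparison between a classifier trained on real data and one trained on synthetic data (both tested on real data), can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy two-feature data, the `GaussianGenerator` class (a simple per-feature Gaussian sampler standing in for a GAN generator, not the paper's model), and the nearest-centroid classifier are hypothetical and are not taken from the article.

```python
import random
from statistics import mean, pstdev


def make_real_data(n_legit, n_fraud, rng):
    """Toy, heavily unbalanced 2-feature data; label 1 = fraud (hypothetical)."""
    legit = [([rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)], 0) for _ in range(n_legit)]
    fraud = [([rng.gauss(4.0, 1.0), rng.gauss(4.0, 1.0)], 1) for _ in range(n_fraud)]
    return legit + fraud


class GaussianGenerator:
    """Stand-in for ONE GAN generator: per-feature Gaussian fitted to one class."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.mu = [mean(c) for c in columns]
        self.sigma = [pstdev(c) for c in columns]
        return self

    def sample(self, n, rng):
        return [[rng.gauss(m, s) for m, s in zip(self.mu, self.sigma)] for _ in range(n)]


def synthesize(data, rng):
    """Fit a separate generator per class and keep the original class counts."""
    synthetic = []
    for label in (0, 1):
        rows = [x for x, y in data if y == label]
        generator = GaussianGenerator().fit(rows)
        synthetic += [(x, label) for x in generator.sample(len(rows), rng)]
    return synthetic


class NearestCentroid:
    """Tiny classifier used only to compare the two training sets."""

    def fit(self, data):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, y in data if y == label]
            self.centroids[label] = [mean(c) for c in zip(*rows)]
        return self

    def predict(self, x):
        def dist2(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda label: dist2(self.centroids[label]))


def f1_score(model, data):
    """F1 for the fraud class (label 1)."""
    tp = fp = fn = 0
    for x, y in data:
        pred = model.predict(x)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


rng = random.Random(0)
train = make_real_data(2000, 40, rng)   # real training set
test = make_real_data(1000, 20, rng)    # real held-out test set
synthetic = synthesize(train, rng)      # shareable replacement for `train`

f1_real = f1_score(NearestCentroid().fit(train), test)       # train real, test real
f1_synth = f1_score(NearestCentroid().fit(synthetic), test)  # train synthetic, test real
print(f"F1 real: {f1_real:.3f}  F1 synthetic: {f1_synth:.3f}  gap: {abs(f1_real - f1_synth):.3f}")
```

Fitting the minority class with its own generator means the few fraudulent rows are not swamped by the legitimate majority during fitting, which is the motivation the abstract gives for using two generators. The printed gap is the quantity the abstract reports as at most 5% for Duo-GAN; the numbers this toy produces say nothing about the paper's results.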
id |
RCAP_7eb2efed424c1e58e843cf2775895e16 |
---|---|
oai_identifier_str |
oai:estudogeral.uc.pt:10316/101173 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
When Two are Better Than One: Synthesizing Heavily Unbalanced DataFraud detectiongenerative adversarial networksprivacymachine learningsynthetic data generationtabular dataNowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. 
With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data.2021info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articlehttp://hdl.handle.net/10316/101173http://hdl.handle.net/10316/101173https://doi.org/10.1109/ACCESS.2021.3126656eng2169-3536Ferreira, FranciscoLourenço, NunoCabral, BrunoFernandes, João Pauloinfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2022-08-16T20:49:36Zoai:estudogeral.uc.pt:10316/101173Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T21:18:25.473354Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse |
dc.title.none.fl_str_mv |
When Two are Better Than One: Synthesizing Heavily Unbalanced Data |
author |
Ferreira, Francisco |
author_facet |
Ferreira, Francisco Lourenço, Nuno Cabral, Bruno Fernandes, João Paulo |
author_role |
author |
author2 |
Lourenço, Nuno Cabral, Bruno Fernandes, João Paulo |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Ferreira, Francisco Lourenço, Nuno Cabral, Bruno Fernandes, João Paulo |
dc.subject.por.fl_str_mv |
Fraud detection generative adversarial networks privacy machine learning synthetic data generation tabular data |
topic |
Fraud detection generative adversarial networks privacy machine learning synthetic data generation tabular data |
description |
Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data. |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10316/101173 http://hdl.handle.net/10316/101173 https://doi.org/10.1109/ACCESS.2021.3126656 |
url |
http://hdl.handle.net/10316/101173 https://doi.org/10.1109/ACCESS.2021.3126656 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
2169-3536 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134079058182144 |