UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation
Main author: | Almeida, Gonçalo de |
---|---|
Publication date: | 2023 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10362/148985 |
Abstract: | Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics |
id |
RCAP_36f221368002f1ea215d2bd7f55de33a |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/148985 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
spelling |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation; Anonymization Techniques; Machine Learning; SMOTE; Synthetic Data Generation; UMAP. Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics. The intensification of governmental legislation and of social awareness around data privacy protection severely constrains organizations' data utilization capabilities. As a result, interest in data anonymization techniques, which ideally should preserve the patterns present in the original data while mitigating the risk of privacy leakage, has also been on the rise. While conventional methods have been shown to compromise privacy, recently proposed deep learning generative approaches are computationally expensive and unreliable when used on tabular datasets, hindering the democratization and usability of data. In this paper, we explore this trade-off between privacy and the quality of the anonymized data, establishing a new equilibrium obtained by applying a synthetic oversampling technique, SMOTE-NC, to a non-linear compressed version of the input space, achieved with the use of UMAP. The introduced approach was developed to provide an efficient and consistent solution that can be used without significant effort in hyperparameter tuning or resorting to massive computing infrastructure. To evaluate the robustness of the proposed solution, an experiment was conducted comparing several metrics and models across eight datasets with diverse characteristics.
The results achieved suggest that the presented method can efficiently synthesize privacy-aware data while conserving the relevant patterns of the real dataset, particularly those required for machine learning-based classification tasks. Bação, Fernando José Ferreira Lucas; RUN; Almeida, Gonçalo de; 2024-01-24T01:31:41Z; 2023-01-24; 2023-01-24T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/148985; TID:203220196; eng; info:eu-repo/semantics/embargoedAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T05:30:46Z; oai:run.unl.pt:10362/148985; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:53:35.016412; Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
title |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
spellingShingle |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation; Almeida, Gonçalo de; Anonymization Techniques; Machine Learning; SMOTE; Synthetic Data Generation; UMAP |
title_short |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
title_full |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
title_fullStr |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
title_full_unstemmed |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
title_sort |
UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation |
author |
Almeida, Gonçalo de |
author_facet |
Almeida, Gonçalo de |
author_role |
author |
dc.contributor.none.fl_str_mv |
Bação, Fernando José Ferreira Lucas; RUN |
dc.contributor.author.fl_str_mv |
Almeida, Gonçalo de |
dc.subject.por.fl_str_mv |
Anonymization Techniques; Machine Learning; SMOTE; Synthetic Data Generation; UMAP |
topic |
Anonymization Techniques; Machine Learning; SMOTE; Synthetic Data Generation; UMAP |
description |
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-01-24; 2023-01-24T00:00:00Z; 2024-01-24T01:31:41Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/148985; TID:203220196 |
url |
http://hdl.handle.net/10362/148985 |
identifier_str_mv |
TID:203220196 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/embargoedAccess |
eu_rights_str_mv |
embargoedAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138126045642752 |