UMAP-SMOTENC: an efficient and consistent alternative for fully synthetic data generation

Bibliographic details
Main author: Almeida, Gonçalo de
Publication date: 2023
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/148985
Summary: Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business Analytics
id RCAP_36f221368002f1ea215d2bd7f55de33a
oai_identifier_str oai:run.unl.pt:10362/148985
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
Abstract: The intensification of governmental legislation and of social awareness around data privacy protection severely constrains organizations' ability to use data. As a result, interest in data anonymization techniques, which ideally preserve the patterns present in the original data while mitigating the risk of privacy leakage, has also been on the rise. While conventional methods have been shown to compromise privacy, recently proposed deep learning generative approaches are computationally expensive and unreliable when applied to tabular datasets, hindering the democratization and usability of data. In this paper, we explore this trade-off between privacy and the quality of the anonymized data, establishing a new equilibrium by applying a synthetic oversampling technique, SMOTE-NC, to a non-linearly compressed version of the input space obtained with UMAP. The approach was developed to provide an efficient and consistent solution that can be used without significant hyperparameter tuning effort or resorting to massive computing infrastructure. To evaluate the robustness of the proposed solution, an experiment was conducted comparing several metrics and models across eight datasets with diverse characteristics. The results suggest that the presented method can efficiently synthesize privacy-aware data while conserving the relevant patterns of the real dataset, particularly those required for machine learning-based classification tasks.
author Almeida, Gonçalo de
author_facet Almeida, Gonçalo de
author_role author
dc.contributor.none.fl_str_mv Bação, Fernando José Ferreira Lucas
RUN
dc.contributor.author.fl_str_mv Almeida, Gonçalo de
dc.subject.por.fl_str_mv Anonymization Techniques
Machine Learning
SMOTE
Synthetic Data Generation
UMAP
publishDate 2023
dc.date.none.fl_str_mv 2023-01-24
2023-01-24T00:00:00Z
2024-01-24T01:31:41Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/148985
TID:203220196
url http://hdl.handle.net/10362/148985
identifier_str_mv TID:203220196
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv embargoedAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799138126045642752