Utility-driven assessment of anonymized data via clustering
Main author: | Ferrão, Maria Eugénia |
---|---|
Publication date: | 2022 |
Other authors: | Prata, Paula; Fazendeiro, Paulo |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | http://hdl.handle.net/10400.6/12328 |
Abstract: | In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. The clustering techniques were explored as data utility models in the context of data anonymization, using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess the utility of the anonymized data by standard metrics, by the characteristics of the groups obtained, and by the relative risk (a relevant metric in social sciences research). For the sake of self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved after data anonymization. The results suggest that, for datasets of low dimensionality/cardinality, the anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased. |
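
The abstract names two quantities that are easy to state precisely: the k-anonymity condition and the relative risk. The following is a minimal Python sketch of both, not the authors' pipeline. The function names, the field names (`age_band`, `region`, `dropout`), and the toy records are hypothetical and chosen only for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    that occurs in `records` is shared by at least k records."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(c >= k for c in counts.values())


def relative_risk(exposed_events, exposed_total, control_events, control_total):
    """Risk of the event in the exposed group divided by the risk
    of the event in the control group."""
    return (exposed_events / exposed_total) / (control_events / control_total)


# Hypothetical toy records (not taken from the study's dataset).
records = [
    {"age_band": "18-20", "region": "North", "dropout": 1},
    {"age_band": "18-20", "region": "North", "dropout": 0},
    {"age_band": "21-25", "region": "South", "dropout": 1},
    {"age_band": "21-25", "region": "South", "dropout": 1},
]

# Each (age_band, region) combination appears exactly twice.
two_anonymous = is_k_anonymous(records, ["age_band", "region"], 2)
three_anonymous = is_k_anonymous(records, ["age_band", "region"], 3)

# Dropout risk in the "South" group (2 of 2) relative to "North" (1 of 2).
rr = relative_risk(2, 2, 1, 2)
```

Comparing a statistic such as `rr` before and after anonymization is one simple way to see whether an estimate has shifted; the study itself relies on clustering validity indices and further metrics that this sketch does not cover.
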
id |
RCAP_b9fcc3487760539aac7274bca3b0828e |
---|---|
oai_identifier_str |
oai:ubibliorum.ubi.pt:10400.6/12328 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Utility-driven assessment of anonymized data via clustering |
title |
Utility-driven assessment of anonymized data via clustering |
author |
Ferrão, Maria Eugénia |
author_facet |
Ferrão, Maria Eugénia; Prata, Paula; Fazendeiro, Paulo |
author_role |
author |
author2 |
Prata, Paula; Fazendeiro, Paulo |
author2_role |
author author |
dc.contributor.none.fl_str_mv |
uBibliorum |
dc.contributor.author.fl_str_mv |
Ferrão, Maria Eugénia; Prata, Paula; Fazendeiro, Paulo |
dc.subject.por.fl_str_mv |
Data privacy; Data utility; Clustering; Education |
topic |
Data privacy; Data utility; Clustering; Education |
description |
In this study, clustering is conceived as an auxiliary tool to identify groups of special interest. This approach was applied to a real dataset concerning an entire Portuguese cohort of higher education Law students. Several anonymized clustering scenarios were compared against the original cluster solution. The clustering techniques were explored as data utility models in the context of data anonymization, using k-anonymity and (ε, δ)-differential privacy as privacy models. The purpose was to assess the utility of the anonymized data by standard metrics, by the characteristics of the groups obtained, and by the relative risk (a relevant metric in social sciences research). For the sake of self-containment, we present an overview of anonymization and clustering methods. We used a partitional clustering algorithm and analyzed several clustering validity indices to understand to what extent the data structure is preserved after data anonymization. The results suggest that, for datasets of low dimensionality/cardinality, the anonymization procedure easily jeopardizes the clustering endeavor. In addition, there is evidence that relevant field-of-study estimates obtained from anonymized data are biased. |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-08-26T08:33:17Z; 2022-07-30; 2022-07-30T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10400.6/12328 |
url |
http://hdl.handle.net/10400.6/12328 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
Ferrão, M.E., Prata, P. & Fazendeiro, P. Utility-driven assessment of anonymized data via clustering. Sci Data 9, 456 (2022). https://doi.org/10.1038/s41597-022-01561-6; 2052-4463; 10.1038/s41597-022-01561-6 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Springer Nature |
publisher.none.fl_str_mv |
Springer Nature |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799136408066064384 |