Using NLP and Machine Learning to Detect Data Privacy Violations
Autor(a) principal: | |
---|---|
Data de Publicação: | 2020 |
Outros Autores: | , , , |
Tipo de documento: | Artigo |
Idioma: | eng |
Título da fonte: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Texto Completo: | http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683 |
Resumo: | Privacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts. |
id |
RCAP_d369d4aa32172e9f94777513b8a5998c |
---|---|
oai_identifier_str |
oai:estudogeral.uc.pt:10316/93821 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Using NLP and Machine Learning to Detect Data Privacy ViolationsPrivacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts.IEEE2020info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/articlehttp://hdl.handle.net/10316/93821http://hdl.handle.net/10316/93821https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683eng978-1-7281-8695-5978-1-7281-8695-5 (eISSN)978-1-7281-8696-2https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683Silva, PauloGoncalves, CarolinaGodinho, CarolinaAntunes, NunoCurado, Maríliainfo:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2022-09-07T09:26:11Zoai:estudogeral.uc.pt:10316/93821Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T21:12:42.708664Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse |
dc.title.none.fl_str_mv |
Using NLP and Machine Learning to Detect Data Privacy Violations |
title |
Using NLP and Machine Learning to Detect Data Privacy Violations |
spellingShingle |
Using NLP and Machine Learning to Detect Data Privacy Violations Silva, Paulo |
title_short |
Using NLP and Machine Learning to Detect Data Privacy Violations |
title_full |
Using NLP and Machine Learning to Detect Data Privacy Violations |
title_fullStr |
Using NLP and Machine Learning to Detect Data Privacy Violations |
title_full_unstemmed |
Using NLP and Machine Learning to Detect Data Privacy Violations |
title_sort |
Using NLP and Machine Learning to Detect Data Privacy Violations |
author |
Silva, Paulo |
author_facet |
Silva, Paulo Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília |
author_role |
author |
author2 |
Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília |
author2_role |
author author author author |
dc.contributor.author.fl_str_mv |
Silva, Paulo Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília |
description |
Privacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts. |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10316/93821 http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683 |
url |
http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
978-1-7281-8695-5 978-1-7281-8695-5 (eISSN) 978-1-7281-8696-2 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
IEEE |
publisher.none.fl_str_mv |
IEEE |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134022440321024 |