Using NLP and Machine Learning to Detect Data Privacy Violations

Silva, Paulo; Goncalves, Carolina; Godinho, Carolina; Antunes, Nuno; Curado, Marília

Bibliographic details
Main author:	Silva, Paulo
Publication date:	2020
Other authors:	Goncalves, Carolina; Godinho, Carolina; Antunes, Nuno; Curado, Marília
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683
Abstract:	Privacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts.

Item metadata

id	RCAP_d369d4aa32172e9f94777513b8a5998c
oai_identifier_str	oai:estudogeral.uc.pt:10316/93821
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Using NLP and Machine Learning to Detect Data Privacy Violations
title	Using NLP and Machine Learning to Detect Data Privacy Violations
spellingShingle	Using NLP and Machine Learning to Detect Data Privacy Violations Silva, Paulo
title_short	Using NLP and Machine Learning to Detect Data Privacy Violations
title_full	Using NLP and Machine Learning to Detect Data Privacy Violations
title_fullStr	Using NLP and Machine Learning to Detect Data Privacy Violations
title_full_unstemmed	Using NLP and Machine Learning to Detect Data Privacy Violations
title_sort	Using NLP and Machine Learning to Detect Data Privacy Violations
author	Silva, Paulo
author_facet	Silva, Paulo Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília
author_role	author
author2	Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília
author2_role	author author author author
dc.contributor.author.fl_str_mv	Silva, Paulo Goncalves, Carolina Godinho, Carolina Antunes, Nuno Curado, Marília
description	Privacy concerns are constantly increasing in different sectors. Regulations such as the EU's General Data Protection Regulation (GDPR) are pressuring organizations to handle the individual's data with reinforced caution. As information systems deal with increasingly large amounts of personal data in essential services, there is a lack of mechanisms to help organizations in protecting the involved data subjects. In this paper, we propose and evaluate the use of Named Entity Recognition as a way to identify, monitor and validate Personally Identifiable Information. In our experiments, we used three of the most well-known Natural Language Processing tools (NLTK, Stanford CoreNLP, and spaCy). First, we assess the effectiveness of the tools with a generic dataset. Then, machine learning models are trained and evaluated with datasets built on data that contain personally identifiable information. The results show that models' performance was highly positive in accurately classifying both generic and more context-specific data. We observe the relationship between the datasets' training size and respective performance and estimate the appropriate size for model training within this context. Furthermore, we discuss how our proposal can effectively act as a Privacy Enhancing Technology as well as the potential risks and associated impacts.
publishDate	2020
dc.date.none.fl_str_mv	2020
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10316/93821 http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683
url	http://hdl.handle.net/10316/93821 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	978-1-7281-8695-5 978-1-7281-8695-5 (eISSN) 978-1-7281-8696-2 https://doi.org/10.1109/INFOCOMWKSHPS50562.2020.9162683
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	IEEE
publisher.none.fl_str_mv	IEEE
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134022440321024
