Assessing NER tools for dialogue data anonymization
Main author: | Pereira, Miguel Alexandre da Silva Sarmento Falco |
---|---|
Publication date: | 2023 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10071/30399 |
Abstract: | As the number of organizations processing sensitive data grows, so does the need for businesses to protect their customers' privacy. However, the prevailing methods for protecting sensitive data often involve manual or semi-automatic procedures, which can be resource-intensive and error-prone. This dissertation addresses data anonymization by focusing on Named Entity Recognition (NER) models. In particular, we investigate and compare several NER models for the Portuguese language to anonymize unstructured data automatically and effectively. The models spaCy, STRING, WikiNEuRal and RoBERTa are used in the machine learning approach to identify classes such as Person, Location, and Organization. The rule-based approach, in turn, identifies classes such as NIF (the Portuguese tax identification number), Email, Car Plate and Postal Code. Additionally, a Flask API tool was created that processes unstructured data and anonymizes it: given a string that simulates a message, it automatically anonymizes the content that might be considered sensitive. This tool combines several techniques for identifying and extracting named entities in Portuguese, based on rule models and machine learning. Combining rule-based and machine learning models in the same tool was crucial to cover more sensitive classes for anonymization. The entity extraction results of the tool built in this work comprise the results for the three classes obtained with the spaCy model, plus the results obtained for the rule models created. |
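
The combination of rule-based and model-based detection described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the regular expressions are simplified approximations of Portuguese formats, and the names `RULES`, `rule_spans`, `anonymize` and the `model_spans` parameter are hypothetical. Spans from an NER model (for example Person or Location) are assumed to arrive as `(start, end, label)` tuples and are merged with the rule matches before replacement.

```python
import re

# Hypothetical, simplified patterns; the rules used in the dissertation
# are not given in this record and may differ.
RULES = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),
    "CAR_PLATE": re.compile(
        r"\b(?:[A-Z]{2}-\d{2}-\d{2}|\d{2}-[A-Z]{2}-\d{2}"
        r"|\d{2}-\d{2}-[A-Z]{2}|[A-Z]{2}-\d{2}-[A-Z]{2})\b"
    ),
    # Nine digits only; the NIF check digit is not validated here.
    "NIF": re.compile(r"\b\d{9}\b"),
}


def rule_spans(text):
    """Return a (start, end, label) tuple for every rule match in text."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return spans


def anonymize(text, model_spans=()):
    """Replace rule matches and model-provided spans with [LABEL] tags.

    model_spans is assumed to hold (start, end, label) tuples produced by
    an NER model. When spans overlap, the earliest (and, on ties, the
    longest) span wins and the others are skipped.
    """
    spans = sorted(
        list(model_spans) + rule_spans(text),
        key=lambda s: (s[0], -(s[1] - s[0])),
    )
    pieces = []
    cursor = 0
    for start, end, label in spans:
        if start < cursor:
            continue  # overlaps a span that was already replaced
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Keeping both sources in one span list means a single pass decides which detection wins when they overlap, so adding another rule or another model only requires producing more tuples.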
id |
RCAP_2097d11fbffc77fe74929887f35c633f |
---|---|
oai_identifier_str |
oai:repositorio.iscte-iul.pt:10071/30399 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Assessing NER tools for dialogue data anonymization |
title |
Assessing NER tools for dialogue data anonymization |
title_short |
Assessing NER tools for dialogue data anonymization |
title_full |
Assessing NER tools for dialogue data anonymization |
title_fullStr |
Assessing NER tools for dialogue data anonymization |
title_full_unstemmed |
Assessing NER tools for dialogue data anonymization |
title_sort |
Assessing NER tools for dialogue data anonymization |
author |
Pereira, Miguel Alexandre da Silva Sarmento Falco |
author_facet |
Pereira, Miguel Alexandre da Silva Sarmento Falco |
author_role |
author |
dc.contributor.author.fl_str_mv |
Pereira, Miguel Alexandre da Silva Sarmento Falco |
dc.subject.por.fl_str_mv |
Data anonymization; Entities extraction; Processamento de linguagem natural -- NLP Natural language processing; Artificial; Named entity recognition; Sensitive data; Anonimização de dados; Extração de entidades; Reconhecimento de entidades mencionadas; Dados sensíveis |
topic |
Data anonymization; Entities extraction; Processamento de linguagem natural -- NLP Natural language processing; Artificial; Named entity recognition; Sensitive data; Anonimização de dados; Extração de entidades; Reconhecimento de entidades mencionadas; Dados sensíveis |
description |
As the number of organizations processing sensitive data grows, so does the need for businesses to protect their customers' privacy. However, the prevailing methods for protecting sensitive data often involve manual or semi-automatic procedures, which can be resource-intensive and error-prone. This dissertation addresses data anonymization by focusing on Named Entity Recognition (NER) models. In particular, we investigate and compare several NER models for the Portuguese language to anonymize unstructured data automatically and effectively. The models spaCy, STRING, WikiNEuRal and RoBERTa are used in the machine learning approach to identify classes such as Person, Location, and Organization. The rule-based approach, in turn, identifies classes such as NIF (the Portuguese tax identification number), Email, Car Plate and Postal Code. Additionally, a Flask API tool was created that processes unstructured data and anonymizes it: given a string that simulates a message, it automatically anonymizes the content that might be considered sensitive. This tool combines several techniques for identifying and extracting named entities in Portuguese, based on rule models and machine learning. Combining rule-based and machine learning models in the same tool was crucial to cover more sensitive classes for anonymization. The entity extraction results of the tool built in this work comprise the results for the three classes obtained with the spaCy model, plus the results obtained for the rule models created. |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-12-12T00:00:00Z 2023-12-12 2023-10 2024-01-15T11:06:48Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10071/30399 TID:203446488 |
url |
http://hdl.handle.net/10071/30399 |
identifier_str_mv |
TID:203446488 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799137016667963392 |