Natural language processing for sensitive data recognition and privacy in digital documents

Bibliographic details
Main author: Vieira, Samuel Antunes
Publication date: 2024
Document type: Master's thesis
Language: por
Source title: Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF)
Full text: http://tede.upf.br:8080/jspui/handle/tede/2765
Abstract: Keeping confidential information secure in personal documents has always been critical to guaranteeing the privacy of people and companies. With the frequent digitization of documents and the adoption of laws and regulations, this task has become even more relevant. In this context, security applications can censor critical text in digital documents. Because protecting data through censorship can require intensive manual work to identify the specific location of sensitive data and is subject to human error, automation is an option for handling the entire process. With that in mind, this work presents DOCDOM, a proof-of-concept software tool that integrates multiple tools for sensitive data recognition and privacy in digital documents. The approach uses optical character recognition to obtain text from documents, applies a natural language processing model focused on named entity recognition to identify confidential data, and censors that data using library resources for digital document processing. Preliminary results showed that DOCDOM works well, achieving reasonable evaluation metrics on two test data sets of 1000 files each (AUC-PR of 0.9266 and 0.6681). A detailed analysis identified noise problems in some files during text classification tasks, which still need to be handled through noise distinction and filtering strategies. Despite this, the proposed solution presented acceptable initial results for a proof of concept, with good precision and accuracy for files with a simple structure and non-numeric sensitive content.
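Illustrative sketch (not part of the record and not DOCDOM's code): the abstract describes a three-stage flow of text extraction, named entity recognition, and censorship. The Python below shows only the last two stages on text that is assumed to have already come out of OCR. The function names, the `#` mask character, and the regex that stands in for the NER model are all hypothetical; the thesis uses a trained NLP model, which this toy pattern does not reproduce.

```python
import re
from typing import List, Tuple

# (start, end, label) character offsets of a detected entity.
Span = Tuple[int, int, str]

# Hypothetical stand-in for a trained NER model: flags two consecutive
# capitalized words as a PERSON. This is an assumption for illustration only.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b")


def find_entities(text: str) -> List[Span]:
    """Return the spans the stand-in recognizer considers sensitive."""
    return [(m.start(), m.end(), "PERSON") for m in NAME_PATTERN.finditer(text)]


def censor(text: str, spans: List[Span], mask: str = "#") -> str:
    """Replace every non-space character inside each span with the mask."""
    chars = list(text)
    for start, end, _label in spans:
        for i in range(start, end):
            if not chars[i].isspace():
                chars[i] = mask
    return "".join(chars)


def process(ocr_text: str) -> str:
    """Recognize entities in OCR output and censor them in one step."""
    return censor(ocr_text, find_entities(ocr_text))


if __name__ == "__main__":
    print(process("signed by Maria Silva in 2024"))
```

Masking character by character keeps the text length unchanged, so the same offsets could in principle be mapped back to positions in the source document; whether DOCDOM works this way is not stated in the abstract.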
id UPF-1_c4774d00633400e49490839526cbd964
oai_identifier_str oai:tede.upf.br:tede/2765
network_acronym_str UPF-1
network_name_str Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF)
repository_id_str
spelling Submitted by Franciele Silva (francielesilva@upf.br) on 2024-09-17T13:07:26Z. No. of bitstreams: 1. 2024SamuelAntunesVieira.pdf: 2183932 bytes, checksum: 0fb5acb26ffe493823d6e53d458515d6 (MD5). Made available in DSpace on 2024-09-17T13:07:26Z (GMT). Previous issue date: 2024-03-27.
dc.title.por.fl_str_mv Natural language processing for sensitive data recognition and privacy in digital documents
title Natural language processing for sensitive data recognition and privacy in digital documents
spellingShingle Natural language processing for sensitive data recognition and privacy in digital documents
Vieira, Samuel Antunes
Proteção de dados
Automação
Documentos eletrônicos
CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
title_short Natural language processing for sensitive data recognition and privacy in digital documents
title_full Natural language processing for sensitive data recognition and privacy in digital documents
title_fullStr Natural language processing for sensitive data recognition and privacy in digital documents
title_full_unstemmed Natural language processing for sensitive data recognition and privacy in digital documents
title_sort Natural language processing for sensitive data recognition and privacy in digital documents
author Vieira, Samuel Antunes
author_facet Vieira, Samuel Antunes
author_role author
dc.contributor.advisor1.fl_str_mv Rieder, Rafael
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/3010497094377497
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/2195236331426283
dc.contributor.author.fl_str_mv Vieira, Samuel Antunes
contributor_str_mv Rieder, Rafael
dc.subject.por.fl_str_mv Proteção de dados
Automação
Documentos eletrônicos
topic Proteção de dados
Automação
Documentos eletrônicos
CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
dc.subject.cnpq.fl_str_mv CIENCIA DA COMPUTACAO::SISTEMAS DE COMPUTACAO
description Keeping confidential information secure in personal documents has always been critical to guaranteeing the privacy of people and companies. With the frequent digitization of documents and the adoption of laws and regulations, this task has become even more relevant. In this context, security applications can censor critical text in digital documents. Because protecting data through censorship can require intensive manual work to identify the specific location of sensitive data and is subject to human error, automation is an option for handling the entire process. With that in mind, this work presents DOCDOM, a proof-of-concept software tool that integrates multiple tools for sensitive data recognition and privacy in digital documents. The approach uses optical character recognition to obtain text from documents, applies a natural language processing model focused on named entity recognition to identify confidential data, and censors that data using library resources for digital document processing. Preliminary results showed that DOCDOM works well, achieving reasonable evaluation metrics on two test data sets of 1000 files each (AUC-PR of 0.9266 and 0.6681). A detailed analysis identified noise problems in some files during text classification tasks, which still need to be handled through noise distinction and filtering strategies. Despite this, the proposed solution presented acceptable initial results for a proof of concept, with good precision and accuracy for files with a simple structure and non-numeric sensitive content.
publishDate 2024
dc.date.accessioned.fl_str_mv 2024-09-17T13:07:26Z
dc.date.issued.fl_str_mv 2024-03-27
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv VIEIRA, Samuel Antunes. Natural language processing for sensitive data recognition and privacy in digital documents. 2024. 47 f. Dissertação (Mestrado em Computação Aplicada) - Universidade de Passo Fundo, Passo Fundo, RS, 2024.
dc.identifier.uri.fl_str_mv http://tede.upf.br:8080/jspui/handle/tede/2765
identifier_str_mv VIEIRA, Samuel Antunes. Natural language processing for sensitive data recognition and privacy in digital documents. 2024. 47 f. Dissertação (Mestrado em Computação Aplicada) - Universidade de Passo Fundo, Passo Fundo, RS, 2024.
url http://tede.upf.br:8080/jspui/handle/tede/2765
dc.language.iso.fl_str_mv por
language por
dc.relation.program.fl_str_mv -709265676002654826
dc.relation.confidence.fl_str_mv 500
500
600
dc.relation.department.fl_str_mv 8143310132950509647
dc.relation.cnpq.fl_str_mv 8930092515683771531
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade de Passo Fundo
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Computação Aplicada
dc.publisher.initials.fl_str_mv UPF
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv Instituto de Tecnologia – ITEC
publisher.none.fl_str_mv Universidade de Passo Fundo
dc.source.none.fl_str_mv reponame:Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF)
instname:Universidade de Passo Fundo (UPF)
instacron:UPF
instname_str Universidade de Passo Fundo (UPF)
instacron_str UPF
institution UPF
reponame_str Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF)
collection Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF)
bitstream.url.fl_str_mv http://tede.upf.br:8080/jspui/bitstream/tede/2765/2/2024SamuelAntunesVieira.pdf
http://tede.upf.br:8080/jspui/bitstream/tede/2765/1/license.txt
bitstream.checksum.fl_str_mv 0fb5acb26ffe493823d6e53d458515d6
1ea0bfd7af108792edd8df732bb777fc
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Biblioteca de teses e dissertações da Universidade de Passo Fundo (BDTD UPF) - Universidade de Passo Fundo (UPF)
repository.mail.fl_str_mv biblio@upf.br || bio@upf.br || cas@upf.br || car@upf.br || lve@upf.br || sar@upf.br || sol@upf.br || upfmundi@upf.br || jucelei@upf.br
_version_ 1817440931766337536