Automated anonymization of legal contracts in Portuguese
Main author: | Martins, Tomás |
---|---|
Publication date: | 2022 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10773/35124 |
Abstract: | With the introduction of the General Data Protection Regulation (GDPR), many organizations were left holding large numbers of documents that expose information which should have been kept private. Given the volume of documents involved, editing them manually would be a waste of resources. The objective of this dissertation is to develop an autonomous system that anonymizes sensitive information in contracts written in Portuguese. The system uses Google Cloud Vision, an API that provides OCR technology, to extract the text present in a document. Because these documents may be poorly legible, the images are first pre-processed with the OpenCV library to improve the readability of the text; binarization, skew correction, and noise removal algorithms, among others, were explored. The extracted text is then interpreted by an NLP library. This project uses spaCy, which provides a Portuguese pipeline trained on the WikiNER and UD Portuguese Bosque datasets. The library offers thorough part-of-speech tagging, and its model recognizes four categories of named entities. Because support for the Portuguese language is limited, rule-based algorithms were implemented alongside the spaCy processing to identify more specific types of information, such as identification numbers and postal codes. Finally, the information considered confidential is covered with black rectangles drawn by OpenCV at the coordinates returned by Google Cloud Vision OCR, and a new PDF is generated. |
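
The rule-based step mentioned in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation, which this record does not show. It assumes that "identification number" refers to the nine-digit Portuguese tax number (NIF) with its modulo-11 check digit, and that postal codes follow the NNNN-NNN format. The names `find_sensitive`, `mask`, and `Hit` are hypothetical, and replacing characters in a string only stands in for the black rectangles the system draws on the page image.

```python
import re
from dataclasses import dataclass

# Portuguese postal codes: four digits, a hyphen, three digits.
POSTAL_CODE = re.compile(r"\b\d{4}-\d{3}\b")
# Assumption: the identification number is a 9-digit NIF.
NIF = re.compile(r"\b\d{9}\b")


@dataclass(frozen=True)
class Hit:
    label: str
    start: int
    end: int
    text: str


def nif_checksum_ok(nif: str) -> bool:
    """Validate the modulo-11 check digit of a 9-digit NIF."""
    total = sum(int(d) * w for d, w in zip(nif[:8], range(9, 1, -1)))
    check = 11 - total % 11
    if check >= 10:  # remainders 0 and 1 map to check digit 0
        check = 0
    return check == int(nif[8])


def find_sensitive(text: str) -> list[Hit]:
    """Return postal codes and checksum-valid NIFs, ordered by position."""
    hits = [Hit("POSTAL_CODE", m.start(), m.end(), m.group())
            for m in POSTAL_CODE.finditer(text)]
    hits += [Hit("NIF", m.start(), m.end(), m.group())
             for m in NIF.finditer(text) if nif_checksum_ok(m.group())]
    return sorted(hits, key=lambda h: h.start)


def mask(text: str) -> str:
    """Text-level stand-in for redaction: overwrite each hit with blocks."""
    for h in reversed(find_sensitive(text)):
        text = text[:h.start] + "█" * (h.end - h.start) + text[h.end:]
    return text


if __name__ == "__main__":
    sample = "Contribuinte n.º 123456789, residente em 3810-193 Aveiro."
    for hit in find_sensitive(sample):
        print(hit.label, hit.text)
    print(mask(sample))
```

In a pipeline like the one described, the character offsets of each hit would have to be mapped back to the word bounding boxes returned by the OCR step before anything is drawn. That mapping is omitted here.
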
id |
RCAP_71f675ddc6028e34765c951d2fbd11d5 |
---|---|
oai_identifier_str |
oai:ria.ua.pt:10773/35124 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Automated anonymization of legal contracts in Portuguese |
title |
Automated anonymization of legal contracts in Portuguese |
author |
Martins, Tomás |
author_facet |
Martins, Tomás |
author_role |
author |
dc.contributor.author.fl_str_mv |
Martins, Tomás |
dc.subject.por.fl_str_mv |
Binarization; Contract; Document; Google cloud vision; Image processing; Machine learning; Named entity recognition; Natural language processing; optical character recognition; Portuguese; Preprocessing; Privacy; Processing; Pytesseract; Spacy |
topic |
Binarization; Contract; Document; Google cloud vision; Image processing; Machine learning; Named entity recognition; Natural language processing; optical character recognition; Portuguese; Preprocessing; Privacy; Processing; Pytesseract; Spacy |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-11-04T14:38:44Z 2022-07-27T00:00:00Z 2022-07-27 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10773/35124 |
url |
http://hdl.handle.net/10773/35124 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799137717336932352 |