Named entity recognition for sensitive data discovery in Portuguese

Bibliographic details
Main author: Dias, M.
Publication date: 2020
Other authors: Boné, J., Ferreira, J., Ribeiro, R., Maia, R.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/20414
Abstract: Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. Efforts to build automatic systems are ongoing but, in most cases, the underlying processes are still manual or semi-automatic. In this work, we developed a component that extracts and classifies sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and meet legal and security requirements. We studied a hybrid approach to Named Entity Recognition for the Portuguese language, combining rule-based and lexicon-based models, machine learning algorithms, and neural networks. The rule-based and lexicon-based approaches were used only for a specific set of classes. For the remaining entity classes, two statistical models were tested, Conditional Random Fields and Random Forest, followed by a Bidirectional LSTM (Bi-LSTM) approach. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.
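Illustrative note: the abstract describes rules that handle a specific set of classes and reports results as F1-scores. The Python sketch below shows what a rule-based pass and an exact-match span F1 could look like. It is not the authors' implementation. The class names (EMAIL, POSTAL_CODE), the regular expressions, the sample sentence, and the function names are all assumptions made for illustration, since the abstract does not list the classes or the scoring protocol.

```python
import re

# Hypothetical classes and patterns, chosen only for illustration;
# the abstract does not say which classes are handled by rules.
RULES = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "POSTAL_CODE": re.compile(r"\b\d{4}-\d{3}\b"),
}


def rule_based_entities(text):
    """Return a set of (start, end, label) spans matched by the rules."""
    spans = set()
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), label))
    return spans


def f1_score(predicted, gold):
    """Exact-match span F1: harmonic mean of precision and recall."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Made-up sample sentence (assumption, not taken from any corpus).
text = "Contacte maria@example.pt, 1649-026 Lisboa."
predicted = rule_based_entities(text)
for start, end, label in sorted(predicted):
    print(label, text[start:end])
```

In a hybrid setup of the kind the abstract outlines, spans like these would cover only the pattern-like classes, with a statistical or neural tagger handling the remaining classes.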
id RCAP_b11f21686ba2a708606236ea934ce7bc
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/20414
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.contributor.author.fl_str_mv Dias, M.
Boné, J.
Ferreira, J.
Ribeiro, R.
Maia, R.
dc.subject.por.fl_str_mv Sensitive data
General data protection regulation
Natural language processing
Portuguese language
Named entity recognition
dc.date.none.fl_str_mv 2020-04-23T15:50:46Z
2020-01-01T00:00:00Z
2020
2020-04-23T16:49:46Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/20414
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv 2076-3417
10.3390/app10072303
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv MDPI
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_ 1799134858638786560