Named entity recognition for sensitive data discovery in Portuguese

Dias, M.; Boné, J.; Ferreira, J.; Ribeiro, R.; Maia, R.

Bibliographic details
Main author:	Dias, M.
Publication date:	2020
Other authors:	Boné, J., Ferreira, J., Ribeiro, R., Maia, R.
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text:	http://hdl.handle.net/10071/20414
Abstract:	Protecting sensitive data is becoming increasingly important, especially as a result of the directives and laws imposed by the European Union. Efforts to build automatic systems are ongoing, but in most cases the processes behind them are still manual or semi-automatic. In this work, we developed a component that extracts and classifies sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and to meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques: rule-based/lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a specific set of classes. For the remaining entity classes, two statistical models were tested, Conditional Random Fields and Random Forest, and finally a Bidirectional LSTM (Bi-LSTM) approach was tested. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.

Item metadata

id	RCAP_b11f21686ba2a708606236ea934ce7bc
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/20414
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str	7160
dc.title.none.fl_str_mv	Named entity recognition for sensitive data discovery in Portuguese
author	Dias, M.
author_facet	Dias, M.; Boné, J.; Ferreira, J.; Ribeiro, R.; Maia, R.
author_role	author
author2	Boné, J.; Ferreira, J.; Ribeiro, R.; Maia, R.
author2_role	author author author author
dc.contributor.author.fl_str_mv	Dias, M.; Boné, J.; Ferreira, J.; Ribeiro, R.; Maia, R.
dc.subject.por.fl_str_mv	Sensitive data; General data protection regulation; Natural language processing; Portuguese language; Named entity recognition
topic	Sensitive data; General data protection regulation; Natural language processing; Portuguese language; Named entity recognition
publishDate	2020
dc.date.none.fl_str_mv	2020-04-23T15:50:46Z 2020-01-01T00:00:00Z 2020 2020-04-23T16:49:46Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/20414
url	http://hdl.handle.net/10071/20414
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	ISSN 2076-3417; DOI 10.3390/app10072303
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	MDPI
publisher.none.fl_str_mv	MDPI
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
_version_	1799134858638786560
