An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Carnaz, Gonçalo; Antunes, Mário; Nogueira, Vitor Beires

Bibliographic details
Main author:	Carnaz, Gonçalo
Publication date:	2021
Other authors:	Antunes, Mário; Nogueira, Vitor Beires
Document type:	Article
Language:	por
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10174/34695 https://doi.org/10.3390/data6070071 (Carnaz, G.; Antunes, M.; Nogueira, V.B. An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data 2021, 6, 71.)
Abstract:	Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents supports the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they are unable to process criminal reports in the Portuguese language, because no annotated corpus exists for that purpose. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained for the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
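
The abstract reports mean precision, recall, and F1-score but does not say how the means were computed. One common convention, assumed here and not confirmed by the record, is macro-averaging: compute the three scores per entity type, then average each score across types. Under that convention the mean F1 need not equal the harmonic mean of the mean precision and mean recall. The harmonic mean of 0.808 and 0.722 is about 0.763, not the reported 0.733, which is at least consistent with per-type averaging. The sketch below illustrates the effect on invented annotations; the entity types, spans, and function names are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def per_type_scores(gold, predicted):
    """Exact-match precision, recall and F1 per entity type.

    gold, predicted: sets of (doc_id, start, end, entity_type) tuples.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for ann in predicted:
        if ann in gold:
            tp[ann[3]] += 1
        else:
            fp[ann[3]] += 1
    for ann in gold:
        if ann not in predicted:
            fn[ann[3]] += 1

    scores = {}
    for etype in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[etype] + fp[etype]
        r_den = tp[etype] + fn[etype]
        p = tp[etype] / p_den if p_den else 0.0
        r = tp[etype] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[etype] = (p, r, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of each score across entity types."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Invented toy annotations (hypothetical, not from the corpus).
gold = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 10, 18, "PERSON"),
    ("d1", 20, 28, "LOCATION"),
    ("d2", 3, 11, "LICENSE_PLATE"),
}
predicted = {
    ("d1", 0, 5, "PERSON"),
    ("d1", 20, 28, "LOCATION"),
    ("d1", 30, 35, "LOCATION"),
    ("d2", 3, 11, "LICENSE_PLATE"),
}

scores = per_type_scores(gold, predicted)
macro_p, macro_r, macro_f1 = macro_average(scores)
f1_of_means = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro P={macro_p:.3f} R={macro_r:.3f} F1={macro_f1:.3f}")
print(f"harmonic mean of macro P and R={f1_of_means:.3f}")
```

On this toy data the macro F1 (about 0.778) is lower than the harmonic mean of the macro precision and recall (about 0.833), the same direction as the gap between 0.733 and 0.763 in the reported figures.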

Item metadata

id	RCAP_d4686fc79746ffd1ee567800e3db2109
oai_identifier_str	oai:dspace.uevora.pt:10174/34695
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
title	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
spellingShingle	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing Carnaz, Gonçalo crime-related documents cybersecurity criminal investigation Portuguese language corpus
title_short	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
title_full	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
title_fullStr	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
title_full_unstemmed	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
title_sort	An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing
author	Carnaz, Gonçalo
author_facet	Carnaz, Gonçalo Antunes, Mário Nogueira, Vitor Beires
author_role	author
author2	Antunes, Mário Nogueira, Vitor Beires
author2_role	author author
dc.contributor.author.fl_str_mv	Carnaz, Gonçalo Antunes, Mário Nogueira, Vitor Beires
dc.subject.por.fl_str_mv	crime-related documents cybersecurity criminal investigation Portuguese language corpus
topic	crime-related documents cybersecurity criminal investigation Portuguese language corpus
publishDate	2021
dc.date.none.fl_str_mv	2021-06-26T00:00:00Z 2023-02-24T12:58:23Z 2023-02-24
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10174/34695 https://doi.org/10.3390/data6070071 Carnaz, G.; Antunes, M.; Nogueira, V.B. An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data 2021, 6, 71.
url	http://hdl.handle.net/10174/34695 https://doi.org/10.3390/data6070071
dc.language.iso.fl_str_mv	por
language	por
dc.relation.none.fl_str_mv	d34707@alunos.uevora.pt mario.antunes@ipleiria.pt vbn@uevora.pt 498
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1817551331251978240
