NER in archival finding aids: extended

Bibliographic details
Main author: Cunha, Luís Filipe da Costa
Publication date: 2022
Other authors: Ramalho, José Carlos
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: https://hdl.handle.net/1822/76687
Abstract: The amount of information preserved in Portuguese archives has increased over the years. These documents are a national heritage of high importance, as they portray the country’s history. Most Portuguese archives have now made their finding aids available to the public in digital format; however, these data carry no annotation, so their content is not always easy to analyze. In this work, we created Named Entity Recognition solutions that identify and classify several types of named entities in archival finding aids. These named entities convey crucial information about their context and, when recognized with high confidence, can serve several purposes, for example the creation of smart browsing tools using entity linking and record linking techniques. To achieve high scores, we annotated several corpora and trained our own Machine Learning models for this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made publicly available through a web platform we developed, NER@DI.
id RCAP_b8991f4bc2a51961bb2f2b0878d5c6e3
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/76687
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
dc.title.none.fl_str_mv NER in archival finding aids: extended
title NER in archival finding aids: extended
spellingShingle NER in archival finding aids: extended
Cunha, Luís Filipe da Costa
named entity recognition
archival search aids
machine learning
deep learning
maximum entropy
Ciências Naturais::Ciências da Computação e da Informação
Science & Technology
title_short NER in archival finding aids: extended
title_full NER in archival finding aids: extended
title_fullStr NER in archival finding aids: extended
title_full_unstemmed NER in archival finding aids: extended
title_sort NER in archival finding aids: extended
author Cunha, Luís Filipe da Costa
author_facet Cunha, Luís Filipe da Costa
Ramalho, José Carlos
author_role author
author2 Ramalho, José Carlos
author2_role author
dc.contributor.none.fl_str_mv Universidade do Minho
dc.contributor.author.fl_str_mv Cunha, Luís Filipe da Costa
Ramalho, José Carlos
dc.subject.por.fl_str_mv named entity recognition
archival search aids
machine learning
deep learning
maximum entropy
Ciências Naturais::Ciências da Computação e da Informação
Science & Technology
topic named entity recognition
archival search aids
machine learning
deep learning
maximum entropy
Ciências Naturais::Ciências da Computação e da Informação
Science & Technology
description The amount of information preserved in Portuguese archives has increased over the years. These documents are a national heritage of high importance, as they portray the country’s history. Most Portuguese archives have now made their finding aids available to the public in digital format; however, these data carry no annotation, so their content is not always easy to analyze. In this work, we created Named Entity Recognition solutions that identify and classify several types of named entities in archival finding aids. These named entities convey crucial information about their context and, when recognized with high confidence, can serve several purposes, for example the creation of smart browsing tools using entity linking and record linking techniques. To achieve high scores, we annotated several corpora and trained our own Machine Learning models for this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made publicly available through a web platform we developed, NER@DI.
publishDate 2022
dc.date.none.fl_str_mv 2022-01-17
2022-01-17T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://hdl.handle.net/1822/76687
url https://hdl.handle.net/1822/76687
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Cunha, L.F.d.C.; Ramalho, J.C. NER in Archival Finding Aids: Extended. Mach. Learn. Knowl. Extr. 2022, 4, 42-65. https://doi.org/10.3390/make4010003
2504-4990
10.3390/make4010003
https://www.mdpi.com/2504-4990/4/1/3
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Multidisciplinary Digital Publishing Institute
publisher.none.fl_str_mv Multidisciplinary Digital Publishing Institute
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799132845366575104