NER in archival finding aids: extended
Main author: | Cunha, Luís Filipe da Costa |
---|---|
Publication date: | 2022 |
Other authors: | Ramalho, José Carlos |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
Full text: | https://hdl.handle.net/1822/76687 |
Abstract: | The amount of information preserved in Portuguese archives has increased over the years. These documents represent a national heritage of high importance, as they portray the country’s history. Currently, most Portuguese archives have made their finding aids available to the public in digital format; however, these data are not annotated, so it is not always easy to analyze their content. In this work, we created Named Entity Recognition solutions that identify and classify several types of named entities in archival finding aids. These named entities carry crucial information about their context and, when recognized with high confidence, can be used for several purposes, such as building smart browsing tools through entity linking and record linking techniques. To achieve high scores, we annotated several corpora to train our own Machine Learning models for this domain. We also used different architectures, such as CNNs, LSTMs, and Maximum Entropy models. Finally, all the created datasets and ML models were made publicly available through a web platform developed for this purpose, NER@DI. |
id |
RCAP_b8991f4bc2a51961bb2f2b0878d5c6e3 |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/76687 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
NER in archival finding aids: extended |
author |
Cunha, Luís Filipe da Costa |
author_role |
author |
author2 |
Ramalho, José Carlos |
author2_role |
author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.subject.por.fl_str_mv |
named entity recognition; archival search aids; machine learning; deep learning; maximum entropy; Ciências Naturais::Ciências da Computação e da Informação; Science & Technology |
publishDate |
2022 |
dc.date.none.fl_str_mv |
2022-01-17 2022-01-17T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1822/76687 |
dc.language.iso.fl_str_mv |
eng |
dc.relation.none.fl_str_mv |
Cunha, L.F.d.C.; Ramalho, J.C. NER in Archival Finding Aids: Extended. Mach. Learn. Knowl. Extr. 2022, 4, 42-65. https://doi.org/10.3390/make4010003; ISSN 2504-4990; DOI 10.3390/make4010003; https://www.mdpi.com/2504-4990/4/1/3 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Multidisciplinary Digital Publishing Institute |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
_version_ |
1799132845366575104 |