Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage"
Main author: | Jesus, Ananda Fernanda de [UNESP] |
---|---|
Publication date: | 2023 |
Other authors: | Triques, Maria Ligia; Segundo, Jose Eduardo Santarem [UNESP]; Albuquerque, Ana Cristina de |
Document type: | Article |
Language: | eng |
Source title: | Repositório Institucional da UNESP |
Full text: | http://dx.doi.org/10.26512/rici.v16.n1.2023.47537 http://hdl.handle.net/11449/245641 |
Abstract: | This study aims to verify the potential of applying Natural Language Processing (NLP) and Machine Learning (ML) techniques to the thematic categorization of scientific articles on the theme of cultural heritage, in two situations: one in which the categories are established a priori and one in which they are established afterwards. It is applied research with quantitative and qualitative results. The first corpus consists of scientific articles in Portuguese, with a thematic basis in Information Science, manually selected and categorized; the second corpus consists of scientific articles in English retrieved from the Web of Science and automatically categorized through search strategies and Boolean operators. Both corpora were submitted to two categorization test procedures (a supervised and an unsupervised algorithm). The results show that, in both cases, the participation of the researcher is essential in defining the representativeness of the chosen sample, and that this affects the precision and accuracy of the applied algorithms. The importance of detail and rigor in data pre-processing, and of sample size, is highlighted. However, in this study, a larger volume of data alone did not guarantee that the results were representative from the point of view of the domain studied, which indicates that multidisciplinary discussion and analysis are always needed to verify and readjust the sample parameters. |
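
The abstract above contrasts categories fixed a priori (supervised) with categories that emerge afterwards (unsupervised). The following is a minimal sketch of that contrast, not the authors' pipeline: the four toy documents and their labels are hypothetical, a nearest-centroid classifier stands in for the neural network named in the record's keywords, and the clustering is a plain average-linkage agglomerative procedure over bag-of-words cosine similarity.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the study's real corpora are not reproduced here.
LABELED_DOCS = [
    ("museum collection conservation", "museums"),
    ("museum exhibition curation", "museums"),
    ("archive records preservation", "archives"),
    ("archive digital records", "archives"),
]

def vectorize(text):
    """Bag-of-words term counts (a very light form of pre-processing)."""
    return Counter(w for w in text.lower().split() if w.isalpha())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_centroids(labeled_docs):
    """Supervised case: sum the term counts of each a priori category."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def classify(text, centroids):
    """Assign the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

def hierarchical_clusters(texts, n_clusters):
    """Unsupervised case: average-linkage agglomerative clustering."""
    vectors = [vectorize(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        total = sum(cosine(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        # Merge the two clusters with the highest average similarity.
        a, b = max(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

centroids = train_centroids(LABELED_DOCS)
texts = [text for text, _ in LABELED_DOCS]
print(classify("digital archive records management", centroids))
print(hierarchical_clusters(texts, 2))
```

Even in this toy form, both procedures depend entirely on which documents were placed in the sample, which is consistent with the abstract's point about the researcher's role in defining a representative sample.
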
id |
UNSP_a7be7ff0fad946638ceff4ec4b878ac4 |
---|---|
oai_identifier_str |
oai:repositorio.unesp.br:11449/245641 |
network_acronym_str |
UNSP |
network_name_str |
Repositório Institucional da UNESP |
repository_id_str |
2946 |
spelling |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" ; Machine learning ; Natural language processing ; Neural network algorithm ; Cultural heritage ; Hierarchical clustering algorithm ; This study aims to verify the potential of applying Natural Language Processing (NLP) and Machine Learning (ML) techniques to the thematic categorization of scientific articles on the theme of cultural heritage, in two situations: one in which the categories are established a priori and one in which they are established afterwards. It is applied research with quantitative and qualitative results. The first corpus consists of scientific articles in Portuguese, with a thematic basis in Information Science, manually selected and categorized; the second corpus consists of scientific articles in English retrieved from the Web of Science and automatically categorized through search strategies and Boolean operators. Both corpora were submitted to two categorization test procedures (a supervised and an unsupervised algorithm). The results show that, in both cases, the participation of the researcher is essential in defining the representativeness of the chosen sample, and that this affects the precision and accuracy of the applied algorithms. The importance of detail and rigor in data pre-processing, and of sample size, is highlighted. However, in this study, a larger volume of data alone did not guarantee that the results were representative from the point of view of the domain studied, which indicates that multidisciplinary discussion and analysis are always needed to verify and readjust the sample parameters. ; Univ Estadual Paulista, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil ; Univ Estadual Londrina, Programa Posgrad Ciencia Informacao, Londrina, PR, Brazil ; Univ Estadual Paulista, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil ; Univ Brasilia, Dept Ciencia Informacao ; Universidade Estadual Paulista (UNESP) ; Universidade Estadual de Londrina (UEL) ; Jesus, Ananda Fernanda de [UNESP] ; Triques, Maria Ligia ; Segundo, Jose Eduardo Santarem [UNESP] ; Albuquerque, Ana Cristina de ; 2023-07-29T12:00:51Z ; 2023-07-29T12:00:51Z ; 2023-01-01 ; info:eu-repo/semantics/publishedVersion ; info:eu-repo/semantics/article ; 167-184 ; http://dx.doi.org/10.26512/rici.v16.n1.2023.47537 ; Revista Ibero-americana de Ciencia da Informacao. Brasilia: Univ Brasilia, Dept Ciencia Informacao, v. 16, n. 1, p. 167-184, 2023. ; 1983-5213 ; http://hdl.handle.net/11449/245641 ; 10.26512/rici.v16.n1.2023.47537 ; WOS:000992663100009 ; Web of Science ; reponame:Repositório Institucional da UNESP ; instname:Universidade Estadual Paulista (UNESP) ; instacron:UNESP ; eng ; Revista Ibero-americana De Ciencia Da Informacao ; info:eu-repo/semantics/openAccess ; 2024-08-08T17:46:57Z ; oai:repositorio.unesp.br:11449/245641 ; Repositório Institucional ; PUB ; http://repositorio.unesp.br/oai/request ; opendoar:2946 ; 2024-08-08T17:46:57 ; Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP) ; false |
dc.title.none.fl_str_mv |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
title |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
spellingShingle |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" ; Jesus, Ananda Fernanda de [UNESP] ; Machine learning ; Natural language processing ; Neural network algorithm ; Cultural heritage ; Hierarchical clustering algorithm |
title_short |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
title_full |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
title_fullStr |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
title_full_unstemmed |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
title_sort |
Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage" |
author |
Jesus, Ananda Fernanda de [UNESP] |
author_facet |
Jesus, Ananda Fernanda de [UNESP] ; Triques, Maria Ligia ; Segundo, Jose Eduardo Santarem [UNESP] ; Albuquerque, Ana Cristina de |
author_role |
author |
author2 |
Triques, Maria Ligia ; Segundo, Jose Eduardo Santarem [UNESP] ; Albuquerque, Ana Cristina de |
author2_role |
author author author |
dc.contributor.none.fl_str_mv |
Universidade Estadual Paulista (UNESP) Universidade Estadual de Londrina (UEL) |
dc.contributor.author.fl_str_mv |
Jesus, Ananda Fernanda de [UNESP] ; Triques, Maria Ligia ; Segundo, Jose Eduardo Santarem [UNESP] ; Albuquerque, Ana Cristina de |
dc.subject.por.fl_str_mv |
Machine learning ; Natural language processing ; Neural network algorithm ; Cultural heritage ; Hierarchical clustering algorithm |
topic |
Machine learning ; Natural language processing ; Neural network algorithm ; Cultural heritage ; Hierarchical clustering algorithm |
description |
This study aims to verify the potential of applying Natural Language Processing (NLP) and Machine Learning (ML) techniques to the thematic categorization of scientific articles on the theme of cultural heritage, in two situations: one in which the categories are established a priori and one in which they are established afterwards. It is applied research with quantitative and qualitative results. The first corpus consists of scientific articles in Portuguese, with a thematic basis in Information Science, manually selected and categorized; the second corpus consists of scientific articles in English retrieved from the Web of Science and automatically categorized through search strategies and Boolean operators. Both corpora were submitted to two categorization test procedures (a supervised and an unsupervised algorithm). The results show that, in both cases, the participation of the researcher is essential in defining the representativeness of the chosen sample, and that this affects the precision and accuracy of the applied algorithms. The importance of detail and rigor in data pre-processing, and of sample size, is highlighted. However, in this study, a larger volume of data alone did not guarantee that the results were representative from the point of view of the domain studied, which indicates that multidisciplinary discussion and analysis are always needed to verify and readjust the sample parameters. |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-07-29T12:00:51Z 2023-07-29T12:00:51Z 2023-01-01 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://dx.doi.org/10.26512/rici.v16.n1.2023.47537 Revista Ibero-americana de Ciencia da Informacao. Brasilia: Univ Brasilia, Dept Ciencia Informacao, v. 16, n. 1, p. 167-184, 2023. 1983-5213 http://hdl.handle.net/11449/245641 10.26512/rici.v16.n1.2023.47537 WOS:000992663100009 |
url |
http://dx.doi.org/10.26512/rici.v16.n1.2023.47537 http://hdl.handle.net/11449/245641 |
identifier_str_mv |
Revista Ibero-americana de Ciencia da Informacao. Brasilia: Univ Brasilia, Dept Ciencia Informacao, v. 16, n. 1, p. 167-184, 2023. 1983-5213 10.26512/rici.v16.n1.2023.47537 WOS:000992663100009 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
Revista Ibero-americana De Ciencia Da Informacao |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
167-184 |
dc.publisher.none.fl_str_mv |
Univ Brasilia, Dept Ciencia Informacao |
publisher.none.fl_str_mv |
Univ Brasilia, Dept Ciencia Informacao |
dc.source.none.fl_str_mv |
Web of Science reponame:Repositório Institucional da UNESP instname:Universidade Estadual Paulista (UNESP) instacron:UNESP |
instname_str |
Universidade Estadual Paulista (UNESP) |
instacron_str |
UNESP |
institution |
UNESP |
reponame_str |
Repositório Institucional da UNESP |
collection |
Repositório Institucional da UNESP |
repository.name.fl_str_mv |
Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP) |
repository.mail.fl_str_mv |
|
_version_ |
1808128210408308736 |
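
The identifiers in this record can be turned into request URLs. The sketch below assumes that the endpoint `http://repositorio.unesp.br/oai/request` listed in the record still serves OAI-PMH, and it infers the handle URL from the pattern visible in this one record (the local part of the OAI identifier matches the path of the `hdl.handle.net` link). The function names are hypothetical; `GetRecord` and the `oai_dc` metadata prefix come from the OAI-PMH protocol itself.

```python
from urllib.parse import urlencode

# Values copied from this record.
OAI_BASE_URL = "http://repositorio.unesp.br/oai/request"
OAI_IDENTIFIER = "oai:repositorio.unesp.br:11449/245641"

def get_record_url(base_url, identifier, metadata_prefix="oai_dc"):
    """Build an OAI-PMH GetRecord request URL (no network access here)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"

def handle_url(identifier):
    """Derive the handle link from 'oai:<host>:<local id>'.

    Assumption: the local id is the handle, as it is in this record.
    """
    _scheme, _host, local_id = identifier.split(":", 2)
    return f"http://hdl.handle.net/{local_id}"

print(get_record_url(OAI_BASE_URL, OAI_IDENTIFIER))
print(handle_url(OAI_IDENTIFIER))
```
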