Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage"

Bibliographic details
Main author: Jesus, Ananda Fernanda de [UNESP]
Publication date: 2023
Other authors: Triques, Maria Ligia; Segundo, Jose Eduardo Santarem [UNESP]; Albuquerque, Ana Cristina de
Document type: Article
Language: eng
Source title: Repositório Institucional da UNESP
Full text: http://dx.doi.org/10.26512/rici.v16.n1.2023.47537
http://hdl.handle.net/11449/245641
Abstract: This study aims to verify the potential of applying Natural Language Processing (NLP) and Machine Learning (ML) techniques to the thematic categorization of scientific articles on cultural heritage, in two situations: one in which the categories are established a priori and one in which they are established afterwards. It is applied research with quantitative and qualitative results. The first corpus consists of scientific articles in Portuguese from a thematic Information Science database, manually selected and categorized; the second consists of scientific articles in English retrieved from the Web of Science and automatically categorized through search strategies and Boolean operators. Both corpora were submitted to two categorization tests, one with a supervised algorithm and one with an unsupervised algorithm. The results show that in both cases the researcher's participation is essential in defining the representativeness of the chosen sample, which in turn affects the precision and accuracy of the applied algorithms. The study highlights the importance of detail and rigor in data pre-processing and of sample size. It stresses, however, that in this study a larger volume of data alone did not guarantee results that were representative of the domain studied, which points to the need for ongoing multidisciplinary discussion and analysis to verify and readjust the sample parameters.
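To illustrate the kind of automatic categorization by search strategies and Boolean operators that the abstract describes for the second corpus, the sketch below assigns a category to a text when the text satisfies a Boolean query. The category names, query terms, and sample sentence are hypothetical and are not taken from the article; this record does not give the authors' actual search strategies.

```python
import re

# Hypothetical categories and Boolean queries (illustrative only, not from the article).
# Each query is a list of AND-groups; each group is a set of OR-alternatives.
QUERIES = {
    "digital_preservation": [{"heritage"}, {"digital", "digitization"}],
    "museums": [{"heritage"}, {"museum", "museums"}],
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def matches(tokens, query):
    """Return True when every AND-group shares at least one term with the tokens."""
    return all(group & tokens for group in query)


def categorize(text, queries=QUERIES):
    """Return the sorted list of categories whose Boolean query the text satisfies."""
    tokens = tokenize(text)
    return sorted(name for name, query in queries.items() if matches(tokens, query))


if __name__ == "__main__":
    sample = "Digitization of museum collections as cultural heritage"
    print(categorize(sample))
```

A text may satisfy several queries at once, so `categorize` returns a list rather than a single label. How overlapping categories were handled in the study is not stated in this record.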
id UNSP_a7be7ff0fad946638ceff4ec4b878ac4
oai_identifier_str oai:repositorio.unesp.br:11449/245641
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
spelling Univ Estadual Paulista, Programa Posgrad Ciencia Informacao, Marilia, SP, Brazil
Univ Estadual Londrina, Programa Posgrad Ciencia Informacao, Londrina, PR, Brazil
dc.title.none.fl_str_mv Natural language processing and machine learning in the categorization of scientific papers: a study around "cultural heritage"
dc.contributor.none.fl_str_mv Universidade Estadual Paulista (UNESP)
Universidade Estadual de Londrina (UEL)
dc.contributor.author.fl_str_mv Jesus, Ananda Fernanda de [UNESP]
Triques, Maria Ligia
Segundo, Jose Eduardo Santarem [UNESP]
Albuquerque, Ana Cristina de
dc.subject.por.fl_str_mv Machine learning
Natural language processing
Neural network algorithm
Cultural heritage
Hierarchical clustering algorithm
dc.date.none.fl_str_mv 2023-07-29T12:00:51Z
2023-07-29T12:00:51Z
2023-01-01
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://dx.doi.org/10.26512/rici.v16.n1.2023.47537
Revista Ibero-americana de Ciencia da Informacao. Brasilia: Univ Brasilia, Dept Ciencia Informacao, v. 16, n. 1, p. 167-184, 2023.
1983-5213
http://hdl.handle.net/11449/245641
10.26512/rici.v16.n1.2023.47537
WOS:000992663100009
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv Revista Ibero-americana De Ciencia Da Informacao
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv 167-184
dc.publisher.none.fl_str_mv Univ Brasilia, Dept Ciencia Informacao
dc.source.none.fl_str_mv Web of Science
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
repository.name.fl_str_mv Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP)