Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories

Bibliographic details
Main author: Semeler, Alexandre Ribas
Publication date: 2023
Other authors: Longoni Oliveira, Arthur; Andrade Pereira, Fabiana; Matiquite, Policarpo
Document type: Article
Language: eng, por
Source title: Encontros Bibli
Full text: https://periodicos.ufsc.br/index.php/eb/article/view/94877
Abstract: Objective: Research data repositories are an evolution of document repositories; they aim to provide access to and preserve all materials used before, during, and after scientific research. In this context, this study conducts an exploratory and descriptive investigation of the international scenario of research data repositories by monitoring the descriptive metadata recorded for them in the Registry of Research Data Repositories (re3data.org). Methods: The method draws on techniques and technologies for descriptive data analysis, information retrieval, data manipulation, and data visualization. It resulted in three Python 3.11 scripts that collect metadata from re3data and convert it for visualization in software such as VOSviewer, together with a dataset containing the metadata descriptions of the repositories and the conversions used for network visualization. The data and scripts created for this methodological experiment are available in the Zenodo data repository (https://doi.org/10.5281/zenodo.7903109). In a collection run on 05/05/2023, 3108 links to repository descriptions were retrieved. The dataset contains a root directory with three subdirectories: (scripts), with the Python code files (.py); (data), with text files in tab-separated values format (.TSV) and a file in Research Information Systems (RIS) format; and (env), with the Python libraries required to run the scripts.
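The collection step described in the Methods can be illustrated with a minimal sketch. It is not the authors' script (those are in the Zenodo deposit); it only shows the general idea of turning a listing of repository descriptions into tab-separated values. The sample listing, its element names (`repository`, `id`, `name`, `link`), the example.org URLs, and the function names `parse_listing` and `to_tsv` are all assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical listing of repository descriptions. The element names and
# URLs are assumptions for illustration, not the real re3data output.
SAMPLE_LISTING = """<list>
  <repository>
    <id>demo-0001</id>
    <name>Example Repository A</name>
    <link href="https://example.org/repository/demo-0001" rel="self"/>
  </repository>
  <repository>
    <id>demo-0002</id>
    <name>Example Repository B</name>
    <link href="https://example.org/repository/demo-0002" rel="self"/>
  </repository>
</list>"""


def parse_listing(xml_text):
    """Return one dict per <repository> element with its id, name and link."""
    root = ET.fromstring(xml_text)
    rows = []
    for repo in root.iter("repository"):
        link = repo.find("link")
        rows.append({
            "id": (repo.findtext("id") or "").strip(),
            "name": (repo.findtext("name") or "").strip(),
            "link": link.get("href", "") if link is not None else "",
        })
    return rows


def to_tsv(rows, fieldnames=("id", "name", "link")):
    """Serialize the rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_listing(SAMPLE_LISTING)
print(to_tsv(rows))
```

In a real run, the listing text would come from the registry rather than from an inline string, and each retrieved link would then be followed to fetch the full repository description.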
Potential for reuse: The research method applied to this dataset is based on automated extraction of re3data metadata and network visualization. After data collection and analysis, the descriptions extracted from re3data support an exploratory and descriptive study in which researchers can visualize the international scenario of research data repositories verified by re3data. This allows ethical monitoring of how many research data repositories are registered in re3data and of their institutions, countries, the languages of the research data, the typology of the repositories and of the deposited data, their themes, areas of knowledge, types of access, licenses, and software used. Other questions can also be raised while interpreting the data. For the community of Librarianship and Information Science professionals, sharing the data and the extraction technique can support the reuse of these research data. Finally, the information gathered makes it possible to assess whether research data repositories are heterogeneous data sources that enable access to and preservation of a wide range of research data types.
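The kind of monitoring and conversion mentioned above could look roughly like the following sketch, which tallies repositories per country and per subject from a TSV file and writes one RIS record per repository so that keywords can be loaded into a tool such as VOSviewer. The column names (`name`, `country`, `subjects`, `url`), the sample rows, the `;` separator, and the choice of the generic RIS type `GEN` are assumptions for illustration; the files in the Zenodo deposit may be organized differently.

```python
import csv
import io
from collections import Counter

# Hypothetical TSV content; the column names and values are assumptions.
SAMPLE_TSV = (
    "name\tcountry\tsubjects\turl\n"
    "Example Repository A\tBRA\tGeosciences; Hydrology\thttps://example.org/a\n"
    "Example Repository B\tDEU\tGeosciences\thttps://example.org/b\n"
    "Example Repository C\tBRA\tLife Sciences\thttps://example.org/c\n"
)


def read_tsv(text):
    """Parse tab-separated text with a header line into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def count_values(rows, column, separator=";"):
    """Count how often each value occurs in a (possibly multi-valued) column."""
    counts = Counter()
    for row in rows:
        for value in row[column].split(separator):
            value = value.strip()
            if value:
                counts[value] += 1
    return counts


def to_ris(row):
    """Build a single RIS record: type, title, keywords, URL, end marker."""
    lines = ["TY  - GEN", f"TI  - {row['name']}"]
    for subject in row["subjects"].split(";"):
        if subject.strip():
            lines.append(f"KW  - {subject.strip()}")
    lines.append(f"UR  - {row['url']}")
    lines.append("ER  - ")
    return "\n".join(lines)


records = read_tsv(SAMPLE_TSV)
by_country = count_values(records, "country")
by_subject = count_values(records, "subjects")

print(by_country.most_common())
print(by_subject.most_common())
print("\n\n".join(to_ris(record) for record in records))
```

Keeping the subjects as repeated `KW` lines is one plausible design choice, since keyword co-occurrence maps are built from such fields; other metadata (licenses, software, access types) could be counted with the same `count_values` helper.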
id UFSC-29_2b7231267c80ad4cbb5ab9ab16ebaf85
oai_identifier_str oai:periodicos.ufsc.br:article/94877
network_acronym_str UFSC-29
network_name_str Encontros Bibli
repository_id_str
dc.title.none.fl_str_mv Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
Python scripts para o web scraping de metadados das descrições sobre os conjuntos de dados do cenário internacional de repositórios de dados de pesquisa
title Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
spellingShingle Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
Semeler, Alexandre Ribas
Data Repository
Research Data
Geosciences
Re3data
Repositório de Dados
Dados de Pesquisa
Geociências
Re3data
title_short Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
title_full Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
title_fullStr Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
title_full_unstemmed Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
title_sort Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
author Semeler, Alexandre Ribas
author_facet Semeler, Alexandre Ribas
Longoni Oliveira, Arthur
Andrade Pereira, Fabiana
Matiquite, Policarpo
author_role author
author2 Longoni Oliveira, Arthur
Andrade Pereira, Fabiana
Matiquite, Policarpo
author2_role author
author
author
dc.contributor.author.fl_str_mv Semeler, Alexandre Ribas
Longoni Oliveira, Arthur
Andrade Pereira, Fabiana
Matiquite, Policarpo
dc.subject.por.fl_str_mv Data Repository
Research Data
Geosciences
Re3data
Repositório de Dados
Dados de Pesquisa
Geociências
Re3data
topic Data Repository
Research Data
Geosciences
Re3data
Repositório de Dados
Dados de Pesquisa
Geociências
Re3data
publishDate 2023
dc.date.none.fl_str_mv 2023-08-04
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://periodicos.ufsc.br/index.php/eb/article/view/94877
10.5007/1518-2924.2023.e94877
url https://periodicos.ufsc.br/index.php/eb/article/view/94877
identifier_str_mv 10.5007/1518-2924.2023.e94877
dc.language.iso.fl_str_mv eng
por
language eng
por
dc.relation.none.fl_str_mv https://periodicos.ufsc.br/index.php/eb/article/view/94877/53958
https://periodicos.ufsc.br/index.php/eb/article/view/94877/53947
https://periodicos.ufsc.br/index.php/eb/article/view/94877/53948
https://periodicos.ufsc.br/index.php/eb/article/view/94877/53949
dc.rights.driver.fl_str_mv https://creativecommons.org/licenses/by/4.0
info:eu-repo/semantics/openAccess
rights_invalid_str_mv https://creativecommons.org/licenses/by/4.0
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
application/pdf
application/pdf
application/pdf
dc.publisher.none.fl_str_mv Departamento de Ciência da Informação – UFSC
publisher.none.fl_str_mv Departamento de Ciência da Informação – UFSC
dc.source.none.fl_str_mv Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação; Vol. 28 (2023): Innovation, Technology and Sustainability; 1-8
Encontros Bibli: revista electrónica de bibliotecología y ciencias de la información.; Vol. 28 (2023): Innovación, Tecnología y Sustentabilidad; 1-8
Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação; v. 28 (2023): Inovação, Tecnologia e Sustentabilidade; 1-8
1518-2924
reponame:Encontros Bibli
instname:Universidade Federal de Santa Catarina (UFSC)
instacron:UFSC
instname_str Universidade Federal de Santa Catarina (UFSC)
instacron_str UFSC
institution UFSC
reponame_str Encontros Bibli
collection Encontros Bibli
repository.name.fl_str_mv Encontros Bibli - Universidade Federal de Santa Catarina (UFSC)
repository.mail.fl_str_mv encontrosbibli@contato.ufsc.br||portaldeperiodicos.bu@contato.ufsc.br
_version_ 1797067779878158336