Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories
Main author: | Semeler, Alexandre Ribas |
---|---|
Publication date: | 2023 |
Other authors: | Longoni Oliveira, Arthur; Andrade Pereira, Fabiana; Matiquite, Policarpo |
Document type: | Article |
Language: | eng por |
Source title: | Encontros Bibli |
Full text: | https://periodicos.ufsc.br/index.php/eb/article/view/94877 |
Abstract: | Objective: Research data repositories are an evolution of document repositories; they aim to provide access to and preserve all materials used before, during, and after scientific research. In this context, this study conducts an exploratory and descriptive investigation of the international scenario of research data repositories by monitoring the descriptive metadata of the international register of such repositories, the Registry of Research Data Repositories (re3data.org). Methods: The method required applying techniques and technologies for descriptive data analysis, information retrieval, and data manipulation, analysis, and visualization. The result is three Python 3.11 scripts that collect metadata from re3data and convert it for visualization in software such as VOSviewer, together with a dataset containing the repositories' metadata descriptions and the conversions for network visualization. The data and scripts were created for this methodological experiment and are available in the Zenodo data repository (https://doi.org/10.5281/zenodo.7903109). In a collection run on 05/05/2023, 3108 links to repository descriptions were retrieved. The dataset has a root directory with three subdirectories: (scripts), with the Python code (.py); (data), with text files of tab-separated values (.TSV) and a RIS (Research Information Systems) file; and (env), with the Python libraries required to run the scripts.
Potential for reuse: The research method applied to this dataset is based on automated extraction of re3data metadata and network visualization. After data collection and analysis, the extracted descriptions can support an exploratory and descriptive study of the international scenario of research data repositories verified by re3data. This allows ethical monitoring of how many research data repositories are registered in re3data and of their subject areas, institutions, countries, research data languages, repository and deposited data types, themes, areas of knowledge, access types, licenses, and software. Other questions can be raised while interpreting the data. This reinforces the value of the dataset for the community of Library and Information Science professionals: sharing the data and the extraction technique can support the reuse of these research data. Finally, the information gathered about research data repositories makes it possible to assess whether they are heterogeneous data sources that enable access to and preservation of a wide range of research data types. |
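
The collect-and-convert workflow described in the abstract can be illustrated with a minimal sketch. This is not the authors' code (their scripts are in the Zenodo deposit); it only assumes that a repository listing arrives as XML with one `repository` element per entry, each holding an `id`, a `name`, and a `link` with an `href` attribute. The sample listing, the `example.org` URLs, the function names, and the choice of the generic RIS type `GEN` are all hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical listing; the element layout is an assumption, not a re3data response.
SAMPLE_LISTING = """<list>
  <repository>
    <id>demo-001</id>
    <name>Example Geoscience Archive</name>
    <link href="https://example.org/repository/demo-001" rel="self"/>
  </repository>
  <repository>
    <id>demo-002</id>
    <name>Example Survey Data Bank</name>
    <link href="https://example.org/repository/demo-002" rel="self"/>
  </repository>
</list>"""

FIELDS = ["id", "name", "link"]


def parse_listing(xml_text):
    """Extract id, name, and description link for every repository entry."""
    root = ET.fromstring(xml_text)
    rows = []
    for repo in root.iter("repository"):
        link = repo.find("link")
        rows.append({
            "id": (repo.findtext("id") or "").strip(),
            "name": (repo.findtext("name") or "").strip(),
            "link": link.get("href", "") if link is not None else "",
        })
    return rows


def to_tsv(rows):
    """Serialize the rows as tab-separated values with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_ris(rows):
    """Serialize the rows as RIS-style records (tag, two spaces, hyphen, value)."""
    lines = []
    for row in rows:
        lines += [
            "TY  - GEN",  # generic record type, chosen here for illustration
            f"ID  - {row['id']}",
            f"TI  - {row['name']}",
            f"UR  - {row['link']}",
            "ER  - ",  # end-of-record marker
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    repositories = parse_listing(SAMPLE_LISTING)
    print(to_tsv(repositories), end="")
    print(to_ris(repositories), end="")
```

Keeping parsing separate from serialization means the same rows can feed both the TSV and the RIS output; a real harvest would replace `SAMPLE_LISTING` with the text fetched from the registry.
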
id |
UFSC-29_2b7231267c80ad4cbb5ab9ab16ebaf85 |
---|---|
oai_identifier_str |
oai:periodicos.ufsc.br:article/94877 |
network_acronym_str |
UFSC-29 |
network_name_str |
Encontros Bibli |
repository_id_str |
|
spelling |
Copyright (c) 2023 Alexandre Ribas Semeler, Arthur Longoni Oliveira, Fabiana Andrade Pereira, Policarpo Matiquite; timestamp: 2024-03-08T12:59:44Z; journal home: https://periodicos.ufsc.br/index.php/eb/index; OAI endpoint: https://periodicos.ufsc.br/index.php/eb/oai |
dc.title.none.fl_str_mv |
Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories / Python scripts para o web scraping de metadados das descrições sobre os conjuntos de dados do cenário internacional de repositórios de dados de pesquisa |
title |
Python scripts for web scraping metadata from descriptions of the international scenario of research data repositories |
author |
Semeler, Alexandre Ribas |
author_facet |
Semeler, Alexandre Ribas; Longoni Oliveira, Arthur; Andrade Pereira, Fabiana; Matiquite, Policarpo |
author_role |
author |
author2 |
Longoni Oliveira, Arthur; Andrade Pereira, Fabiana; Matiquite, Policarpo |
author2_role |
author author author |
dc.contributor.author.fl_str_mv |
Semeler, Alexandre Ribas; Longoni Oliveira, Arthur; Andrade Pereira, Fabiana; Matiquite, Policarpo |
dc.subject.por.fl_str_mv |
Data Repository; Research Data; Geosciences; Re3data; Repositório de Dados; Dados de Pesquisa; Geociências; Re3data |
topic |
Data Repository; Research Data; Geosciences; Re3data; Repositório de Dados; Dados de Pesquisa; Geociências; Re3data |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-08-04 |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article; info:eu-repo/semantics/publishedVersion |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://periodicos.ufsc.br/index.php/eb/article/view/94877; 10.5007/1518-2924.2023.e94877 |
url |
https://periodicos.ufsc.br/index.php/eb/article/view/94877 |
identifier_str_mv |
10.5007/1518-2924.2023.e94877 |
dc.language.iso.fl_str_mv |
eng por |
language |
eng por |
dc.relation.none.fl_str_mv |
https://periodicos.ufsc.br/index.php/eb/article/view/94877/53958; https://periodicos.ufsc.br/index.php/eb/article/view/94877/53947; https://periodicos.ufsc.br/index.php/eb/article/view/94877/53948; https://periodicos.ufsc.br/index.php/eb/article/view/94877/53949 |
dc.rights.driver.fl_str_mv |
https://creativecommons.org/licenses/by/4.0; info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
https://creativecommons.org/licenses/by/4.0 |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf application/pdf application/pdf application/pdf |
dc.publisher.none.fl_str_mv |
Departamento de Ciência da Informação – UFSC |
publisher.none.fl_str_mv |
Departamento de Ciência da Informação – UFSC |
dc.source.none.fl_str_mv |
Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação; Vol. 28 (2023): Innovation, Technology and Sustainability; 1-8 / Encontros Bibli: revista electrónica de bibliotecología y ciencias de la información.; Vol. 28 (2023): Innovación, Tecnología y Sustentabilidad; 1-8 / Encontros Bibli: revista eletrônica de biblioteconomia e ciência da informação; v. 28 (2023): Inovação, Tecnologia e Sustentabilidade; 1-8 / 1518-2924 / reponame:Encontros Bibli / instname:Universidade Federal de Santa Catarina (UFSC) / instacron:UFSC |
instname_str |
Universidade Federal de Santa Catarina (UFSC) |
instacron_str |
UFSC |
institution |
UFSC |
reponame_str |
Encontros Bibli |
collection |
Encontros Bibli |
repository.name.fl_str_mv |
Encontros Bibli - Universidade Federal de Santa Catarina (UFSC) |
repository.mail.fl_str_mv |
encontrosbibli@contato.ufsc.br||portaldeperiodicos.bu@contato.ufsc.br |
_version_ |
1797067779878158336 |