Sketched reference databases for genome-based taxonomy and comparative genomics

Bibliographic details
Lead author: Sánchez-Reyes, A.
Publication date: 2024
Other authors: Fernández-López, M. G.
Document type: Article
Language: eng
Source title: Brazilian Journal of Biology
Full text: http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1519-69842024000100405
Abstract: The analysis of curated genomic, metagenomic, and proteomic data is central to biology, medicine, education, and bioinformatics. Although such data are usually hosted in raw form on free international repositories, full access demands considerable computing power and disk storage from the individual user. This study offers the scientific community a comprehensive set of microbial genomic and proteomic reference databases in an accessible, easy-to-use form and demonstrates their advantages and usefulness. We also present a case study that applies the sketched data to determine the overall genomic coherence between two members of the family Brucellaceae; the results suggest that they belong to the same genomospecies and persist as discrete ecotypes. A representative set of genomes, proteomes (from type material), and metagenomes covering the major groups of Bacteria, Archaea, viruses, and Fungi was collected directly from the NCBI Assembly database and the Genome Taxonomy Database (GTDB). Sketched databases were then created and stored as compact reduced representations using the MinHash algorithm implemented in the Mash software. The resulting dataset condenses more than 133 GB of disk space into 883.25 MB and represents 125,110 genomic/proteomic records from eight informative contexts, prefiltered so that they are accessible and usable with limited computational resources. Potential uses of these sketched databases are discussed, including but not limited to microbial species delimitation; estimation of genomic distances and genomic novelty; pairwise comparisons between proteomes, genomes, and metagenomes; and exploration and selection of phylogenetic neighbors.
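The abstract names MinHash sketching as the way the databases are reduced and compared. As a rough illustration of that idea only, the following Python sketch builds a bottom-s MinHash sketch from k-mers, estimates the Jaccard index between two sketches, and converts it to a Mash-style distance, D = -(1/k) ln(2j / (1 + j)). It is not the authors' pipeline and not Mash's implementation: the function names are hypothetical, SHA-1 stands in for the hash function Mash actually uses, k-mers are not canonicalized, and the values of k and s are arbitrary.

```python
import hashlib
import math


def kmers(seq, k):
    """Return the set of all k-mers in a sequence (no strand canonicalization)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(seq, k, s):
    """Bottom-s MinHash sketch: the s smallest 64-bit hashes of the k-mers.

    SHA-1 is an assumption made for portability; it is only a stand-in.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(km.encode()).digest()[:8], "big")
        for km in kmers(seq, k)
    }
    return sorted(hashes)[:s]


def jaccard_estimate(a, b, s):
    """Estimate the Jaccard index from two bottom-s sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both inputs.
    """
    merged = sorted(set(a) | set(b))[:s]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a Mash-style distance for k-mer size k."""
    if j == 0:
        return 1.0
    return max(0.0, -math.log(2 * j / (1 + j)) / k)


if __name__ == "__main__":
    k, s = 4, 10  # illustrative values only
    ref = sketch("ACGTACGTTGCAACGT", k, s)
    qry = sketch("ACGTACGTTGCATCGT", k, s)
    j = jaccard_estimate(ref, qry, s)
    print(j, mash_distance(j, k))
```

Because each genome is represented by at most s hash values regardless of its length, a collection of sketches can be far smaller than the sequences it summarizes, which is the kind of reduction the abstract reports.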
id IIE-1_823088614c10822f81e5bac870fd5c3d
oai_identifier_str oai:scielo:S1519-69842024000100405
network_acronym_str IIE-1
network_name_str Brazilian Journal of Biology
repository_id_str
author Sánchez-Reyes,A.
author_facet Sánchez-Reyes,A.
Fernández-López,M. G.
author_role author
author2 Fernández-López,M. G.
author2_role author
dc.contributor.author.fl_str_mv Sánchez-Reyes,A.
Fernández-López,M. G.
dc.subject.por.fl_str_mv microbial Mash database
genomic distance
genome containment
type material
microbial taxonomy
publishDate 2024
dc.date.none.fl_str_mv 2024-01-01
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1519-69842024000100405
url http://old.scielo.br/scielo.php?script=sci_arttext&pid=S1519-69842024000100405
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 10.1590/1519-6984.256673
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv text/html
dc.publisher.none.fl_str_mv Instituto Internacional de Ecologia
publisher.none.fl_str_mv Instituto Internacional de Ecologia
dc.source.none.fl_str_mv Brazilian Journal of Biology v.84 2024
reponame:Brazilian Journal of Biology
instname:Instituto Internacional de Ecologia (IIE)
instacron:IIE
instname_str Instituto Internacional de Ecologia (IIE)
instacron_str IIE
institution IIE
reponame_str Brazilian Journal of Biology
collection Brazilian Journal of Biology
repository.name.fl_str_mv Brazilian Journal of Biology - Instituto Internacional de Ecologia (IIE)
repository.mail.fl_str_mv bjb@bjb.com.br||bjb@bjb.com.br
_version_ 1752129891843702784