Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets

Bibliographic details
Main author: Amgarten, Deyvid Emanuel
Publication date: 2022
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/95/95131/tde-17022022-091454/
Abstract: Environmental viruses are extremely diverse and abundant in the biosphere. Several studies have shown that prokaryotic viruses (or simply phages) are major players in determining biogeochemical cycles in oceans and in driving microbial diversification. Besides this ecological role, phages may also be used for clinical purposes, since they can kill bacterial cells and terminate infections. A crucial step in this process is the isolation of new phages that can target a specific bacterial pathogen. Researchers therefore employ screening techniques to find and isolate pathogen-specific phages from environmental samples, which are a rich source of new phages. However, this task remains mostly exploratory and laborious if the researcher has no detailed information about the sample and its potential viral diversity. With this problem in mind, we propose the development of a bioinformatic workflow to identify genomic sequences belonging to phages in environmental datasets, and to predict the hosts of the identified phages based on their genomic sequences. To achieve this goal, we implemented a random forest classifier and created the tool named MARVEL (Metagenomic Analyses and Retrieval of Viral Elements), which is able to efficiently predict phage genomic sequences in bins generated from whole-community metagenomic short reads. We also developed a toolkit, named vHULK (Viral Host Unveiling Kit), which can predict a phage's host given only its genome as input. vHULK achieves higher accuracy than available tools and can predict both host species and genus in a multiclass prediction setting. Data generated by applying both tools to public and private composting metagenomic datasets are used for the recovery, annotation, and characterization of phage diversity in composting environments. Both tools are publicly available through a GitHub repository: https://github.com/LaboratorioBioinformatica/.
id USP_43d81b62a032a2a85d3125f04a841124
oai_identifier_str oai:teses.usp.br:tde-17022022-091454
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
spelling Abstract (Portuguese version, translated): Environmental viruses are extremely diverse and abundant in the biosphere. Studies have shown that viruses infecting prokaryotes (or simply phages) are key drivers of biogeochemical cycles in oceans and significantly influence the diversification of their hosts. Beyond this ecological role, phages are also being used for clinical purposes thanks to their ability to infect bacteria and end bacterial infections. A crucial step for this application is the isolation of phages that target a given bacterial pathogen of interest. To do so, researchers usually turn to environmental samples in a costly trial-and-error process of experimental isolation. Having relevant information about the phage diversity in a sample, as well as its potential hosts, could help in this process. Therefore, in this thesis we propose the development of a bioinformatics pipeline for recovering phage genomes from environmental samples, as well as for predicting the hosts of these genomes. To achieve this goal, we trained a random forest classifier to distinguish phage sequences and implemented it in the tool called MARVEL. We also developed the tool called vHULK, which is able to predict bacterial hosts given the phage genome sequence. Both tools show high accuracy and performance when compared with the state of the art in each prediction problem. Results generated by applying the tools developed in this thesis to composting and soil metagenomic datasets are presented as a proof of concept and case study. Both tools are available in the public repository: https://github.com/LaboratorioBioinformatica/.
dc.title.none.fl_str_mv Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
Predição em sequências de vírus de procariotos através da aplicação de técnicas de aprendizado de máquina em dados metagenômicos
title Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
spellingShingle Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
Amgarten, Deyvid Emanuel
Aprendizado de máquina
Bacteriófagos
Fagos
Host prediction
Machine learning
Metagenômica
Metagenomics
Phage prediction
Phages
Predição de hospedeiro viral
Prokaryotic viruses
Virology
Virus
Vírus
Vírus ambientais
Vírus de procariotos
title_short Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
title_full Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
title_fullStr Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
title_full_unstemmed Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
title_sort Machine learning prediction in genomic sequences of prokaryotic viruses from metagenomic datasets
author Amgarten, Deyvid Emanuel
author_facet Amgarten, Deyvid Emanuel
author_role author
dc.contributor.none.fl_str_mv Setubal, João Carlos
Silva, Aline Maria da
dc.contributor.author.fl_str_mv Amgarten, Deyvid Emanuel
dc.subject.por.fl_str_mv Aprendizado de máquina
Bacteriófagos
Fagos
Host prediction
Machine learning
Metagenômica
Metagenomics
Phage prediction
Phages
Predição de hospedeiro viral
Prokaryotic viruses
Virology
Virus
Vírus
Vírus ambientais
Vírus de procariotos
topic Aprendizado de máquina
Bacteriófagos
Fagos
Host prediction
Machine learning
Metagenômica
Metagenomics
Phage prediction
Phages
Predição de hospedeiro viral
Prokaryotic viruses
Virology
Virus
Vírus
Vírus ambientais
Vírus de procariotos
description Environmental viruses are extremely diverse and abundant in the biosphere. Several studies have shown that prokaryotic viruses (or simply phages) are major players in determining biogeochemical cycles in oceans and in driving microbial diversification. Besides this ecological role, phages may also be used for clinical purposes, since they can kill bacterial cells and terminate infections. A crucial step in this process is the isolation of new phages that can target a specific bacterial pathogen. Researchers therefore employ screening techniques to find and isolate pathogen-specific phages from environmental samples, which are a rich source of new phages. However, this task remains mostly exploratory and laborious if the researcher has no detailed information about the sample and its potential viral diversity. With this problem in mind, we propose the development of a bioinformatic workflow to identify genomic sequences belonging to phages in environmental datasets, and to predict the hosts of the identified phages based on their genomic sequences. To achieve this goal, we implemented a random forest classifier and created the tool named MARVEL (Metagenomic Analyses and Retrieval of Viral Elements), which is able to efficiently predict phage genomic sequences in bins generated from whole-community metagenomic short reads. We also developed a toolkit, named vHULK (Viral Host Unveiling Kit), which can predict a phage's host given only its genome as input. vHULK achieves higher accuracy than available tools and can predict both host species and genus in a multiclass prediction setting. Data generated by applying both tools to public and private composting metagenomic datasets are used for the recovery, annotation, and characterization of phage diversity in composting environments. Both tools are publicly available through a GitHub repository: https://github.com/LaboratorioBioinformatica/.
publishDate 2022
dc.date.none.fl_str_mv 2022-01-28
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/95/95131/tde-17022022-091454/
url https://www.teses.usp.br/teses/disponiveis/95/95131/tde-17022022-091454/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090791128170496