Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas

Bibliographic details
Main author: Andrighetti, Tahila [UNESP]
Publication date: 2015
Document type: Dissertation
Language: Portuguese (por)
Source title: Repositório Institucional da UNESP
Full text: http://hdl.handle.net/11449/132017
http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf
Abstract: Microbial communities play a crucial role in all ecosystems on Earth because they metabolize essential compounds. Given this role, they are investigated in medicine, biotechnology, ecology, and food science, among other fields. However, only 1% of all known microorganism species can be cultivated in vitro. Unravelling their functions and taxonomic classification demands the development of new approaches. With the advent of new sequencing strategies, the entire genome of the microorganisms in a given habitat can be experimentally extracted, but the fragments obtained are small (<1500 bp), and processing the data remains a huge challenge. The most widely used metagenomic analysis tools classify sequences by homology. However, the computational time grows exponentially as the read length decreases. There is an evident need for alternative methods that can analyze metagenomic data quickly and accurately. This study proposes a new method for identifying bacterial sequences in metagenomic data. The genomes of 2164 bacterial strains were obtained from GenBank and distributed into test and control sets. Each set was randomly fragmented into sequences of 64, 128, 256, 512, 1024, 2048, and 4096 base pairs. The sequence organization measures applied to the reads were GC content, dinucleotide abundance, and the entropy of diplets, triplets, and tetraplets. The mean and standard deviation of the control-sequence values were calculated for each bacterial species, genus, and family. Combinations of genomic signatures and entropy were used to classify bacterial sequences by family, genus, and species. The performance of the proposed methodology was determined by measuring sensitivity, specificity, accuracy, and harmonic mean for the test set. The results indicated that GC content performed best among the signatures investigated. We also considered combinations of features, the combination considering GC ...
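Illustrative note: the measures named in the abstract (GC content, dinucleotide abundance, and diplet/triplet/tetraplet entropy) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the dissertation's implementation: the abstract does not give the exact formulas, so the observed/expected ratio used for dinucleotide abundance, the use of overlapping k-plets with Shannon entropy in bits, and the `closest_taxon` rule (with its `profiles` mapping of taxon to control-set mean and standard deviation) are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def kplet_entropy(seq: str, k: int) -> float:
    """Shannon entropy (bits) of overlapping k-plet frequencies.

    k = 2, 3, 4 correspond to diplets, triplets, and tetraplets.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def dinucleotide_abundance(seq: str) -> dict[str, float]:
    """Assumed definition: observed/expected ratio f(xy) / (f(x) * f(y))."""
    seq = seq.upper()
    mono = Counter(seq)
    di = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    n_mono, n_di = len(seq), len(seq) - 1
    return {
        xy: (c / n_di) / ((mono[xy[0]] / n_mono) * (mono[xy[1]] / n_mono))
        for xy, c in di.items()
    }


def closest_taxon(value: float, profiles: dict[str, tuple[float, float]]) -> str:
    """Hypothetical rule: pick the taxon whose control mean is nearest,
    measured in units of that taxon's control standard deviation."""
    return min(profiles, key=lambda t: abs(value - profiles[t][0]) / profiles[t][1])
```

For example, `closest_taxon(gc_content(read), profiles)` would assign a read to a family, genus, or species from its GC content alone, given per-taxon `(mean, std)` pairs computed on the control set. How the dissertation actually combines signatures is not stated in the truncated abstract.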
id UNSP_51b277c45bbb4d520699fe7efe41ab14
oai_identifier_str oai:repositorio.unesp.br:11449/132017
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
spelling Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); FAPESP: 2013/1517-4; 2024-01-08T06:28:33Z; Repositório Institucional; PUB; http://repositorio.unesp.br/oai/request; opendoar:2946; 2024-08-05T22:28:24.973581; false
dc.title.none.fl_str_mv Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas
author Andrighetti, Tahila [UNESP]
author_role author
dc.contributor.none.fl_str_mv Rybarczyk Filho, José Luiz [UNESP]
Lemke, Ney [UNESP]
Universidade Estadual Paulista (Unesp)
dc.contributor.author.fl_str_mv Andrighetti, Tahila [UNESP]
dc.subject.por.fl_str_mv Nucleotides
Bioinformatics
Entropy
Human genome
Microorganisms
Nucleotide sequencing
Nucleotide sequence
publishDate 2015
dc.date.none.fl_str_mv 2015-12-10T14:23:05Z
2015-12-10T14:23:05Z
2015-02-27
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv ANDRIGHETTI, Tahila. Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas. 2015. 53 f. Dissertação (mestrado) - Universidade Estadual Paulista Júlio de Mesquita Filho, Instituto de Biociências de Botucatu, 2015.
http://hdl.handle.net/11449/132017
000851881
http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf
33004064026P9
7977035910952141
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv 53 f.
application/pdf
dc.publisher.none.fl_str_mv Universidade Estadual Paulista (Unesp)
dc.source.none.fl_str_mv Aleph
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
instname_str Universidade Estadual Paulista (UNESP)
instacron_str UNESP
institution UNESP
reponame_str Repositório Institucional da UNESP
collection Repositório Institucional da UNESP
repository.name.fl_str_mv Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP)
repository.mail.fl_str_mv
_version_ 1808129429528903680