Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas
Main author: | Andrighetti, Tahila [UNESP] |
---|---|
Publication date: | 2015 |
Document type: | Master's thesis |
Language: | por |
Source title: | Repositório Institucional da UNESP |
Full text: | http://hdl.handle.net/11449/132017 http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf |
Abstract: | Microbial communities play a crucial role in all ecosystems on Earth because they metabolize essential compounds. Given this role, they are investigated in medicine, biotechnology, ecology, and food science, among other fields. However, only 1% of all known microorganism species can be cultivated in vitro. Unravelling their functions and taxonomic classification demands the development of new approaches. With the advent of new sequencing strategies, the entire genome of the microorganisms in a given habitat can be experimentally extracted, but the fragments obtained are small (<1500 bp), and data processing remains a huge challenge. The most widely used metagenomic analysis tools classify sequences by homology. However, the computational time grows exponentially as the read length decreases. There is an evident need for alternative methods that can analyze metagenomic data quickly and accurately. This study proposes a new method for identifying bacterial sequences in metagenomic data. The genomes of 2164 bacterial strains were obtained from GenBank and distributed into test and control sets. Each set was randomly fragmented into sequences of 64, 128, 256, 512, 1024, 2048, and 4096 base pairs. The sequence organization measures applied to the reads were GC content, dinucleotide abundance, and the entropy of diplets, triplets, and tetraplets. The mean and standard deviation of the control sequence values were calculated for each species, genus, and family of bacteria. Genomic signatures and entropy measures were combined to classify bacterial sequences into family, genus, and species. The performance of the proposed methodology was determined by measuring sensitivity, specificity, accuracy, and harmonic mean on the test set. The results indicated that GC content performed best among the signatures investigated. We also considered combinations of features, the combination considering GC ... |
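The signatures named in the abstract (GC content and the entropy of diplets, triplets, and tetraplets) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, the use of overlapping k-mer windows and base-2 Shannon entropy is an assumption, and the `nearest_taxon` rule (smallest distance to the control mean in units of the control standard deviation) is only one plausible way to use the per-taxon mean and standard deviation that the abstract mentions.

```python
import math
from collections import Counter

def gc_content(seq: str) -> float:
    """Fraction of G and C among the A/C/G/T bases of a read."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt

def kmer_entropy(seq: str, k: int) -> float:
    """Shannon entropy, in bits, of the k-mer distribution of a read.

    Assumption: k-mers are taken from overlapping windows, and windows
    containing anything other than A/C/G/T are skipped.
    """
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    kmers = [kmer for kmer in kmers if set(kmer) <= set("ACGT")]
    if not kmers:
        return 0.0
    total = len(kmers)
    counts = Counter(kmers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def nearest_taxon(value: float, profiles: dict) -> str:
    """Return the taxon whose control mean is closest to `value`.

    `profiles` maps a taxon name to (mean, std) from the control set.
    Hypothetical decision rule: distance is measured in units of std.
    """
    def distance(item):
        mean, std = item[1]
        return abs(value - mean) / std if std > 0 else float("inf")
    return min(profiles.items(), key=distance)[0]
```

For a read `r`, `kmer_entropy(r, 2)`, `kmer_entropy(r, 3)`, and `kmer_entropy(r, 4)` give the diplet, triplet, and tetraplet entropies, and `nearest_taxon(gc_content(r), profiles)` assigns the read using GC content alone, given a `profiles` dictionary built from control reads.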
id |
UNSP_51b277c45bbb4d520699fe7efe41ab14 |
---|---|
oai_identifier_str |
oai:repositorio.unesp.br:11449/132017 |
network_acronym_str |
UNSP |
network_name_str |
Repositório Institucional da UNESP |
repository_id_str |
2946 |
spelling |
Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas
Nucleotídeos; Bioinformática; Entropia; Genoma humano; Micro-organismos; Seqüenciamento de nucleotídeo; Nucleotide sequence
Microbial communities play crucial roles in all of Earth's ecosystems, since they metabolize essential compounds. This characteristic makes them important research targets in fields such as medicine, environmental science, food science, and biotechnology. However, only 1% of all known microorganism species can be cultivated in vitro, which hinders the study of their functions and taxonomic classification. With the emergence of new sequencing technologies, the entire genome of the microorganisms in a habitat can be experimentally extracted, but only in small fragments (<1500 bp), which makes data processing a major challenge. The most widely used metagenomic analysis tools classify sequences by homology. However, computational time increases exponentially as fragment size decreases. This shows an evident need for alternative methods that can analyze metagenomic data quickly and accurately. This study proposes a new method for identifying bacterial sequences that analyzes such data. The genomes of 2164 bacterial strains were obtained from GenBank and divided into test and control groups. Each group was randomly fragmented into sequences of 64, 128, 256, 512, 1024, 2048, and 4096 base pairs. The sequence organization measures applied to the fragments were GC content, dinucleotide abundance, and the entropies of diplets, triplets, and tetraplets. The mean and standard deviation of the control sequence values were calculated for each species, genus, and family of bacteria. Combinations of measures were used to classify the sequences into families, genera, and species. The performance of the methodology was determined by measures of sensitivity, specificity, accuracy, and harmonic mean for sets of...
Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); FAPESP: 2013/1517-4
Universidade Estadual Paulista (Unesp)
Rybarczyk Filho, José Luiz [UNESP]; Lemke, Ney [UNESP]
Universidade Estadual Paulista (Unesp)
Andrighetti, Tahila [UNESP]
2015-12-10T14:23:05Z; 2015-12-10T14:23:05Z; 2015-02-27
info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis
53 f.; application/pdf
ANDRIGHETTI, Tahila. Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas. 2015. 53 f. Dissertação (mestrado) - Universidade Estadual Paulista Júlio de Mesquita Filho, Instituto de Biociências de Botucatu, 2015.
http://hdl.handle.net/11449/132017; 000851881; http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf; 33004064026P9; 7977035910952141
Aleph; reponame:Repositório Institucional da UNESP; instname:Universidade Estadual Paulista (UNESP); instacron:UNESP
por
info:eu-repo/semantics/openAccess
2024-01-08T06:28:33Z
oai:repositorio.unesp.br:11449/132017
Repositório Institucional; PUB; http://repositorio.unesp.br/oai/request; opendoar:2946; 2024-08-05T22:28:24.973581; Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP); false |
dc.title.none.fl_str_mv |
Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas |
author |
Andrighetti, Tahila [UNESP] |
author_facet |
Andrighetti, Tahila [UNESP] |
author_role |
author |
dc.contributor.none.fl_str_mv |
Rybarczyk Filho, José Luiz [UNESP] Lemke, Ney [UNESP] Universidade Estadual Paulista (Unesp) |
dc.contributor.author.fl_str_mv |
Andrighetti, Tahila [UNESP] |
dc.subject.por.fl_str_mv |
Nucleotídeos Bioinformática Entropia Genoma humano Micro-organismos Seqüenciamento de nucleotídeo Nucleotide sequence |
topic |
Nucleotídeos Bioinformática Entropia Genoma humano Micro-organismos Seqüenciamento de nucleotídeo Nucleotide sequence |
description |
Microbial communities play a crucial role in all ecosystems on Earth because they metabolize essential compounds. Given this role, they are investigated in medicine, biotechnology, ecology, and food science, among other fields. However, only 1% of all known microorganism species can be cultivated in vitro. Unravelling their functions and taxonomic classification demands the development of new approaches. With the advent of new sequencing strategies, the entire genome of the microorganisms in a given habitat can be experimentally extracted, but the fragments obtained are small (<1500 bp), and data processing remains a huge challenge. The most widely used metagenomic analysis tools classify sequences by homology. However, the computational time grows exponentially as the read length decreases. There is an evident need for alternative methods that can analyze metagenomic data quickly and accurately. This study proposes a new method for identifying bacterial sequences in metagenomic data. The genomes of 2164 bacterial strains were obtained from GenBank and distributed into test and control sets. Each set was randomly fragmented into sequences of 64, 128, 256, 512, 1024, 2048, and 4096 base pairs. The sequence organization measures applied to the reads were GC content, dinucleotide abundance, and the entropy of diplets, triplets, and tetraplets. The mean and standard deviation of the control sequence values were calculated for each species, genus, and family of bacteria. Genomic signatures and entropy measures were combined to classify bacterial sequences into family, genus, and species. The performance of the proposed methodology was determined by measuring sensitivity, specificity, accuracy, and harmonic mean on the test set. The results indicated that GC content performed best among the signatures investigated. We also considered combinations of features, the combination considering GC ... |
publishDate |
2015 |
dc.date.none.fl_str_mv |
2015-12-10T14:23:05Z 2015-12-10T14:23:05Z 2015-02-27 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
ANDRIGHETTI, Tahila. Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas. 2015. 53 f. Dissertação (mestrado) - Universidade Estadual Paulista Júlio de Mesquita Filho, Instituto de Biociências de Botucatu, 2015. http://hdl.handle.net/11449/132017 000851881 http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf 33004064026P9 7977035910952141 |
identifier_str_mv |
ANDRIGHETTI, Tahila. Ferramenta computacional para identificação de micro-organismos com base em assinaturas genômicas. 2015. 53 f. Dissertação (mestrado) - Universidade Estadual Paulista Júlio de Mesquita Filho, Instituto de Biociências de Botucatu, 2015. 000851881 33004064026P9 7977035910952141 |
url |
http://hdl.handle.net/11449/132017 http://www.athena.biblioteca.unesp.br/exlibris/bd/cathedra/11-11-2015/000851881.pdf |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
53 f. application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Estadual Paulista (Unesp) |
publisher.none.fl_str_mv |
Universidade Estadual Paulista (Unesp) |
dc.source.none.fl_str_mv |
Aleph reponame:Repositório Institucional da UNESP instname:Universidade Estadual Paulista (UNESP) instacron:UNESP |
instname_str |
Universidade Estadual Paulista (UNESP) |
instacron_str |
UNESP |
institution |
UNESP |
reponame_str |
Repositório Institucional da UNESP |
collection |
Repositório Institucional da UNESP |
repository.name.fl_str_mv |
Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP) |
repository.mail.fl_str_mv |
|
_version_ |
1808129429528903680 |