Uso da álgebra linear para análise de similaridades e extração de padrões em sequências protéicas (Use of linear algebra for similarity analysis and pattern extraction in protein sequences)

Bibliographic details
Main author: Braulio Roberto Goncalves Marinho Couto
Publication date: 2010
Document type: Thesis
Language: Portuguese (por)
Source title: Repositório Institucional da UFMG
Full text: http://hdl.handle.net/1843/BUOS-8L4RSA
Abstract: Extracting patterns from protein sequence data is one of the challenges of computational biology. Here we use linear algebra methods and logistic regression models to analyze sequences without requiring multiple alignments. First, we treat a biomolecular sequence as a complex written language and recode it as a p-peptide frequency vector, using a sliding window over all overlapping p-peptides. With 20 amino acids, this produces a vector of dimension 20^p, where p is the word size. Singular value decomposition (SVD) and/or logistic regression models are then applied to the data to extract patterns or to visualize high-dimensional data. Spearman correlation (r) was used to evaluate the association between the statistics used by BLAST and the similarity metrics used with SVD. Euclidean distance was negatively correlated with bit score (r>-0.6) and positively correlated with E value (r>+0.7). Cosine was negatively correlated with E value (r>-0.7) and positively correlated with bit score (r>+0.8). In addition, we compared the edit distance between each pair of sequences with the corresponding cosines and Euclidean distances from SVD. The correlation between cosine and edit distance was -0.32 (P < 0.01), and between Euclidean distance and edit distance it was +0.70 (P < 0.01). We also evaluated the ability of SVD to classify sequences according to their categories. With a 3-peptide frequency matrix, all queries were correctly classified (accuracy = 100%). We propose a biological interpretation of the SVD: the singular value spectrum, visualized as a scree plot, reveals the main components, that is, the processes hidden in the protein database. Feature selection for protein sequence classification was performed using logistic regression models and SVD. Beyond feature selection, combining logistic regression models with SVD classified unknown sequences better than SVD alone. We also present a method that uses information from known protein databases to build logistic regression models that allow prediction for a new amino acid sequence. We successfully tested the method in ten instances, which generated models for predicting insulin, globin, keratin, cytochrome, albumin, collagen, fibrinogen, and proteins related to cystic fibrosis, Alzheimer's disease, and schizophrenia. SVD followed by optimization allows visualization of high-dimensional genomes by mapping multivariate data from their high-dimensional representation into 2D or 3D space. These results matter because SVD can address potential problems with alignment algorithms and can substitute for those methods, for example in whole-genome analysis.
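The recoding and comparison steps described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the thesis's implementation: the function names, the toy sequences, the choice of p = 2 and k = 2, and the decision to scale the reduced coordinates by the singular values are all assumptions made here for the example.

```python
from itertools import product

import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_index(p):
    """Map each of the 20**p possible p-peptides to a vector position."""
    return {"".join(w): i for i, w in enumerate(product(AMINO_ACIDS, repeat=p))}


def frequency_vector(seq, p, index):
    """Count every overlapping p-peptide window in seq (hypothetical helper)."""
    vec = np.zeros(len(index))
    for i in range(len(seq) - p + 1):
        pos = index.get(seq[i:i + p])
        if pos is not None:  # skip windows containing non-standard residues
            vec[pos] += 1
    return vec


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


p = 2  # word size, kept small for the example
index = peptide_index(p)

# Toy sequences invented for illustration; the first two differ by one residue.
seqs = ["MKVLAAGIVGLLLA", "MKVLAAGIVGLKLA", "GSPGPPGPAGPPGP"]

# Rows are p-peptides, columns are sequences: a (20**p) x n matrix.
M = np.column_stack([frequency_vector(s, p, index) for s in seqs])

# Thin SVD, then keep only the k largest singular values.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
coords = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per sequence

# Similarity metrics in the reduced space.
sim_close = cosine(coords[0], coords[1])
sim_far = cosine(coords[0], coords[2])
dist_close = float(np.linalg.norm(coords[0] - coords[1]))
dist_far = float(np.linalg.norm(coords[0] - coords[2]))
```

In this toy setting the two near-identical sequences end up with a higher cosine and a smaller Euclidean distance than the unrelated pair, which matches the direction of the BLAST correlations reported in the abstract. Because the vector length grows as 20^p, a sparse representation would likely be preferable for larger p.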
id UFMG_59f5c7ae8863adaa1a0a051fba52c688
oai_identifier_str oai:repositorio.ufmg.br:1843/BUOS-8L4RSA
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
Portuguese abstract (translated): Extracting patterns from protein sequence data is one of the challenges of computational biology. This work presents a methodology that uses techniques from linear algebra, statistics, and optimization to analyze primary protein sequences. Each sequence is first transformed into a frequency vector of peptides of size p, considering all possible combinations of amino acids that form a p-peptide. With 20 amino acids, the vector space model consists of vectors of size 20^p. To assess the biological validity of the method, the similarity measures used with SVD, Euclidean distance and cosine, were compared with the similarity measures used by a sequence alignment program (BLAST). Euclidean distance was negatively correlated with bit score (r>-0.6) and positively correlated with E value (r>+0.7). Cosine showed a negative correlation with E value (r>-0.7) and a positive correlation with bit score (r>+0.8). An estimate was also obtained of the degree of agreement between cosine and Euclidean distance and the result produced by a standard sequence alignment program when classifying an unknown sequence. As for the biological interpretation of SVD, the singular values visualized as scree plots reveal the main components, that is, the number of processes hidden in a protein sequence database. Combining SVD with optimization techniques made possible the multidimensional visualization of genomes and other multivariate data in 2D or 3D. Combining logistic regression models with SVD, in turn, allowed the selection of important features for classifying protein sequences. The main contribution of this thesis concerns the biological validity of using singular value decomposition (SVD) for similarity analysis and pattern extraction in protein sequences. Before this work, many doubts remained about the biological significance of treating a protein as a vector in multidimensional space and, above all, about the validity of similarity analysis by means of linear algebra techniques. Biologically valid results were obtained even without substitution matrices or sequence alignment algorithms. Describing a protein as a vector means that not only SVD but every other tool for manipulating vectors and matrices, from linear algebra, physics, statistics, geometry, and computing, can also be used to search for similarities and extract patterns in protein sequences.
dc.title.none.fl_str_mv Uso da álgebra linear para análise de similaridades e extração de padrões em sequências protéicas
author Braulio Roberto Goncalves Marinho Couto
author_facet Braulio Roberto Goncalves Marinho Couto
author_role author
dc.contributor.none.fl_str_mv Marcos Augusto dos Santos
Marcelo Matos Santoro
Mohammed J. Zaki
Carlos Henrique da Silveira
Frederico Ferreira Campos Filho
Jose Miguel Ortega
dc.contributor.author.fl_str_mv Braulio Roberto Goncalves Marinho Couto
dc.subject.por.fl_str_mv Bioinformática
publishDate 2010
dc.date.none.fl_str_mv 2010-11-23
2019-08-14T08:10:24Z
2019-08-14T08:10:24Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/BUOS-8L4RSA
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
UFMG
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv repositorio@ufmg.br