Uso da álgebra linear para análise de similaridades e extração de padrões em sequências protéicas (Use of linear algebra for similarity analysis and pattern extraction in protein sequences)
Main author: | Braulio Roberto Goncalves Marinho Couto |
---|---|
Publication date: | 2010 |
Document type: | Thesis |
Language: | por |
Source title: | Repositório Institucional da UFMG |
Full text: | http://hdl.handle.net/1843/BUOS-8L4RSA |
Abstract: | Extracting patterns from protein sequence data is one of the challenges of computational biology. Here we use linear algebra methods and logistic regression models to analyze sequences without requiring multiple alignments. First, we treat a biomolecular sequence as a complex written language and recode it as a p-peptide frequency vector, counting every overlapping window of p residues. With 20 amino acids, this yields a vector of dimension 20^p, where p is the word size. Singular value decomposition (SVD) and/or logistic regression models are then applied to the data to extract patterns or to visualize high-dimensional data. Spearman correlation (r) was used to evaluate the association between the statistics used by BLAST and the similarity metrics used with SVD. Euclidean distance was negatively correlated with bit score (r>-0.6) and positively correlated with E value (r>+0.7). Cosine was negatively correlated with E value (r>-0.7) and positively correlated with bit score (r>+0.8). In addition, we compared the edit distance between each pair of sequences with the corresponding cosines and Euclidean distances from SVD. The correlation between cosine and edit distance was -0.32 (P < 0.01), and that between Euclidean distance and edit distance was +0.70 (P < 0.01). We also evaluated the ability of SVD to classify sequences according to their categories. With a 3-peptide frequency matrix, all queries were correctly classified (accuracy = 100%). We propose a biological interpretation of the SVD: the singular value spectrum, visualized as a scree plot, reveals the main components, that is, the processes hidden in the protein database. Feature selection for protein sequence classification was performed using logistic regression models and SVD. Beyond feature selection, combining logistic regression models with SVD classified unknown sequences better than SVD alone. We also present a method that uses information from known protein databases to build logistic regression models that allow predictions for a new amino acid sequence. We successfully tested the method in ten instances, which generated models for predicting insulin, globin, keratin, cytochrome, albumin, collagen, fibrinogen, and proteins related to cystic fibrosis, Alzheimer disease, and schizophrenia. SVD followed by optimization allows visualization of high-dimensional genomes by mapping multivariate data from their high-dimensional representation into 2D or 3D space. These results matter because SVD can address potential problems with alignment algorithms and can substitute for those methods, for example in whole-genome analysis. |
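The recoding step described in the abstract can be illustrated with a minimal sketch. It assumes plain overlapping windows over the 20 standard residues and compares the raw frequency vectors directly; the SVD projection used in the thesis is not reproduced here. The function names and the toy sequences are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def peptide_frequencies(sequence, p=3):
    """Count every overlapping window of length p in the sequence.

    The result is a sparse view of the 20**p-dimensional frequency vector:
    p-peptides that never occur are simply absent (count 0).
    """
    windows = (sequence[i:i + p] for i in range(len(sequence) - p + 1))
    # Assumption of this sketch: windows with non-standard symbols are skipped.
    return Counter(w for w in windows if all(c in AMINO_ACIDS for c in w))


def cosine(u, v):
    """Cosine of the angle between two sparse frequency vectors."""
    dot = sum(count * v[word] for word, count in u.items())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean(u, v):
    """Euclidean distance between two sparse frequency vectors."""
    words = set(u) | set(v)
    return sqrt(sum((u[w] - v[w]) ** 2 for w in words))


# Toy sequences, invented for illustration only.
a = peptide_frequencies("MKVLAAGIVK", p=3)
b = peptide_frequencies("MKVLAAGLVK", p=3)  # differs from a by one residue
c = peptide_frequencies("GSHMWWPPYY", p=3)  # shares no 3-peptide with a

print(cosine(a, b), euclidean(a, b))
print(cosine(a, c), euclidean(a, c))
```

A sparse `Counter` is used because with p = 3 the full vector already has 20^3 = 8000 entries, almost all of them zero for a single sequence. In this toy case the near-identical pair has the higher cosine and the smaller Euclidean distance, which is the direction of association the abstract reports against BLAST bit scores.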
id |
UFMG_59f5c7ae8863adaa1a0a051fba52c688 |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/BUOS-8L4RSA |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
repository_id_str |
|
spelling |
Extracting patterns from protein sequence data is one of the challenges of computational biology. This work presents a methodology that uses techniques from linear algebra, statistics, and optimization to analyze primary protein sequences. Each sequence is first transformed into a frequency vector of peptides of length p, considering all possible combinations of amino acids that can form a p-peptide. With 20 amino acids, the vector space model consists of vectors of size 20^p. To assess the biological validity of the method, the SVD similarity measures, Euclidean distance and cosine, were compared with the similarity measures used by a sequence alignment program (BLAST). Euclidean distance was negatively correlated with bit score (r>-0.6) and positively correlated with E value (r>+0.7). Cosine, in turn, was negatively correlated with E value (r>-0.7) and positively correlated with bit score (r>+0.8). An estimate was also obtained of the degree of agreement between cosine and Euclidean distance and the result produced by a standard sequence alignment program when classifying an unknown sequence. As for the biological interpretation of the SVD, the singular values visualized as scree plots reveal the main components, that is, the number of processes hidden in a protein sequence database. Combining SVD with optimization techniques made possible the multidimensional visualization of genomes and other multivariate data in 2D or 3D. Combining logistic regression models with SVD, in turn, allowed the selection of attributes that are important for classifying protein sequences. The main contribution of this thesis concerns the biological validity of using singular value decomposition (SVD) for similarity analysis and pattern extraction in protein sequences. Before this work, many doubts persisted about the biological significance of treating a protein as a vector in multidimensional space and, above all, about the validity of similarity analysis by means of linear algebra techniques. Even without using substitution matrices or sequence alignment algorithms, biologically valid results were obtained. Describing a protein as a vector means that not only SVD but all the other tools for manipulating vectors and matrices, from linear algebra, physics, statistics, geometry, and computing, can also be used to search for similarities and extract patterns in protein sequences. |
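The record mentions logistic regression models built on top of the vector representation to classify sequences. As a rough sketch of that idea, the code below fits a logistic regression by batch gradient descent. How the thesis actually fits its models is not stated in this record, so the fitting procedure, the function names, and the two-dimensional toy features (standing in for, say, two leading SVD coordinates per sequence) are all assumptions made for illustration.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this sample drives the gradient.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Estimated probability that x belongs to the target family."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical reduced coordinates, invented for illustration only;
# label 1 = member of the target protein family, 0 = not.
X = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
y = [1, 1, 1, 0, 0, 0]

w, b = fit_logistic(X, y)
print(predict_proba(w, b, [0.85, 0.15]))  # unseen point near the positive class
```

The sign and size of each fitted weight indicate how strongly a feature pushes a sequence toward the target family, which is one simple way such a model can support the feature selection the abstract describes.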
dc.title.none.fl_str_mv |
Uso da álgebra linear para análise de similaridades e extração de padrões em sequências protéicas |
author |
Braulio Roberto Goncalves Marinho Couto |
author_facet |
Braulio Roberto Goncalves Marinho Couto |
author_role |
author |
dc.contributor.none.fl_str_mv |
Marcos Augusto dos Santos; Marcelo Matos Santoro; Mohammed J. Zaki; Carlos Henrique da Silveira; Frederico Ferreira Campos Filho; Jose Miguel Ortega |
dc.contributor.author.fl_str_mv |
Braulio Roberto Goncalves Marinho Couto |
dc.subject.por.fl_str_mv |
Bioinformatics |
topic |
Bioinformatics |
publishDate |
2010 |
dc.date.none.fl_str_mv |
2010-11-23 2019-08-14T08:10:24Z 2019-08-14T08:10:24Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/BUOS-8L4RSA |
url |
http://hdl.handle.net/1843/BUOS-8L4RSA |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais UFMG |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais UFMG |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
instname_str |
Universidade Federal de Minas Gerais (UFMG) |
instacron_str |
UFMG |
institution |
UFMG |
reponame_str |
Repositório Institucional da UFMG |
collection |
Repositório Institucional da UFMG |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
repository.mail.fl_str_mv |
repositorio@ufmg.br |
_version_ |
1816829769691430912 |