Positional Amino acid frequency patterns for automatic protein annotation

Bibliographic details
Main author: Silva, Andreia Carina Pereira da
Publication date: 2015
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10451/22424
Summary: Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015
Abstract: Today most proteins contained in protein databases have been annotated through electronic inference. Given the amount of data generated by high-throughput methods, electronic inference remains the only viable path to understanding proteins' biochemical function(s), cellular location(s), and participation in cellular processes, as well as their structure and interactions. The feature learning model proposed here aims to offer a new perspective on the protein function annotation problem at the positional amino acid level. Initially, the probabilistic scores for each amino acid at each protein position are obtained via a traditional PSI-BLAST search, which generates a position-specific scoring matrix (PSSM) containing this information. Each protein's positional amino acid frequency patterns (PAFPs) are filtered by a threshold to reduce the number of PAFPs irrelevant to the protein's function. The remaining PAFPs are then clustered with their closest relatives in Euclidean distance using the k-means algorithm, which identifies a sort of fingerprint of amino acid score patterns. These clusters are then associated with the Gene Ontology (GO) terms retrieved for the training proteins using the arules package for R; that is, association rules are established between the resulting k-means clusters of PAFPs and the GO terms. A threshold of 300 for the sum of PAFPs generated 280 GO terms, with a support of 0.0005 (about 30 proteins) and a confidence of 40%. These terms were used to describe 516591 of the 549008 proteins in the July 2015 release of Swiss-Prot. Most GO terms were not at leaf level but higher in the ontology.
The model assigns far more proteins to each GO term than are annotated to it, yet it also fails to assign proteins annotated with the GO term, resulting in high recall but not equally high precision. Note, however, that these results do not mean the inference is incorrect, only that there is no evidence to support it one way or the other. Moreover, the training set contains 7271 GO terms with a support of at least 30 proteins, so the model could be expected to return a similar number of GO terms. Despite falling short of that expectation, the results strongly suggest that the presence of certain PAFPs within proteins may be important for their function. It is also notable that the strongest signal was found for terms with a very low positive ratio, which are typically very difficult classification problems. The results strongly suggest that it may be possible to find annotation clues by looking at amino acid substitution patterns alone. The results were not perfect, however, and more work will be required to further validate these initial findings.
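The pipeline outlined in the abstract (filter PAFPs by a score-sum threshold, cluster them with k-means, then link clusters to GO terms through support and confidence) can be illustrated with a minimal Python sketch. This is not the thesis code, which used PSI-BLAST and the arules package for R. The function names, the toy data, the "GO:A"-style labels, and the assumption that a PAFP is kept when its score sum is at or above the threshold are all hypothetical, and the rule miner only handles single-cluster antecedents.

```python
import math
import random
from collections import defaultdict


def filter_pafps(pssm_rows, threshold):
    """Keep PSSM rows whose score sum reaches the threshold.

    Assumption: "sieving" means sum(row) >= threshold; the abstract does
    not state the direction of the comparison.
    """
    return [row for row in pssm_rows if sum(row) >= threshold]


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means with Euclidean distance; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def mine_rules(transactions, min_support, min_confidence):
    """Find rules "cluster -> GO term" over per-protein transactions.

    transactions: list of (cluster_ids, go_terms) pairs, one per protein.
    support    = proteins with both cluster and term / all proteins
    confidence = proteins with both cluster and term / proteins with cluster
    """
    n = len(transactions)
    cluster_count = defaultdict(int)
    pair_count = defaultdict(int)
    for clusters, go_terms in transactions:
        for c in set(clusters):
            cluster_count[c] += 1
            for g in set(go_terms):
                pair_count[(c, g)] += 1
    rules = []
    for (c, g), both in pair_count.items():
        support = both / n
        confidence = both / cluster_count[c]
        if support >= min_support and confidence >= min_confidence:
            rules.append((c, g, support, confidence))
    return sorted(rules)


# Toy data (hypothetical): four proteins, each with the set of PAFP
# clusters it contains and the GO terms it is annotated with.
toy_transactions = [
    ({0, 1}, {"GO:A"}),
    ({0}, {"GO:A", "GO:B"}),
    ({1}, {"GO:B"}),
    ({0, 2}, {"GO:A"}),
]
print(mine_rules(toy_transactions, min_support=0.5, min_confidence=0.4))
# prints [(0, 'GO:A', 0.75, 1.0)]
```

Reading the abstract's figures with these definitions, a support of 0.0005 that corresponds to about 30 proteins implies a training set on the order of 60,000 proteins (30 / 0.0005); that size is inferred here, not stated in the record.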
dc.contributor.none.fl_str_mv Falcão, André Osório e Cruz de Azerêdo, 1969-
Repositório da Universidade de Lisboa
dc.subject.por.fl_str_mv Automatic protein annotation
PSI-BLAST
K-means clustering
Association rule learning
Gene Ontology
Master's theses - 2015
Departamento de Informática
publishDate 2015
dc.date.none.fl_str_mv 2015
2015
2015-01-01T00:00:00Z
2016-01-26T11:32:24Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10451/22424
TID:201067781
dc.language.iso.fl_str_mv eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP