Positional Amino acid frequency patterns for automatic protein annotation
Main author: | |
---|---|
Publication date: | 2015 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10451/22424 |
Abstract: | Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015 |
id |
RCAP_6ed7c10353c9c70136cd48bdbbc1d454 |
---|---|
oai_identifier_str |
oai:repositorio.ul.pt:10451/22424 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Positional Amino acid frequency patterns for automatic protein annotation. Automatic protein annotation; PSI-BLAST; K-means clustering; Association rule learning; Gene Ontology; Master's theses - 2015; Departamento de Informática. Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015.
Today, most proteins in protein databases have been annotated through electronic inference. Given the amount of data generated by high-throughput methods, electronic inference remains the only viable path to understanding proteins' biochemical function(s), cellular location(s), and participation in cellular processes, as well as their structure and interactions. The feature learning model proposed here aims to introduce a new perspective on the protein function annotation problem at the positional amino acid level. Initially, the probabilistic scores for each amino acid at each protein position are acquired via a traditional PSI-BLAST search, which generates a position-specific scoring matrix (PSSM) with this information. Each protein's positional amino acid frequency patterns (PAFPs) are sieved through a threshold to decrease the number of PAFPs irrelevant to the protein's function. Afterwards, these are clustered with their closest relatives in Euclidean distance via the k-means algorithm, identifying in this manner a sort of fingerprint of amino acid score patterns. These are then associated with Gene Ontology (GO) terms retrieved for the training proteins using the arules package for R, i.e., association rules are established between the resulting k-means clusters of PAFPs and the GO terms. The threshold of 300 for the sum of PAFPs generated 280 GO terms, with a support of 0.0005 (about 30 proteins) and a confidence of 40%. These terms were used to describe 516591 of the 549008 proteins in the July 2015 release of Swiss-Prot. Most GO terms were not at leaf level but higher.
The model infers far more proteins for each GO term than those annotated to it; however, it also fails to allocate some proteins annotated with the GO term, resulting in high recall but not equivalently high precision. Note, however, that these results do not mean the inference is incorrect, but rather that there is no evidence to support it one way or the other. Also, the training set contains 7271 GO terms with a support of at least 30 proteins, so the model could be expected to return a similar number of identified GO terms. Despite falling short of what was expected, the results strongly suggest that the presence of certain PAFPs within proteins may be important for their function. It is also interesting that the strongest signal was found for terms whose positive ratio is very low, which are typically very difficult classification problems. The results strongly suggest that it may be possible to find annotation clues by looking at amino acid substitution patterns alone. The results, however, were not perfect, and more work will certainly be required to further validate the initial findings.
Falcão, André Osório e Cruz de Azerêdo, 1969-; Repositório da Universidade de Lisboa; Silva, Andreia Carina Pereira da; 2016-01-26T11:32:24Z; 2015; 2015; 2015-01-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10451/22424; TID:201067781; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2023-11-08T16:09:36Z; oai:repositorio.ul.pt:10451/22424; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-19T21:40:01.362672; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
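The association step described in the abstract (rules linking a PAFP cluster to a GO term, filtered by support and confidence) can be illustrated with a minimal sketch. This is not the thesis code, which used the arules package in R; the function name `mine_rules`, the input layout, and the toy data are assumptions made only for illustration, and only single-cluster to single-term rules are considered.

```python
from collections import Counter
from itertools import product


def mine_rules(proteins, min_support=0.0005, min_confidence=0.4):
    """Find rules {cluster} -> {GO term} above support/confidence cutoffs.

    proteins: list of (cluster_ids, go_terms) pairs, each a set.
    The default cutoffs mirror the values quoted in the abstract; the
    input layout itself is a hypothetical choice for this sketch.
    """
    n = len(proteins)
    cluster_count = Counter()  # proteins containing each cluster
    pair_count = Counter()     # proteins containing cluster AND term
    for clusters, go_terms in proteins:
        cluster_count.update(clusters)
        pair_count.update(product(clusters, go_terms))

    rules = []
    for (cluster, term), both in pair_count.items():
        support = both / n                          # P(cluster and term)
        confidence = both / cluster_count[cluster]  # P(term | cluster)
        if support >= min_support and confidence >= min_confidence:
            rules.append((cluster, term, support, confidence))
    return sorted(rules)


# Toy data (hypothetical cluster labels, illustrative GO annotations).
toy = [
    ({"c1", "c2"}, {"GO:0005524"}),
    ({"c1"}, {"GO:0005524", "GO:0016301"}),
    ({"c2"}, {"GO:0016301"}),
    ({"c1"}, set()),
]

for cluster, term, support, confidence in mine_rules(toy, 0.25, 0.6):
    print(f"{cluster} -> {term}: support={support:.2f}, confidence={confidence:.2f}")
```

On the toy data only the rule from `c1` to `GO:0005524` passes both cutoffs, because `c1` occurs in three proteins and two of them carry that term. A protein would then be assigned every GO term whose rule antecedent cluster it contains.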
dc.title.none.fl_str_mv |
Positional Amino acid frequency patterns for automatic protein annotation |
title |
Positional Amino acid frequency patterns for automatic protein annotation |
spellingShingle |
Positional Amino acid frequency patterns for automatic protein annotation Silva, Andreia Carina Pereira da Automatic protein annotation PSI-BLAST K-means clustering Association rule learning Gene Ontology Master's theses - 2015 Departamento de Informática |
title_short |
Positional Amino acid frequency patterns for automatic protein annotation |
title_full |
Positional Amino acid frequency patterns for automatic protein annotation |
title_fullStr |
Positional Amino acid frequency patterns for automatic protein annotation |
title_full_unstemmed |
Positional Amino acid frequency patterns for automatic protein annotation |
title_sort |
Positional Amino acid frequency patterns for automatic protein annotation |
author |
Silva, Andreia Carina Pereira da |
author_facet |
Silva, Andreia Carina Pereira da |
author_role |
author |
dc.contributor.none.fl_str_mv |
Falcão, André Osório e Cruz de Azerêdo, 1969- Repositório da Universidade de Lisboa |
dc.contributor.author.fl_str_mv |
Silva, Andreia Carina Pereira da |
dc.subject.por.fl_str_mv |
Automatic protein annotation PSI-BLAST K-means clustering Association rule learning Gene Ontology Master's theses - 2015 Departamento de Informática |
topic |
Automatic protein annotation PSI-BLAST K-means clustering Association rule learning Gene Ontology Master's theses - 2015 Departamento de Informática |
description |
Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2015 |
publishDate |
2015 |
dc.date.none.fl_str_mv |
2015 2015 2015-01-01T00:00:00Z 2016-01-26T11:32:24Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10451/22424 TID:201067781 |
url |
http://hdl.handle.net/10451/22424 |
identifier_str_mv |
TID:201067781 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799134307513532416 |