Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

Ruiz-Blanco Y.B.; Agüero-Chapin G.; García-Hernández E.; Álvarez O.; Antunes A.; Green J.

Bibliographic details
Main author:	Ruiz-Blanco Y.B.
Publication date:	2017
Other authors:	Agüero-Chapin G., García-Hernández E., Álvarez O., Antunes A., Green J.
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://hdl.handle.net/10216/120519
Abstract:	Background: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, for the identification of enzyme-like proteins. At the same time, our novel family of 3D-structure-based descriptors is introduced for the first time. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors because its proven structural diversity permits one to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed using a subset of formerly uncharacterized proteins that currently represent a benchmark annotation dataset. Results: Four protein descriptor families, namely sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D-structure-based (3D) features, were assessed using the D&D benchmark dataset. We show that only the families of ProtDCal's descriptors (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D-structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome. Conclusions: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performances observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis place our methodology in the top range of methods for modelling and predicting protein function using alignment-free approaches. © 2017 The Author(s).
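
As a rough illustration of what a sequence-composition-based (0D) feature can look like, the sketch below encodes a protein sequence as its 20 amino-acid fractions, an alignment-free vector that could be fed to a classifier such as an SVM. This is an assumption-laden toy example: the function name `composition` and the choice of plain residue fractions are hypothetical and are not taken from ProtDCal or TI2BioP, whose actual descriptors are not specified in this record.

```python
from collections import Counter

# The 20 standard amino acids in a fixed order, so every sequence
# maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper for illustration only; it is not ProtDCal's
    0D descriptor family, just the simplest composition-based encoding.
    Non-standard symbols (e.g. 'X') are ignored.
    """
    seq = sequence.upper()
    counts = Counter(res for res in seq if res in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: encode a short, made-up peptide.
features = composition("MKVLAAGIVG")
```

Because the vector depends only on residue counts, two proteins can be compared without aligning them, which is the property that matters when pair-wise identity falls below the 30% twilight-zone threshold mentioned in the abstract.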

Item metadata

id	RCAP_8c7fba1ff8f8b7ba8dac3ae7b612020d
oai_identifier_str	oai:repositorio-aberto.up.pt:10216/120519
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone
author	Ruiz-Blanco Y.B.
author_facet	Ruiz-Blanco Y.B. Agüero-Chapin G. García-Hernández E. Álvarez O. Antunes A. Green J.
author_role	author
author2	Agüero-Chapin G. García-Hernández E. Álvarez O. Antunes A. Green J.
author2_role	author author author author author
dc.contributor.author.fl_str_mv	Ruiz-Blanco Y.B. Agüero-Chapin G. García-Hernández E. Álvarez O. Antunes A. Green J.
dc.subject.por.fl_str_mv	Alignment Bacteria Benchmarking Classification (of information) Encoding (symbols) Enzymes Support vector machines Topology Antibacterial peptides Computational predictions Descriptors Post-translational modifications ProtDCal Protein analysis Sequence based features TI2BioP Proteins
publishDate	2017
dc.date.none.fl_str_mv	2017 2017-01-01T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/10216/120519
url	https://hdl.handle.net/10216/120519
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	14712105 10.1186/s12859-017-1758-x
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	BMC
publisher.none.fl_str_mv	BMC
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799135742815895552
