The impact of sequence length and number of sequences on promoter prediction performance.

Carvalho, Sávio Gonçalves; Cota, Renata Guerra de Sá; Merschmann, Luiz Henrique de Campos

Bibliographic details
Main author:	Carvalho, Sávio Gonçalves
Publication date:	2015
Other authors:	Cota, Renata Guerra de Sá; Merschmann, Luiz Henrique de Campos
Document type:	Article
Language:	eng
Source title:	Repositório Institucional da UFOP
Full text:	http://www.repositorio.ufop.br/handle/123456789/6937 https://doi.org/10.1186/1471-2105-16-S19-S5
Abstract:	Background: Rapid growth in the capacity to sequence new genomes has highlighted the need to automate data analysis in order to speed up genomic annotation and reduce its cost. Because promoter identification is an important step in functional genomic annotation, several studies have proposed computational approaches to predict promoters. Different classifiers and characteristics of promoter sequences have been used to address this prediction problem. However, several works in the literature have addressed it using datasets containing sequences of 250 nucleotides or more. Because the sequence length determines the number of dataset attributes, even a limited number of properties used to characterize the sequences yields datasets with a high number of attributes for training classifiers. Since high-dimensional datasets can degrade classifiers' predictive performance or even require infeasible processing time, training classifiers on datasets with a reduced number of attributes is essential to obtain good predictive performance at low computational cost. To the best of our knowledge, no work in the literature has systematically examined the relation between sequence length and the predictive performance of classifiers. Thus, in this work, we evaluated the impact of sequence length variation and training dataset size (number of sequences) on the predictive performance of classifiers. Results: We built sixteen datasets composed of sequences of different lengths (ranging from 12 to 301 nucleotides) and evaluated them using the SVM, Random Forest and k-NN classifiers. The best predictive performances reached by SVM and Random Forest remained relatively stable for datasets composed of sequences varying in length from 301 to 41 nucleotides, while k-NN achieved its best performance for the dataset composed of 101-nucleotide sequences. Using sequences of only 41 nucleotides, we also analyzed the impact of increasing the number of sequences in a dataset on the predictive performance of the same three classifiers. Datasets containing 14,000, 80,000, 100,000 and 120,000 sequences were built and evaluated. All classifiers achieved better predictive performance for datasets containing 80,000 sequences or more. Conclusion: The experimental results show that several datasets composed of shorter sequences achieved better predictive performance than datasets composed of longer sequences, and also required significantly less processing time. Furthermore, increasing the number of sequences in a dataset proved beneficial to the predictive power of the classifiers.
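
The abstract's point that sequence length determines the number of dataset attributes can be illustrated with a minimal sketch. It is not the authors' method: the helper names `center_window` and `encode`, the one-hot encoding (4 binary attributes per nucleotide), the choice to take windows around the middle of the sequence, and the synthetic sequence are all assumptions made for illustration, since this record does not state which properties or window positions the paper used.

```python
# Hypothetical helpers for illustration only; not taken from the paper.
NUCLEOTIDES = "ACGT"


def center_window(seq: str, length: int) -> str:
    """Return the `length` nucleotides around the middle of `seq` (assumed windowing)."""
    if length > len(seq):
        raise ValueError("window longer than sequence")
    start = (len(seq) - length) // 2
    return seq[start:start + length]


def encode(seq: str) -> list[int]:
    """One-hot encode a sequence: 4 binary attributes per nucleotide (assumed encoding)."""
    attributes = []
    for base in seq:
        attributes.extend(1 if base == n else 0 for n in NUCLEOTIDES)
    return attributes


# Synthetic 301-nucleotide sequence, not real promoter data.
full = "ACGT" * 75 + "A"

# Print each window length next to the number of attributes it produces.
for length in (301, 101, 41, 12):
    window = center_window(full, length)
    print(length, len(encode(window)))
```

Under this assumed encoding the attribute count is 4 times the sequence length, so a 301-nucleotide sequence yields 1204 attributes and a 41-nucleotide window yields 164, which is the kind of dimensionality reduction the abstract refers to.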

Item metadata

id	UFOP_d9cdac3596309fdbe69d4855c7831016
oai_identifier_str	oai:localhost:123456789/6937
network_acronym_str	UFOP
network_name_str	Repositório Institucional da UFOP
repository_id_str	3233
license	This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Source: the article itself.
dc.title.pt_BR.fl_str_mv	The impact of sequence length and number of sequences on promoter prediction performance.
title	The impact of sequence length and number of sequences on promoter prediction performance.
author	Carvalho, Sávio Gonçalves
author_facet	Carvalho, Sávio Gonçalves Cota, Renata Guerra de Sá Merschmann, Luiz Henrique de Campos
author_role	author
author2	Cota, Renata Guerra de Sá Merschmann, Luiz Henrique de Campos
author2_role	author author
dc.contributor.author.fl_str_mv	Carvalho, Sávio Gonçalves Cota, Renata Guerra de Sá Merschmann, Luiz Henrique de Campos
publishDate	2015
dc.date.issued.fl_str_mv	2015
dc.date.accessioned.fl_str_mv	2016-08-26T20:02:35Z
dc.date.available.fl_str_mv	2016-08-26T20:02:35Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	CARVALHO, S. G.; COTA, R. G. de S.; MERSCHMANN, L. H. de C. The impact of sequence length and number of sequences on promoter prediction performance. BMC Bioinformatics, v. 16, p. S5, 2015. Available at: <http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-16-S19-S5#Declarations>. Accessed: 7 Aug. 2016.
dc.identifier.uri.fl_str_mv	http://www.repositorio.ufop.br/handle/123456789/6937
dc.identifier.issn.none.fl_str_mv	1471-2105
dc.identifier.doi.none.fl_str_mv	https://doi.org/10.1186/1471-2105-16-S19-S5
url	http://www.repositorio.ufop.br/handle/123456789/6937 https://doi.org/10.1186/1471-2105-16-S19-S5
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFOP instname:Universidade Federal de Ouro Preto (UFOP) instacron:UFOP
instname_str	Universidade Federal de Ouro Preto (UFOP)
instacron_str	UFOP
institution	UFOP
reponame_str	Repositório Institucional da UFOP
collection	Repositório Institucional da UFOP
bitstream.url.fl_str_mv	http://www.repositorio.ufop.br/bitstream/123456789/6937/2/license.txt http://www.repositorio.ufop.br/bitstream/123456789/6937/1/ARTIGO_ImpactSequenceLength.pdf
bitstream.checksum.fl_str_mv	62604f8d955274beb56c80ce1ee5dcae 8555c73150e9812a433e03e572d26251
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFOP - Universidade Federal de Ouro Preto (UFOP)
repository.mail.fl_str_mv	repositorio@ufop.edu.br
_version_	1801685760828506112
