A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences

Carels, Nicolas; Frias, Diego

Bibliographic details
Main author:	Carels, Nicolas
Publication date:	2013
Other authors:	Frias, Diego
Document type:	Article
Language:	eng
Source title:	Repositório Institucional da FIOCRUZ (ARCA)
Full text:	https://www.arca.fiocruz.br/handle/icict/11675
Affiliation:	Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Genômica Funcional e Bioinformática. Rio de Janeiro, RJ, Brasil.

Item metadata

id	CRUZ_9e25d011fad1b8bd520e9930eaff579b
oai_identifier_str	oai:www.arca.fiocruz.br:icict/11675
network_acronym_str	CRUZ
network_name_str	Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str	2135
spelling	CARELS, Nicolas; FRIAS, Diego. A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences. Bioinformatics and Biology Insights, n.7, p.35–54, 2013. ISSN 1177-9322. DOI 10.4137/BBI.S10053. Libertas Academica. https://www.arca.fiocruz.br/handle/icict/11675

Affiliations: Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Genômica Funcional e Bioinformática. Rio de Janeiro, RJ, Brasil. / Universidade do Estado da Bahia (UNEB). Departamento de Ciências Exatas e da Terr. Salvador, BA, Brasil.

Abstract: In this study, we investigated the modalities of coding open reading frame (cORF) classification of expressed sequence tags (EST) by using the universal feature method (UFM). The UFM algorithm is based on the scoring of purine bias (Rrr) and stop codon frequencies. UFM classifies ORFs as coding or non-coding through a score based on 5 factors: (i) stop codon frequency; (ii) the product of the probabilities of purines occurring in the three positions of nucleotide triplets; (iii) the product of the probabilities of Cytosine (C), Guanine (G), and Adenine (A) occurring in the 1st, 2nd, and 3rd positions of triplets, respectively; (iv) the probabilities of a G occurring in the 1st and 2nd positions of triplets; and (v) the probabilities of a T occurring in the 1st and an A in the 2nd position of triplets. Because UFM is based on primary determinants of coding sequences that are conserved throughout the biosphere, it is suitable for cORF classification of any sequence in eukaryote transcriptomes without prior knowledge.

Considering the protein sequences of the Protein Data Bank (RCSB PDB or more simply PDB) as a reference, we found that UFM classifies cORFs of ≥200 bp (if the coding strand is known) and cORFs of ≥300 bp (if the coding strand is unknown), and releases them in their coding strand and coding frame, which allows their automatic translation into protein sequences with a success rate equal to or higher than 95%. We first established the statistical parameters of UFM using ESTs from Plasmodium falciparum, Arabidopsis thaliana, Oryza sativa, Zea mays, Drosophila melanogaster, Homo sapiens and Chlamydomonas reinhardtii in reference to the protein sequences of PDB. Second, we showed that the success rate of cORF classification using UFM is expected to apply to approximately 95% of higher eukaryote genes that encode proteins. Third, we used UFM in combination with CAP3 to assemble large EST samples into cORFs that we used to analyze transcriptome phenotypes in rice, maize, and humans. We discuss the error rate and the interference of noisy sequences such as pseudogenes, transposons, and retrotransposons. This method is suitable for rapid cORF extraction from transcriptome data and allows correct description of the genome phenotypes of plant genomes without prior knowledge. Additional care is necessary when addressing the human transcriptome due to the interference caused by large amounts of noisy sequences. UFM can be regarded as a low complexity tool for prior knowledge extraction concerning the coding fraction of the transcriptome of any eukaryote. Due to its low level of complexity, UFM is also very robust to variations of codon usage.

Keywords: Genomics; RNY; EST; ORF; CDS; UFM; Classification. DeCS terms: Transcriptoma; Genômica.
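
The five factors listed in the abstract are all simple per-frame statistics over nucleotide triplets, so they can be tabulated in a few lines. The following is a minimal Python sketch of that tabulation only. The function names (`codons`, `frame_features`) and the dictionary keys are hypothetical, and the abstract does not say how the factors are combined into the UFM score or which thresholds are used, so no score is computed here.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code stop codons


def codons(seq, frame):
    """Split seq into complete triplets starting at offset frame (0, 1 or 2)."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def frame_features(seq, frame):
    """Tabulate the raw quantities behind the five factors for one frame.

    Illustrative only: this is not the authors' implementation and it
    does not reproduce their scoring formula.
    """
    triplets = codons(seq, frame)
    n = len(triplets)
    if n == 0:
        raise ValueError("sequence too short for this frame")

    # One nucleotide counter per triplet position (index 0, 1, 2).
    pos = [Counter(t[k] for t in triplets) for k in range(3)]

    def p(base, k):
        # Observed probability of `base` at triplet position k.
        return pos[k][base] / n

    def purine(k):
        # Purines are A and G.
        return p("A", k) + p("G", k)

    return {
        # (i) stop codon frequency
        "stop_freq": sum(t in STOP_CODONS for t in triplets) / n,
        # (ii) product of purine probabilities over the three positions
        "purine_product": purine(0) * purine(1) * purine(2),
        # (iii) product of P(C at 1st), P(G at 2nd), P(A at 3rd)
        "cga_product": p("C", 0) * p("G", 1) * p("A", 2),
        # (iv) P(G) at the 1st and 2nd positions
        "g1": p("G", 0),
        "g2": p("G", 1),
        # (v) P(T) at the 1st and P(A) at the 2nd position
        "t1": p("T", 0),
        "a2": p("A", 1),
    }
```

Calling `frame_features(seq, f)` for `f` in `0, 1, 2` on a sequence and on its reverse complement gives the six candidate frames; how those six feature sets would be ranked is left open here because it depends on the scoring described in the full paper.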
dc.title.pt_BR.fl_str_mv	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
spellingShingle	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences Carels, Nicolas Genomics RNY EST ORF CDS UFM Classification Transcriptoma Genômica
title_short	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_full	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_fullStr	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_full_unstemmed	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
title_sort	A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences
author	Carels, Nicolas
author_facet	Carels, Nicolas Frias, Diego
author_role	author
author2	Frias, Diego
author2_role	author
dc.contributor.author.fl_str_mv	Carels, Nicolas Frias, Diego
dc.subject.en.pt_BR.fl_str_mv	Genomics RNY EST ORF CDS UFM Classification
topic	Genomics RNY EST ORF CDS UFM Classification Transcriptoma Genômica
dc.subject.decs.pt_BR.fl_str_mv	Transcriptoma Genômica
description	Fundação Oswaldo Cruz. Instituto Oswaldo Cruz. Laboratório de Genômica Funcional e Bioinformática. Rio de Janeiro, RJ, Brasil.
publishDate	2013
dc.date.issued.fl_str_mv	2013
dc.date.accessioned.fl_str_mv	2015-09-21T17:25:09Z
dc.date.available.fl_str_mv	2015-09-21T17:25:09Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	CARELS, Nicolas; FRIAS, Diego. A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences. Bioinformatics and Biology Insights, n.7, p.35–54, 2013.
dc.identifier.uri.fl_str_mv	https://www.arca.fiocruz.br/handle/icict/11675
dc.identifier.issn.pt_BR.fl_str_mv	1177-9322
dc.identifier.doi.pt_BR.fl_str_mv	10.4137/BBI.S10053
identifier_str_mv	CARELS, Nicolas; FRIAS, Diego. A Statistical Method without Training Step for the Classification of Coding Frame in Transcriptome Sequences. Bioinformatics and Biology Insights, n.7, p.35–54, 2013. 1177-9322 10.4137/BBI.S10053
url	https://www.arca.fiocruz.br/handle/icict/11675
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Libertas Academica
publisher.none.fl_str_mv	Libertas Academica
dc.source.none.fl_str_mv	reponame:Repositório Institucional da FIOCRUZ (ARCA) instname:Fundação Oswaldo Cruz (FIOCRUZ) instacron:FIOCRUZ
instname_str	Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str	FIOCRUZ
institution	FIOCRUZ
reponame_str	Repositório Institucional da FIOCRUZ (ARCA)
collection	Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv	https://www.arca.fiocruz.br/bitstream/icict/11675/1/license.txt https://www.arca.fiocruz.br/bitstream/icict/11675/2/nicolas_farrelefrias_IOC_2013.pdf https://www.arca.fiocruz.br/bitstream/icict/11675/3/nicolas_farrelefrias_IOC_2013.pdf.txt
bitstream.checksum.fl_str_mv	7d48279ffeed55da8dfe2f8e81f3b81f 22a26c143725fc8272d56eab95256053 fd124a54685e2a0e4ca292aa5a379f96
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv	repositorio.arca@fiocruz.br
_version_	1813009056113098752
