A support vector machine based method to distinguish long non-coding RNAs from protein coding transcripts

Bibliographic details
Main author: Schneider, Hugo W.
Publication date: 2017
Other authors: Raiol, Tainá; Brigido, Marcelo M.; Walter, Maria Emilia M. T.; Stadler, Peter F.
Document type: Article
Language: eng
Source title: Repositório Institucional da FIOCRUZ (ARCA)
Full text: https://www.arca.fiocruz.br/handle/icict/42741
Abstract: Cnpq
id CRUZ_f82bd651207a51e6d2b177ad15cbabde
oai_identifier_str oai:www.arca.fiocruz.br:icict/42741
network_acronym_str CRUZ
network_name_str Repositório Institucional da FIOCRUZ (ARCA)
repository_id_str 2135
spelling Schneider, Hugo W.; Raiol, Tainá; Brigido, Marcelo M.; Walter, Maria Emilia M. T.; Stadler, Peter F.
2020-08-13T11:44:31Z
2017
SCHNEIDER, Hugo W. et al. A support vector machine based method to distinguish long non-coding RNAs from protein coding transcripts. BMC Genomics, [London], v. 18, n. 804, p.1-14, 2017.
1471-2164
https://www.arca.fiocruz.br/handle/icict/42741
10.1186/s12864-017-4178-4
Cnpq
University of Brasilia. Instituto de Ciências Exatas. Department of Computer Science. Brasília, DF, Brazil.
Fundação Oswaldo Cruz. Fiocruz Brasília. Brasília, DF, Brazil.
University of Brasilia. Instituto de Ciencias Biologicas. Laboratory of Molecular Biology. Brasília, DF, Brazil.
University of Brasilia. Instituto de Ciências Exatas. Department of Computer Science. Brasília, DF, Brazil.
University of Leipzig. Department of Computer Science and Interdisciplinary Center for Bioinformatics. Bioinformatics Group. Leipzig, Germany.
Background: In recent years, thousands of sequencing projects around the world have generated a rapidly increasing number of RNA transcripts, creating enormous volumes of transcript data to be analyzed. An important problem in analyzing these data is distinguishing long non-coding RNAs (lncRNAs) from protein coding transcripts (PCTs). We therefore present a Support Vector Machine (SVM) based method to distinguish lncRNAs from PCTs, using features based on the frequencies of nucleotide patterns and on ORF lengths in transcripts.
Methods: The proposed method is based on SVM and uses as features the relative length of the first ORF and the frequencies of nucleotide patterns selected by PCA. FASTA files were used as input to calculate all possible features. These features were divided into two sets: (i) 336 frequencies of nucleotide patterns; and (ii) 4 features derived from ORFs. PCA was applied to the first set to identify 6 groups of frequencies that could contribute most to the distinction. Twenty-four experiments, combining the 6 groups from the first set with the features from the second set, were built to find the best model to distinguish lncRNAs from PCTs.
Results: The method was trained and tested with human (Homo sapiens), mouse (Mus musculus) and zebrafish (Danio rerio) data, achieving 98.21%, 98.03% and 96.09% accuracy, respectively. Compared with other tools available in the literature (CPAT, CPC, iSeeRNA, lncRNApred, lncRScan-SVM and FEELnc), it improved accuracy by ≈ 3.00%. In addition, to validate the model, the mouse data were classified with the human model, and vice versa, achieving ≈ 97.80% accuracy in both cases, showing that the model is not overfit. The SVM models were also validated with data from rat (Rattus norvegicus), pig (Sus scrofa) and fruit fly (Drosophila melanogaster), obtaining more than 84.00% accuracy in all of these organisms. The results also showed that 81.2% of human pseudogenes and 91.7% of mouse pseudogenes were classified as non-coding. Moreover, the method was able to re-annotate two uncharacterized sequences of the Swiss-Prot database as having a high probability of being lncRNAs. Finally, to assess the method for annotating transcripts derived from RNA-seq, previously identified lncRNAs of human, gorilla (Gorilla gorilla) and rhesus macaque (Macaca mulatta) were analyzed, and 98.62%, 80.8% and 91.9% of them, respectively, were classified successfully.
Conclusions: The SVM method proposed in this work distinguishes lncRNAs from PCTs with high performance, as shown in the results. To build the model, besides using ORF features known in the literature, we used PCA to identify the nucleotide pattern frequencies that contribute most to distinguishing lncRNAs from PCTs in reference data sets. Interestingly, models created with two evolutionarily distant species could distinguish lncRNAs of even more distant species.
eng
Springer Nature
Computational Biology; Molecular Sequence Annotation; Mice; Open Reading Frames; RNA, Messenger; RNA, Untranslated; Zebrafish; Support Vector Machine; Long non-coding RNA (lncRNA); Machine learning; Principal component analysis (PCA); Support vector machine (SVM); lncRNA prediction with nucleotide pattern frequencies and ORF length
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/article
info:eu-repo/semantics/openAccess
reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
LICENSE license.txt text/plain; charset=utf-8 3074 https://www.arca.fiocruz.br/bitstream/icict/42741/1/license.txt d3e717dbb24bfc607ede047f44d29a0e MD5 1
ORIGINAL ve_Taina_Raiol_etal.pdf application/pdf 1630576 https://www.arca.fiocruz.br/bitstream/icict/42741/2/ve_Taina_Raiol_etal.pdf 32022f8505f26c5e63c677c552d5526f MD5 2
TEXT ve_Taina_Raiol_etal.pdf.txt Extracted text text/plain 57112 https://www.arca.fiocruz.br/bitstream/icict/42741/3/ve_Taina_Raiol_etal.pdf.txt bedb9d211726162c82e6f1f0742023cc MD5 3
icict/42741 2020-08-14 02:09:33.353 oai:www.arca.fiocruz.br:icict/42741
Repositório Institucional PUB https://www.arca.fiocruz.br/oai/request repositorio.arca@fiocruz.br opendoar:2135 2020-08-14T05:09:33 Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ) false
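The two feature families named in the abstract (nucleotide pattern frequencies and the relative length of the first ORF) can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions: the record does not say which patterns make up the 336 frequencies, so the sketch uses all 2-, 3- and 4-mers over A, C, G, T (16 + 64 + 256 = 336), which matches the stated count. "First ORF" is taken here to mean the region from the first ATG to the first in-frame stop codon, and the function names `kmer_frequencies` and `first_orf_relative_length` are hypothetical.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def kmer_frequencies(seq, ks=(2, 3, 4)):
    """Relative frequency of every k-mer over ACGT for each k in ks.

    Assumption: the 336 pattern frequencies are all 2-, 3- and 4-mers.
    Windows containing other symbols (e.g. N) are skipped.
    """
    seq = seq.upper().replace("U", "T")
    features = {}
    for k in ks:
        kmers = ["".join(p) for p in product("ACGT", repeat=k)]
        counts = dict.fromkeys(kmers, 0)
        total = 0
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            if window in counts:
                counts[window] += 1
                total += 1
        for kmer in kmers:
            features[kmer] = counts[kmer] / total if total else 0.0
    return features


def first_orf_relative_length(seq):
    """Length of the first ORF divided by the transcript length.

    Assumption: the first ORF starts at the first ATG and ends at the
    first in-frame stop codon (stop included); 0.0 if none is found.
    """
    seq = seq.upper().replace("U", "T")
    start = seq.find("ATG")
    if start == -1:
        return 0.0
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return (i + 3 - start) / len(seq)
    return 0.0
```

The resulting feature vectors would then be passed to PCA-based selection and an SVM classifier as the abstract describes; those steps are omitted here because the record gives no parameters for them.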
dc.title.pt_BR.fl_str_mv A support vector machine based method to distinguish long non-coding RNAs from protein coding transcripts
dc.contributor.author.fl_str_mv Schneider, Hugo W.
Raiol, Tainá
Brigido, Marcelo M.
Walter, Maria Emilia M. T.
Stadler, Peter F.
dc.subject.mesh.pt_BR.fl_str_mv Computational Biology
Molecular Sequence Annotation
Mice
Open Reading Frames
RNA, Messenger
RNA, Untranslated
Zebrafish
Support Vector Machine
dc.subject.en.pt_BR.fl_str_mv Long non-coding RNA (lncRNA)
Machine learning
Principal component analysis (PCA)
Support vector machine (SVM)
lncRNA prediction with nucleotide pattern frequencies and ORF length
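dc.subject.decs.pt_BR.fl_str_mv Biologia Computacional (Computational Biology)
Camundongos (Mice)
Anotação de Sequência Molecular (Molecular Sequence Annotation)
Fases de Leitura Aberta (Open Reading Frames)
RNA Mensageiro (RNA, Messenger)
RNA não Traduzido (RNA, Untranslated)
Peixe-Zebra (Zebrafish)
Máquina de Vetores de Suporte (Support Vector Machine)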
description Cnpq
publishDate 2017
dc.date.issued.fl_str_mv 2017
dc.date.accessioned.fl_str_mv 2020-08-13T11:44:31Z
dc.date.available.fl_str_mv 2020-08-13T11:44:31Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.citation.fl_str_mv SCHNEIDER, Hugo W. et al. A support vector machine based method to distinguish long non-coding RNAs from protein coding transcripts. BMC Genomics, [London], v. 18, n. 804, p.1-14, 2017.
dc.identifier.uri.fl_str_mv https://www.arca.fiocruz.br/handle/icict/42741
dc.identifier.issn.pt_BR.fl_str_mv 1471-2164
dc.identifier.doi.none.fl_str_mv 10.1186/s12864-017-4178-4
url https://www.arca.fiocruz.br/handle/icict/42741
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Springer Nature
dc.source.none.fl_str_mv reponame:Repositório Institucional da FIOCRUZ (ARCA)
instname:Fundação Oswaldo Cruz (FIOCRUZ)
instacron:FIOCRUZ
instname_str Fundação Oswaldo Cruz (FIOCRUZ)
instacron_str FIOCRUZ
institution FIOCRUZ
reponame_str Repositório Institucional da FIOCRUZ (ARCA)
collection Repositório Institucional da FIOCRUZ (ARCA)
bitstream.url.fl_str_mv https://www.arca.fiocruz.br/bitstream/icict/42741/1/license.txt
https://www.arca.fiocruz.br/bitstream/icict/42741/2/ve_Taina_Raiol_etal.pdf
https://www.arca.fiocruz.br/bitstream/icict/42741/3/ve_Taina_Raiol_etal.pdf.txt
bitstream.checksum.fl_str_mv d3e717dbb24bfc607ede047f44d29a0e
32022f8505f26c5e63c677c552d5526f
bedb9d211726162c82e6f1f0742023cc
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
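The three MD5 checksums above correspond, in order, to the three bitstream URLs (license.txt, the PDF, and the extracted text). A downloaded copy could be checked against them with a short Python sketch like the following. The helper names `md5_of_file` and `matches_record` are hypothetical, and the local path is whatever the reader saved the file as.

```python
import hashlib

# Checksums as listed in this record, keyed by bitstream file name.
EXPECTED_MD5 = {
    "license.txt": "d3e717dbb24bfc607ede047f44d29a0e",
    "ve_Taina_Raiol_etal.pdf": "32022f8505f26c5e63c677c552d5526f",
    "ve_Taina_Raiol_etal.pdf.txt": "bedb9d211726162c82e6f1f0742023cc",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, bitstream_name):
    """True if the local file's MD5 equals the checksum in this record."""
    return md5_of_file(path) == EXPECTED_MD5[bitstream_name]
```

For example, `matches_record("downloads/ve_Taina_Raiol_etal.pdf", "ve_Taina_Raiol_etal.pdf")` would report whether a locally saved PDF matches the repository's copy.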
repository.name.fl_str_mv Repositório Institucional da FIOCRUZ (ARCA) - Fundação Oswaldo Cruz (FIOCRUZ)
repository.mail.fl_str_mv repositorio.arca@fiocruz.br
_version_ 1798324935317258240