TERL: classification of transposable elements by convolutional neural networks

Bibliographic details
Main author: da Cruz, Murilo Horacio Pereira
Publication date: 2021
Other authors: Domingues, Douglas Silva [UNESP], Saito, Priscila Tiemi Maeda, Paschoal, Alexandre Rossi, Bugatti, Pedro Henrique
Document type: Article
Language: eng
Source title: Repositório Institucional da UNESP
Full text: http://dx.doi.org/10.1093/bib/bbaa185
http://hdl.handle.net/11449/221755
Abstract: Transposable elements (TEs) are the most represented sequences occurring in eukaryotic genomes. Few methods classify these sequences at deeper levels, such as the superfamily level, which could provide useful and detailed information about them. Most methods that classify TE sequences use handcrafted features such as k-mers and homology-based search, which could be inefficient for classifying non-homologous sequences. Here we propose an approach, called transposable elements representation learner (TERL), that preprocesses and transforms one-dimensional sequences into two-dimensional space data (i.e., image-like data of the sequences) and feeds them to deep convolutional neural networks. This classification method tries to learn the best representation of the input data to classify it correctly. We conducted six experiments to test the performance of TERL against other methods. Our approach obtained macro mean accuracy and F1-score of 96.4% and 85.8% for superfamilies and 95.7% and 91.5% for orders, respectively, on sequences from RepBase. We also obtained macro mean accuracy and F1-score of 95.0% and 70.6% at the superfamily level and 89.3% and 73.9% at the order level, respectively, on sequences from seven databases. TERL surpassed the accuracy, recall and specificity obtained by other methods in the experiment classifying order-level sequences from seven databases, and was by far faster than any other method in all experiments. Therefore, TERL can learn how to predict any hierarchical level of the TE classification system and is about 20 times and three orders of magnitude faster than TEclass and PASTEC, respectively. TERL is available at https://github.com/muriloHoracio/TERL. Contact: murilocruz@alunos.utfpr.edu.br.
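The abstract describes turning a one-dimensional sequence into two-dimensional, image-like data before it reaches the convolutional network. The sketch below shows one common way to do this, one-hot encoding with zero padding to a fixed width. It is an illustration under stated assumptions, not the TERL implementation: the alphabet `ACGTN`, the function name `one_hot_encode`, the `width` parameter and the truncation rule are all hypothetical choices that do not come from this record.

```python
# Hypothetical alphabet: the four bases plus N for unknown symbols.
# This choice is an assumption, not taken from the record above.
ALPHABET = "ACGTN"


def one_hot_encode(sequence, width):
    """Return a len(ALPHABET) x width matrix of 0/1 ints.

    Rows correspond to symbols, columns to sequence positions.
    """
    seq = sequence.upper()[:width]  # truncate sequences longer than width
    matrix = [[0] * width for _ in ALPHABET]
    for col, base in enumerate(seq):
        row = ALPHABET.find(base)
        if row == -1:
            row = ALPHABET.index("N")  # map any unexpected symbol to N
        matrix[row][col] = 1
    # Columns beyond len(seq) stay all-zero, which acts as padding.
    return matrix


if __name__ == "__main__":
    for row in one_hot_encode("ACGTTN", 8):
        print(row)
```

Each sequence becomes a fixed-size binary matrix, so sequences of different lengths can be batched together and scanned by convolutional filters in the same way as a single-channel image.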
id UNSP_7c203331836462deca799773117ae366
oai_identifier_str oai:repositorio.unesp.br:11449/221755
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
spelling Federal University of Technology - Parana (UTFPR)
Bioinformatics Graduation Program (PPGBIOINFO) Department of Computer Science Federal University of Technology - Parana (UTFPR)
São Paulo State University at Botucatu
University of São Paulo
Department of Biodiversity São Paulo State University at Rio Claro
Euripides Soares da Rocha University of Marilia
University of São Paulo (ICMC-USP)
University of Campinas (IC-UNICAMP)
Department of Computing Federal University of Technology - Parana (UTFPR)
dc.title.none.fl_str_mv TERL: classification of transposable elements by convolutional neural networks
author da Cruz, Murilo Horacio Pereira
author_facet da Cruz, Murilo Horacio Pereira
Domingues, Douglas Silva [UNESP]
Saito, Priscila Tiemi Maeda
Paschoal, Alexandre Rossi
Bugatti, Pedro Henrique
author_role author
author2 Domingues, Douglas Silva [UNESP]
Saito, Priscila Tiemi Maeda
Paschoal, Alexandre Rossi
Bugatti, Pedro Henrique
author2_role author
author
author
author
dc.contributor.none.fl_str_mv Federal University of Technology - Parana (UTFPR)
Universidade Estadual Paulista (UNESP)
Universidade de São Paulo (USP)
Euripides Soares da Rocha University of Marilia
Universidade Estadual de Campinas (UNICAMP)
dc.contributor.author.fl_str_mv da Cruz, Murilo Horacio Pereira
Domingues, Douglas Silva [UNESP]
Saito, Priscila Tiemi Maeda
Paschoal, Alexandre Rossi
Bugatti, Pedro Henrique
dc.subject.por.fl_str_mv convolutional neural networks
deep learning
representation learning
sequence classification
transposable elements
publishDate 2021
dc.date.none.fl_str_mv 2021-05-20
2022-04-28T19:40:16Z
2022-04-28T19:40:16Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://dx.doi.org/10.1093/bib/bbaa185
Briefings in bioinformatics, v. 22, n. 3, 2021.
1477-4054
http://hdl.handle.net/11449/221755
10.1093/bib/bbaa185
2-s2.0-85106486317
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv Briefings in bioinformatics
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv Scopus
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
repository.name.fl_str_mv Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP)