A machine learning based framework to identify and classify long terminal repeat retrotransposons

Bibliographic details
Main author: Schietgat, Leander
Publication date: 2018
Other authors: Vens, Celine, Cerri, Ricardo, Fischer, Carlos N. [UNESP], Costa, Eduardo, Ramon, Jan, Carareto, Claudia M. A. [UNESP], Blockeel, Hendrik
Document type: Article
Language: eng
Source title: Repositório Institucional da UNESP
Full text: http://dx.doi.org/10.1371/journal.pcbi.1006097
http://hdl.handle.net/11449/176256
Abstract: Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques; it outperforms them in terms of predictive performance while learning models and making predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated those TE-Learner predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE.
id UNSP_bb8300f4990618e048816d382181fca2
oai_identifier_str oai:repositorio.unesp.br:11449/176256
network_acronym_str UNSP
network_name_str Repositório Institucional da UNESP
repository_id_str 2946
spelling Department of Computer Science, KU Leuven; Department of Public Health and Primary Care, KU Leuven Kulak; Department of Respiratory Medicine, Ghent University and VIB Inflammation Research Center; Department of Computer Science, UFSCar Federal University of São Carlos; Department of Statistics, Applied Mathematics and Computer Science, UNESP São Paulo State University; Instituto de Ciências Matemáticas e de Computação, Universidade de São Paulo; INRIA Lille Nord Europe, 40 avenue Halley; Department of Biology, UNESP São Paulo State University, São José do Rio Preto
dc.title.none.fl_str_mv A machine learning based framework to identify and classify long terminal repeat retrotransposons
title A machine learning based framework to identify and classify long terminal repeat retrotransposons
author Schietgat, Leander
dc.contributor.none.fl_str_mv KU Leuven
KU Leuven Kulak
Ghent University and VIB Inflammation Research Center
Universidade Federal de São Carlos (UFSCar)
Universidade Estadual Paulista (Unesp)
Universidade de São Paulo (USP)
INRIA Lille Nord Europe
dc.contributor.author.fl_str_mv Schietgat, Leander
Vens, Celine
Cerri, Ricardo
Fischer, Carlos N. [UNESP]
Costa, Eduardo
Ramon, Jan
Carareto, Claudia M. A. [UNESP]
Blockeel, Hendrik
publishDate 2018
dc.date.none.fl_str_mv 2018-12-11T17:19:49Z
2018-12-11T17:19:49Z
2018-04-01
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://dx.doi.org/10.1371/journal.pcbi.1006097
PLoS Computational Biology, v. 14, n. 4, 2018.
1553-7358
1553-734X
http://hdl.handle.net/11449/176256
10.1371/journal.pcbi.1006097
2-s2.0-85046367727
2-s2.0-85046367727.pdf
3425772998319216
0000-0002-0298-1354
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv PLoS Computational Biology
3,097
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv Scopus
reponame:Repositório Institucional da UNESP
instname:Universidade Estadual Paulista (UNESP)
instacron:UNESP
repository.name.fl_str_mv Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP)