A machine learning based framework to identify and classify long terminal repeat retrotransposons
Main author: | Schietgat, Leander |
---|---|
Publication date: | 2018 |
Other authors: | Vens, Celine; Cerri, Ricardo; Fischer, Carlos N. [UNESP]; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M. A. [UNESP]; Blockeel, Hendrik |
Document type: | Article |
Language: | eng |
Source title: | Repositório Institucional da UNESP |
Full text: | http://dx.doi.org/10.1371/journal.pcbi.1006097 http://hdl.handle.net/11449/176256 |
Abstract: | Transposable elements (TEs) are repetitive nucleotide sequences that make up a large portion of eukaryotic genomes. They can move and duplicate within a genome, increasing genome size and contributing to genetic diversity within and across species. Accurate identification and classification of the TEs present in a genome is an important step towards understanding their effects on genes and their role in genome evolution. We introduce TE-Learner, a framework based on machine learning that automatically identifies TEs in a given genome and assigns a classification to them. We present an implementation of our framework for LTR retrotransposons, a particular type of TE characterized by having long terminal repeats (LTRs) at their boundaries. We evaluate the predictive performance of our framework on the well-annotated genomes of Drosophila melanogaster and Arabidopsis thaliana, and we compare our results for three LTR retrotransposon superfamilies with the results of three widely used methods for TE identification or classification: RepeatMasker, Censor and LtrDigest. In contrast to these methods, TE-Learner is the first to incorporate machine learning techniques, outperforming them in terms of predictive performance while being able to learn models and make predictions efficiently. Moreover, we show that our method was able to identify TEs that none of the above methods could find, and we investigated TE-Learner’s predictions that did not correspond to an official annotation. It turns out that many of these predictions are in fact strongly homologous to a known TE. |
id |
UNSP_bb8300f4990618e048816d382181fca2 |
---|---|
oai_identifier_str |
oai:repositorio.unesp.br:11449/176256 |
network_acronym_str |
UNSP |
network_name_str |
Repositório Institucional da UNESP |
repository_id_str |
2946 |
spelling |
A machine learning based framework to identify and classify long terminal repeat retrotransposons; Department of Computer Science KU Leuven; Department of Public Health and Primary Care KU Leuven Kulak; Department of Respiratory Medicine Ghent University and VIB Inflammation Research Center; Department of Computer Science UFSCar Federal University of São Carlos; Department of Statistics Applied Mathematics and Computer Science UNESP São Paulo State University; Instituto de Ciências Matemáticas e de Computação Universidade de São Paulo; INRIA Lille Nord Europe, 40 avenue Halley; Department of Biology UNESP São Paulo State University São José do Rio Preto; Department of Statistics Applied Mathematics and Computer Science UNESP São Paulo State University; Department of Biology UNESP São Paulo State University São José do Rio Preto; KU Leuven; KU Leuven Kulak; Ghent University and VIB Inflammation Research Center; Universidade Federal de São Carlos (UFSCar); Universidade Estadual Paulista (Unesp); Universidade de São Paulo (USP); INRIA Lille Nord Europe; Schietgat, Leander; Vens, Celine; Cerri, Ricardo; Fischer, Carlos N. [UNESP]; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M. A. [UNESP]; Blockeel, Hendrik; 2018-12-11T17:19:49Z; 2018-12-11T17:19:49Z; 2018-04-01; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; http://dx.doi.org/10.1371/journal.pcbi.1006097; PLoS Computational Biology, v. 14, n. 4, 2018.; 1553-7358; 1553-734X; http://hdl.handle.net/11449/176256; 10.1371/journal.pcbi.1006097; 2-s2.0-85046367727; 2-s2.0-85046367727.pdf; 3425772998319216; 0000-0002-0298-1354; Scopus; reponame:Repositório Institucional da UNESP; instname:Universidade Estadual Paulista (UNESP); instacron:UNESP; eng; PLoS Computational Biology; 3,097; info:eu-repo/semantics/openAccess; 2024-01-28T06:46:36Z; oai:repositorio.unesp.br:11449/176256; Repositório Institucional; PUB; http://repositorio.unesp.br/oai/request; opendoar:2946; 2024-08-06T00:07:45.001816; Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP); false |
dc.title.none.fl_str_mv |
A machine learning based framework to identify and classify long terminal repeat retrotransposons |
author |
Schietgat, Leander |
author_facet |
Schietgat, Leander; Vens, Celine; Cerri, Ricardo; Fischer, Carlos N. [UNESP]; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M. A. [UNESP]; Blockeel, Hendrik |
author_role |
author |
author2 |
Vens, Celine; Cerri, Ricardo; Fischer, Carlos N. [UNESP]; Costa, Eduardo; Ramon, Jan; Carareto, Claudia M. A. [UNESP]; Blockeel, Hendrik |
author2_role |
author author author author author author author |
dc.contributor.none.fl_str_mv |
KU Leuven; KU Leuven Kulak; Ghent University and VIB Inflammation Research Center; Universidade Federal de São Carlos (UFSCar); Universidade Estadual Paulista (Unesp); Universidade de São Paulo (USP); INRIA Lille Nord Europe |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018-12-11T17:19:49Z 2018-12-11T17:19:49Z 2018-04-01 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://dx.doi.org/10.1371/journal.pcbi.1006097 PLoS Computational Biology, v. 14, n. 4, 2018. 1553-7358 1553-734X http://hdl.handle.net/11449/176256 10.1371/journal.pcbi.1006097 2-s2.0-85046367727 2-s2.0-85046367727.pdf 3425772998319216 0000-0002-0298-1354 |
url |
http://dx.doi.org/10.1371/journal.pcbi.1006097 http://hdl.handle.net/11449/176256 |
identifier_str_mv |
PLoS Computational Biology, v. 14, n. 4, 2018. 1553-7358 1553-734X 10.1371/journal.pcbi.1006097 2-s2.0-85046367727 2-s2.0-85046367727.pdf 3425772998319216 0000-0002-0298-1354 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
PLoS Computational Biology 3,097 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
Scopus reponame:Repositório Institucional da UNESP instname:Universidade Estadual Paulista (UNESP) instacron:UNESP |
instname_str |
Universidade Estadual Paulista (UNESP) |
instacron_str |
UNESP |
institution |
UNESP |
reponame_str |
Repositório Institucional da UNESP |
collection |
Repositório Institucional da UNESP |
repository.name.fl_str_mv |
Repositório Institucional da UNESP - Universidade Estadual Paulista (UNESP) |