Geminivirus data warehouse: a database enriched with machine learning approaches

Bibliographic details
Main author: Silva, Jose Cleydson F.
Publication date: 2017
Other authors: Carvalho, Thales F. M., Basso, Marcos F., Deguchi, Michihito, Pereira, Welison A., Vidigal, Pedro M. P., Brustolini, Otávio J. B., Silva, Fabyano F., Dal-Bianco, Maximiller, Fontes, Renildes L. F., Santos, Anésia A., Zerbini, Francisco Murilo, Cerqueira, Fabio R., Fontes, Elizabeth P. B., R. Sobrinho, Roberto
Document type: Article
Language: eng
Source title: LOCUS Repositório Institucional da UFV
Full text: https://doi.org/10.1186/s12859-017-1646-4
http://www.locus.ufv.br/handle/123456789/12748
Abstract: The Geminiviridae family encompasses a group of single-stranded DNA viruses with twinned and quasi-isometric virions, which infect a wide range of dicotyledonous and monocotyledonous plants and are responsible for significant economic losses worldwide. Geminiviruses are divided into nine genera, according to their insect vector, host range, genome organization, and phylogeny reconstruction. Using rolling-circle amplification approaches along with high-throughput sequencing technologies, thousands of full-length geminivirus and satellite genome sequences were amplified and have become available in public databases. As a consequence, many important challenges have emerged, namely, how to classify, store, and analyze massive datasets as well as how to extract information or new knowledge. Data mining approaches, mainly supported by machine learning (ML) techniques, are a natural means for high-throughput data analysis in the context of genomics, transcriptomics, proteomics, and metabolomics. Here, we describe the development of a data warehouse enriched with ML approaches, designated geminivirus.org. We implemented search modules, bioinformatics tools, and ML methods to retrieve high-precision information, demarcate species, and create classifiers for genera and open reading frames (ORFs) of geminivirus genomes. The use of data mining techniques such as ETL (Extract, Transform, Load) to feed our database, as well as algorithms based on machine learning for knowledge extraction, allowed us to obtain a database with quality data and suitable tools for bioinformatics analysis. The Geminivirus Data Warehouse (geminivirus.org) offers a simple and user-friendly environment for information retrieval and knowledge discovery related to geminiviruses.
id UFV_0bbd679d977cd974e830475792b4e5dd
oai_identifier_str oai:locus.ufv.br:123456789/12748
network_acronym_str UFV
network_name_str LOCUS Repositório Institucional da UFV
repository_id_str 2145
dc.title.en.fl_str_mv Geminivirus data warehouse: a database enriched with machine learning approaches
dc.contributor.author.fl_str_mv Silva, Jose Cleydson F.
Carvalho, Thales F. M.
Basso, Marcos F.
Deguchi, Michihito
Pereira, Welison A.
Vidigal, Pedro M. P.
Brustolini, Otávio J. B.
Silva, Fabyano F.
Dal-Bianco, Maximiller
Fontes, Renildes L. F.
Santos, Anésia A.
Zerbini, Francisco Murilo
Cerqueira, Fabio R.
Fontes, Elizabeth P. B.
R. Sobrinho, Roberto
dc.subject.pt-BR.fl_str_mv Machine learning
Knowledge discovery
Data mining
Geminivirus
Data warehouse
Random forest
publishDate 2017
dc.date.accessioned.fl_str_mv 2017-11-06T09:22:19Z
dc.date.available.fl_str_mv 2017-11-06T09:22:19Z
dc.date.issued.fl_str_mv 2017-05-05
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://doi.org/10.1186/s12859-017-1646-4
http://www.locus.ufv.br/handle/123456789/12748
dc.identifier.issn.none.fl_str_mv 1471-2105
dc.language.iso.fl_str_mv eng
language eng
dc.relation.ispartofseries.pt-BR.fl_str_mv v. 18, n. 240, May. 2017
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv BioMed Central Bioinformatics
dc.source.none.fl_str_mv reponame:LOCUS Repositório Institucional da UFV
instname:Universidade Federal de Viçosa (UFV)
instacron:UFV
bitstream.url.fl_str_mv https://locus.ufv.br//bitstream/123456789/12748/1/document.pdf
https://locus.ufv.br//bitstream/123456789/12748/2/license.txt
https://locus.ufv.br//bitstream/123456789/12748/3/document.pdf.jpg
bitstream.checksum.fl_str_mv 63e53fc4113f92e165abb0ac2941d82f
8a4605be74aa9ea9d79846c1fba20a33
08e496d98aefd4a1d7dfd153e7f2ad82
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
repository.name.fl_str_mv LOCUS Repositório Institucional da UFV - Universidade Federal de Viçosa (UFV)
repository.mail.fl_str_mv fabiojreis@ufv.br
_version_ 1801213058027094016