Computação de alto desempenho na seleção genômica
Main author: | Lagrotta, Marcos Rodrigues |
---|---|
Publication date: | 2012 |
Document type: | Thesis |
Language: | por |
Source title: | LOCUS Repositório Institucional da UFV |
Full text: | http://locus.ufv.br/handle/123456789/1804 |
Abstract: | Parallel computing has grown in recent years owing to the falling cost of computers and the exponential growth of databases. Parallel processing performs multiple tasks simultaneously on different processors. In genomic selection, the large number of genetic markers used in the analyses, together with the high computational demand of Bayesian models based on Markov chain Monte Carlo (MCMC) methods, means that some analyses take weeks or months to run. Parallel computing is therefore a natural solution to this problem. The method used for analysis was BayesCπ, which consists only of Gibbs sampling steps. The algorithm was first written sequentially in FORTRAN. Two parallelization strategies were then studied. The first runs multiple chains in parallel and is recommended when the burn-in is not long. The second parallelizes the chain itself and is indicated when the burn-in is very long. The MPI library, through the Open MPI package used with the gfortran compiler, was used for this purpose. The computations were performed on a personal computer with six 3.3 GHz processing cores and 16 GB of RAM, and on a cluster with 120 processors of 2.77 GHz. Simulated data for two traits of dairy cattle, comprising 10,000 markers and 4,100 individuals, were used. On the personal computer, the sequential algorithm took 77.29 hours, and running multiple chains in parallel was almost five times faster with six cores. The performance ratio between the parallel and sequential algorithms was higher on the cluster, because its memory architecture scales better with the number of processors in use than the shared memory architecture of the personal computer. The second parallelization strategy yielded a performance gain of only 19% with two processors, and using more processors brought no further improvement. This strategy applies only to systems with a shared memory architecture, because of the high overhead generated by the intense exchange of information and task synchronization. Parallel computing is therefore of fundamental importance for genomic selection, and it will become more significant in coming years owing to the rapid growth of databases. More efficient strategies for parallelizing the chain itself must be developed, because running multiple chains in parallel is not recommended when the burn-in is very long. Ideally, these new approaches would perform well on systems with a distributed memory architecture (clusters). |
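
The timings reported in the abstract can be turned into speedup and parallel efficiency figures. The sketch below is illustrative arithmetic only, not code from the thesis. The function names are hypothetical, "almost five times faster" is rounded to 5.0, the 19% gain is read as a speedup of 1.19, and the Amdahl's-law interpretation of the single-chain result is an assumption made here.

```python
"""Speedup arithmetic for the figures quoted in the abstract (illustrative only)."""

SEQUENTIAL_HOURS = 77.29      # sequential run on the personal computer (from the abstract)
MULTI_CHAIN_SPEEDUP = 5.0     # "almost five times faster" on six cores, rounded (assumption)
SINGLE_CHAIN_SPEEDUP = 1.19   # "gain of only 19%" read as a 1.19x speedup (assumption)


def speedup(sequential_time: float, parallel_time: float) -> float:
    """Ratio of sequential to parallel run time."""
    return sequential_time / parallel_time


def efficiency(speedup_value: float, n_processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    return speedup_value / n_processors


def amdahl_parallel_fraction(speedup_value: float, n_processors: int) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to estimate p."""
    return (1.0 - 1.0 / speedup_value) / (1.0 - 1.0 / n_processors)


def main() -> None:
    # Run time implied by a 5x speedup of the 77.29-hour sequential run.
    multi_chain_hours = SEQUENTIAL_HOURS / MULTI_CHAIN_SPEEDUP
    print(f"multiple chains, 6 cores: about {multi_chain_hours:.2f} h")
    print(f"  efficiency: {efficiency(MULTI_CHAIN_SPEEDUP, 6):.2f}")

    # Single-chain parallelization with two processors.
    print(f"single chain, 2 processors: efficiency "
          f"{efficiency(SINGLE_CHAIN_SPEEDUP, 2):.2f}")
    print(f"  implied parallel fraction: "
          f"{amdahl_parallel_fraction(SINGLE_CHAIN_SPEEDUP, 2):.2f}")


if __name__ == "__main__":
    main()
```

Under these assumptions, the multiple-chain strategy keeps most of each core busy, whereas the single-chain strategy behaves as if only about a third of its work ran in parallel, which is consistent with the abstract's remark about communication and synchronization overhead.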
id |
UFV_0548eda0e114db5821351acb2a2a2b3f |
---|---|
oai_identifier_str |
oai:locus.ufv.br:123456789/1804 |
network_acronym_str |
UFV |
network_name_str |
LOCUS Repositório Institucional da UFV |
repository_id_str |
2145 |
dc.title.por.fl_str_mv |
Computação de alto desempenho na seleção genômica |
dc.title.alternative.eng.fl_str_mv |
High performance computing in genomic selection |
author |
Lagrotta, Marcos Rodrigues |
author_facet |
Lagrotta, Marcos Rodrigues |
author_role |
author |
dc.contributor.authorLattes.por.fl_str_mv |
http://lattes.cnpq.br/5176630154717355 |
dc.contributor.author.fl_str_mv |
Lagrotta, Marcos Rodrigues |
dc.contributor.advisor-co1.fl_str_mv |
Torres, Robledo de Almeida |
dc.contributor.advisor-co1Lattes.fl_str_mv |
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4783366H0 |
dc.contributor.advisor1.fl_str_mv |
Euclydes, Ricardo Frederico |
dc.contributor.advisor1Lattes.fl_str_mv |
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4788533U6 |
dc.contributor.referee1.fl_str_mv |
Silva, Fabyano Fonseca e |
dc.contributor.referee1Lattes.fl_str_mv |
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4766260Z2 |
dc.contributor.referee2.fl_str_mv |
Souza, Gustavo Henrique de |
dc.contributor.referee2Lattes.fl_str_mv |
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4760298P6 |
dc.contributor.referee3.fl_str_mv |
Goulart, Carlos de Castro |
dc.contributor.referee3Lattes.fl_str_mv |
http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4784106Y9 |
contributor_str_mv |
Torres, Robledo de Almeida Euclydes, Ricardo Frederico Silva, Fabyano Fonseca e Souza, Gustavo Henrique de Goulart, Carlos de Castro |
dc.subject.por.fl_str_mv |
Seleção genômica Confiabilidade Gado de leite |
topic |
Seleção genômica Confiabilidade Gado de leite Genomic selection Trustworthiness Dairy CNPQ::CIENCIAS AGRARIAS::ZOOTECNIA::GENETICA E MELHORAMENTO DOS ANIMAIS DOMESTICOS |
dc.subject.eng.fl_str_mv |
Genomic selection Trustworthiness Dairy |
dc.subject.cnpq.fl_str_mv |
CNPQ::CIENCIAS AGRARIAS::ZOOTECNIA::GENETICA E MELHORAMENTO DOS ANIMAIS DOMESTICOS |
publishDate |
2012 |
dc.date.issued.fl_str_mv |
2012-07-27 |
dc.date.available.fl_str_mv |
2013-04-22 2015-03-26T12:54:44Z |
dc.date.accessioned.fl_str_mv |
2015-03-26T12:54:44Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
LAGROTTA, Marcos Rodrigues. High performance computing in genomic selection. 2012. 81 f. Tese (Doutorado em Genética e Melhoramento de Animais Domésticos; Nutrição e Alimentação Animal; Pastagens e Forragicul) - Universidade Federal de Viçosa, Viçosa, 2012. |
dc.identifier.uri.fl_str_mv |
http://locus.ufv.br/handle/123456789/1804 |
identifier_str_mv |
LAGROTTA, Marcos Rodrigues. High performance computing in genomic selection. 2012. 81 f. Tese (Doutorado em Genética e Melhoramento de Animais Domésticos; Nutrição e Alimentação Animal; Pastagens e Forragicul) - Universidade Federal de Viçosa, Viçosa, 2012. |
url |
http://locus.ufv.br/handle/123456789/1804 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Universidade Federal de Viçosa |
dc.publisher.program.fl_str_mv |
Doutorado em Zootecnia |
dc.publisher.initials.fl_str_mv |
UFV |
dc.publisher.country.fl_str_mv |
BR |
dc.publisher.department.fl_str_mv |
Genética e Melhoramento de Animais Domésticos; Nutrição e Alimentação Animal; Pastagens e Forragicul |
publisher.none.fl_str_mv |
Universidade Federal de Viçosa |
dc.source.none.fl_str_mv |
reponame:LOCUS Repositório Institucional da UFV instname:Universidade Federal de Viçosa (UFV) instacron:UFV |
instname_str |
Universidade Federal de Viçosa (UFV) |
instacron_str |
UFV |
institution |
UFV |
reponame_str |
LOCUS Repositório Institucional da UFV |
collection |
LOCUS Repositório Institucional da UFV |
bitstream.url.fl_str_mv |
https://locus.ufv.br//bitstream/123456789/1804/1/texto%20completo.pdf https://locus.ufv.br//bitstream/123456789/1804/2/texto%20completo.pdf.txt https://locus.ufv.br//bitstream/123456789/1804/3/texto%20completo.pdf.jpg |
bitstream.checksum.fl_str_mv |
4f9f576360bdfa9897a51ca9040659b2 11fca14a937ac53fae2e0060a45ea7f1 844db1d0e7dd7a4d04cc7470bd30b28f |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 |
repository.name.fl_str_mv |
LOCUS Repositório Institucional da UFV - Universidade Federal de Viçosa (UFV) |
repository.mail.fl_str_mv |
fabiojreis@ufv.br |
_version_ |
1801212866551873536 |