Efficient computation of the matrix square root in heterogeneous platforms

Costa, Pedro Filipe Araújo

Efficient computation of the matrix square root in heterogeneous platforms

Detalhes bibliográficos
Autor(a) principal:	Costa, Pedro Filipe Araújo
Data de Publicação:	2013
Tipo de documento:	Dissertação
Idioma:	eng
Título da fonte:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Texto Completo:	http://hdl.handle.net/1822/27850
Resumo:	Dissertação de mestrado em Engenharia Informática

Metadados do item

id	RCAP_09f44449556b26236bacb0a5f0dd8e51
oai_identifier_str	oai:repositorium.sdum.uminho.pt:1822/27850
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Efficient computation of the matrix square root in heterogeneous platforms681.3.02681.325.59519.612Dissertação de mestrado em Engenharia InformáticaMatrix algorithms often deal with large amounts of data at a time, which impairs efficient cache memory usage. Recent collaborative work between the Numerical Algorithms Group and the University of Minho led to a blocked approach to the matrix square root algorithm with significant efficiency improvements, particularly in a multicore shared memory environment. Distributed memory architectures were left unexplored. In these systems data is distributed across multiple memory spaces, including those associated with specialized accelerator devices, such as GPUs. Systems with these devices are known as heterogeneous platforms. This dissertation focuses on studying the blocked matrix square root algorithm, first in a multicore environment, and then in heterogeneous platforms. Two types of hardware accelerators are explored: Intel Xeon Phi coprocessors and NVIDIA CUDA-enabled GPUs. The initial implementation confirmed the advantages of the blocked method and showed excellent scalability in a multicore environment. The same implementation was also used in the Intel Xeon Phi, but the obtained performance results lagged behind the expected behaviour and the CPU-only alternative. Several optimizations techniques were applied to the common implementation, which managed to reduce the gap between the two environments. The implementation for CUDA-enabled devices followed a different programming model and was not able to benefit from any of the previous solutions. It also required the implementation of BLAS and LAPACK routines, since no existing package fits the requirements of this application. The measured performance also showed that the CPU-only implementation is still the fastest.Algoritmos de matrizes lidam regularmente com grandes quantidades de dados ao mesmo tempo, o que dificulta uma utilização eficiente da cache. Um trabalho recente de colaboração entre o Numerical Algorithms Group e a Universidade do Minho levou a uma abordagem por blocos para o algoritmo da raíz quadrada de uma matriz com melhorias de eficiência significativas, particularmente num ambiente multicore de memória partilhada. Arquiteturas de memória distribuída permaneceram inexploradas. Nestes sistemas os dados são distribuídos por diversos espaços de memória, incluindo aqueles associados a dispositivos aceleradores especializados, como GPUs. Sistemas com estes dispositivos são conhecidos como plataformas heterogéneas. Esta dissertação foca-se em estudar o algoritmo da raíz quadrada de uma matriz por blocos, primeiro num ambiente multicore e depois usando plataformas heterogéneas. Dois tipos de aceleradores são explorados: co-processadores Intel Xeon Phi e GPUs NVIDIA habilitados para CUDA. A implementação inicial confirmou as vantagens do método por blocos e mostrou uma escalabilidade excelente num ambiente multicore. A mesma implementação foi ainda usada para o Intel Xeon Phi, mas os resultados de performance obtidos ficaram aquém do comportamento esperado e da alternativa usando apenas CPUs. Várias otimizações foram aplicadas a esta implementação comum, conseguindo reduzir a diferença entre os dois ambientes. A implementação para dispositivos CUDA seguiu um modelo de programação diferente e não pôde beneficiar the nenhuma das soluções anteriores. Também exigiu a implementação de rotinas BLAS e LAPACK, já que nenhum dos pacotes existentes se adequa aos requisitos desta implementação. A performance medida também mostrou que a alternativa usando apenas CPUs ainda é a mais rápida.Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin \| PortugalProença, Alberto JoséRalha, RuiUniversidade do MinhoCosta, Pedro Filipe Araújo20132013-01-01T00:00:00Zinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttp://hdl.handle.net/1822/27850eng201195771info:eu-repo/semantics/openAccessreponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãoinstacron:RCAAP2023-07-21T12:23:39Zoai:repositorium.sdum.uminho.pt:1822/27850Portal AgregadorONGhttps://www.rcaap.pt/oai/openaireopendoar:71602024-03-19T19:17:27.309371Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informaçãofalse
dc.title.none.fl_str_mv	Efficient computation of the matrix square root in heterogeneous platforms
title	Efficient computation of the matrix square root in heterogeneous platforms
spellingShingle	Efficient computation of the matrix square root in heterogeneous platforms Costa, Pedro Filipe Araújo 681.3.02 681.325.59 519.612
title_short	Efficient computation of the matrix square root in heterogeneous platforms
title_full	Efficient computation of the matrix square root in heterogeneous platforms
title_fullStr	Efficient computation of the matrix square root in heterogeneous platforms
title_full_unstemmed	Efficient computation of the matrix square root in heterogeneous platforms
title_sort	Efficient computation of the matrix square root in heterogeneous platforms
author	Costa, Pedro Filipe Araújo
author_facet	Costa, Pedro Filipe Araújo
author_role	author
dc.contributor.none.fl_str_mv	Proença, Alberto José Ralha, Rui Universidade do Minho
dc.contributor.author.fl_str_mv	Costa, Pedro Filipe Araújo
dc.subject.por.fl_str_mv	681.3.02 681.325.59 519.612
topic	681.3.02 681.325.59 519.612
description	Dissertação de mestrado em Engenharia Informática
publishDate	2013
dc.date.none.fl_str_mv	2013 2013-01-01T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/1822/27850
url	http://hdl.handle.net/1822/27850
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	201195771
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799132626427052032

Efficient computation of the matrix square root in heterogeneous platforms

Registros relacionados