Using precision reduction to efficiently improve mixed-precision GPUs reliability

Bibliographic details
Main author: Acosta, Gerônimo Veit
Publication date: 2021
Document type: Undergraduate thesis (Trabalho de conclusão de curso)
Language: por
Source title: Repositório Institucional da UFRGS
Full text: http://hdl.handle.net/10183/224241
Abstract: Duplication With Comparison (DWC) is a traditional, widely accepted method for improving system reliability. DWC duplicates critical regions at the software or hardware level, creating redundant operations in order to decrease the probability of an unwanted event. However, this technique introduces expensive overheads in power consumption, processing time, and resource allocation, because the critical operations are computed at least twice. Reduced Precision Duplication With Comparison (RP-DWC) is an effective software-level solution that improves on the performance of conventional DWC. RP-DWC aims to mitigate these overheads by enabling parallel processing on underused Floating Point Units (FPUs) in mixed-precision Graphics Processing Units (GPUs). By using precision reduction to efficiently improve reliability in mixed-precision GPUs, RP-DWC extends the DWC technique, introducing proper ways to handle redundancy between operations of different precisions. Improving GPU reliability is an extremely valuable challenge in the fault tolerance field, since GPUs are adopted in both High-Performance Computing (HPC) and automotive real-time applications. GPUs that operate in a natural environment, such as the surface of the Earth at sea level, are exposed to the radiation present at the Earth's surface. This exposure can be critical, because radiation particles may hit the GPU's internal circuits, corrupt sensitive data, and consequently generate undesired outputs. Introducing duplication with reduced precision in a trustworthy manner, so as to maintain reliability in safety-critical systems, is an arduous task that we propose to investigate further in this work.
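The idea summarized in the abstract can be illustrated with a minimal CPU-side sketch. It is not the thesis's implementation, which targets GPU floating point units: here binary32 arithmetic is emulated in plain Python, and the function names, the dot-product workload, the tolerance `rel_tol`, and the bit-flip fault model are all assumptions made for illustration.

```python
import struct

def to_float32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_full(a, b):
    """Primary computation, carried out in full (binary64) precision."""
    return sum(x * y for x, y in zip(a, b))

def dot_reduced(a, b):
    """Redundant copy, with every intermediate result rounded to binary32."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_float32(to_float32(x) * to_float32(y))
        acc = to_float32(acc + prod)
    return acc

def rp_dwc_check(full: float, reduced: float, rel_tol: float = 1e-4) -> bool:
    """Return True when both copies agree within the assumed tolerance.

    An exact comparison would always fail, because the two copies
    legitimately differ by the rounding error of the reduced precision.
    """
    scale = max(abs(full), abs(reduced), 1.0)
    return abs(full - reduced) <= rel_tol * scale

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the binary64 encoding of x (hypothetical fault model)."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

a = [0.1 * i for i in range(1, 9)]
b = [1.5] * 8
full = dot_full(a, b)
reduced = dot_reduced(a, b)

fault_free_ok = rp_dwc_check(full, reduced)                   # copies agree
exponent_fault_ok = rp_dwc_check(flip_bit(full, 62), reduced)  # large error
mantissa_fault_ok = rp_dwc_check(flip_bit(full, 0), reduced)   # tiny error
```

In this sketch a flip in a high exponent bit changes the result by far more than the tolerance and is flagged, while a flip in the least significant mantissa bit stays below the tolerance and passes the check. That trade-off between cheaper redundancy and the size of the errors that can be detected is the kind of issue a reduced-precision comparison has to handle.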
id UFRGS-2_8366c73b5878d53181ad29e4b2ddce08
oai_identifier_str oai:www.lume.ufrgs.br:10183/224241
network_acronym_str UFRGS-2
network_name_str Repositório Institucional da UFRGS
spelling Universidade Federal do Rio Grande do Sul. Instituto de Informática. Porto Alegre, BR-RS, 2021. Ciência da Computação: Ênfase em Engenharia da Computação: Bacharelado (graduação). Files: 001128602.pdf.txt (extracted text, text/plain, size 79487); 001128602.pdf (full text in English, application/pdf, size 1310454). Timestamp: 2021-08-18T07:33:39.
dc.contributor.author.fl_str_mv Acosta, Gerônimo Veit
dc.contributor.advisor1.fl_str_mv Rech, Paolo
dc.contributor.advisor-co1.fl_str_mv Santos, Fernando Fernandes dos
dc.subject.por.fl_str_mv Fault tolerance (Tolerancia : Falhas)
dc.subject.eng.fl_str_mv Reliability
Radiation
Duplication
DWC
RP-DWC
GPU
publishDate 2021
dc.date.accessioned.fl_str_mv 2021-07-21T04:23:43Z
dc.date.issued.fl_str_mv 2021
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/bachelorThesis
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10183/224241
dc.identifier.nrb.pt_BR.fl_str_mv 001128602
dc.language.iso.fl_str_mv por
language por
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFRGS
instname:Universidade Federal do Rio Grande do Sul (UFRGS)
instacron:UFRGS
bitstream.url.fl_str_mv http://www.lume.ufrgs.br/bitstream/10183/224241/2/001128602.pdf.txt
http://www.lume.ufrgs.br/bitstream/10183/224241/1/001128602.pdf
bitstream.checksum.fl_str_mv c77f24fe3857640be0dcb4aa89dc0906
dbcedef7bb794ab3e2d47471fcadcf30
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
_version_ 1801224609491582976