Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures
Main author: Marques, José
Publication date: 2018
Other authors: Falcao, Gabriel; Alexandre, Luís
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10400.6/8141
Abstract: Convolutional Neural Networks (CNNs) have shown to be powerful classification tools in tasks that range from check reading to medical diagnosis, reaching close to human perception, and in some cases surpassing it. However, the problems to solve are becoming larger and more complex, which translates to larger CNNs, leading to longer training times (the computationally complex part) that not even the adoption of Graphics Processing Units (GPUs) could keep up with. This problem is partially solved by using more processing units and distributed training methods offered by several frameworks dedicated to neural network training, such as Caffe, Torch or TensorFlow. However, these techniques do not take full advantage of the possible parallelization offered by CNNs and the cooperative use of heterogeneous devices with different processing capabilities, clock speeds, memory sizes, among others. This paper presents a new method for the parallel training of CNNs that can be considered a particular instantiation of model parallelism, where only the convolutional layer is distributed. In fact, the convolutions processed during training (forward and backward propagation included) represent 60-90% of global processing time. The paper analyzes the influence of network size, bandwidth, batch size, number of devices, including their processing capabilities, and other parameters. Results show that this technique is capable of diminishing the training time without affecting the classification performance for both CPUs and GPUs. For the CIFAR-10 dataset, using a CNN with two convolutional layers with 500 and 1500 kernels, respectively, the best speedups reach 3.28 using four CPUs and 2.45 with three GPUs. Modern imaging datasets, larger and more complex than CIFAR-10, will certainly spend more than 60-90% of the processing time on convolutions, and speedups will tend to increase accordingly.
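The model-parallel scheme summarized in the abstract (distributing only the convolutional layer, since convolutions account for 60-90% of training time) can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: PyTorch, the `SplitConv2d` class, and the capability-proportional channel split are assumptions used purely to show how the kernels of one convolutional layer could be partitioned across heterogeneous CPU/GPU devices.

```python
# Illustrative sketch only (assumed API, not the paper's code): one convolutional
# layer whose output kernels are partitioned across heterogeneous devices.
import torch
import torch.nn as nn

class SplitConv2d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, devices, capabilities):
        super().__init__()
        # Give each device a share of the output kernels proportional to its
        # (user-estimated) relative processing capability.
        total = sum(capabilities)
        shares = [max(1, round(out_channels * c / total)) for c in capabilities]
        shares[-1] = out_channels - sum(shares[:-1])  # make the shares sum exactly
        self.devices = devices
        self.parts = nn.ModuleList(
            nn.Conv2d(in_channels, s, kernel_size, padding=kernel_size // 2).to(d)
            for s, d in zip(shares, devices)
        )

    def forward(self, x):
        # Broadcast the input batch to every device, compute the partial
        # convolutions, then gather and concatenate along the channel axis.
        outs = [conv(x.to(d)) for conv, d in zip(self.parts, self.devices)]
        return torch.cat([o.to(self.devices[0]) for o in outs], dim=1)

# Example: 500 kernels (as in the paper's first layer) split across a CPU and,
# if available, one GPU assumed to be roughly 4x faster.
devices = ["cpu"] + (["cuda:0"] if torch.cuda.is_available() else [])
caps = [1.0] + ([4.0] if torch.cuda.is_available() else [])
layer = SplitConv2d(3, 500, 3, devices, caps)
y = layer(torch.randn(8, 3, 32, 32))   # CIFAR-10-sized batch
print(y.shape)                          # torch.Size([8, 500, 32, 32])
```

Partitioning output kernels keeps each partial convolution independent, so the per-layer communication reduces to broadcasting the input batch and gathering the partial feature maps, which is why bandwidth, batch size and the number and capability of devices are the parameters analyzed in the paper.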
id: RCAP_a98e1a320abb79166308751ac5aa0ba2
oai_identifier_str: oai:ubibliorum.ubi.pt:10400.6/8141
network_acronym_str: RCAP
network_name_str: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str: 7160
dc.title.none.fl_str_mv: Distributed Learning of CNNs on Heterogeneous CPU/GPU Architectures
dc.contributor.none.fl_str_mv: uBibliorum
dc.contributor.author.fl_str_mv: Marques, José; Falcao, Gabriel; Alexandre, Luís
dc.date.none.fl_str_mv: 2018; 2018-01-01T00:00:00Z; 2020-01-09T09:45:41Z
dc.type.status.fl_str_mv: info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv: info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv: http://hdl.handle.net/10400.6/8141
dc.language.iso.fl_str_mv: eng
dc.relation.none.fl_str_mv: 10.1080/08839514.2018.1508814
dc.rights.driver.fl_str_mv: info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv: application/pdf
dc.source.none.fl_str_mv: reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
repository.name.fl_str_mv: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv:
_version_: 1799136379744026624