Block size, parallelism and predictive performance: finding the sweet spot in distributed learning

Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo

Bibliographic details
Main author:	Oliveira, Filipe
Publication date:	2023
Other authors:	Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	https://hdl.handle.net/1822/89520
Abstract:	As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.

Item metadata

id	RCAP_09e4aa942e885fc128acd34f05f4618d
oai_identifier_str	oai:repositorium.sdum.uminho.pt:1822/89520
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
spelling	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning | Distributed file system | Distributed machine learning | Hadoop | Machine learning | As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size. | This work was supported by FCT – Fundação para a Ciência e Tecnologia within projects UIDB/04728/2020, EXPL/CCI-COM/0706/2021 and CPCA/IAC/AV/475278/2022 | Taylor & Francis | Universidade do Minho | Oliveira, Filipe | Carneiro, Davide Rua | Guimarães, Miguel | Oliveira, Óscar | Novais, Paulo | 2023-06 | 2023-06-01T00:00:00Z | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/article | application/pdf | https://hdl.handle.net/1822/89520 | eng | 1744-5760 | 1744-5779 | 10.1080/17445760.2023.2225854 | https://www.tandfonline.com/doi/full/10.1080/17445760.2023.2225854 | info:eu-repo/semantics/openAccess | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) | instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | instacron:RCAAP | 2024-05-11T05:19:37Z | oai:repositorium.sdum.uminho.pt:1822/89520 | Portal Agregador | ONG | https://www.rcaap.pt/oai/openaire | mluisa.alvim@gmail.com | opendoar:7160 | 2024-05-11T05:19:37 | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação | false
dc.title.none.fl_str_mv	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
title	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
spellingShingle	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning; Oliveira, Filipe; Distributed file system; Distributed machine learning; Hadoop; Machine learning
title_short	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
title_full	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
title_fullStr	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
title_full_unstemmed	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
title_sort	Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
author	Oliveira, Filipe
author_facet	Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo
author_role	author
author2	Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo
author2_role	author author author author
dc.contributor.none.fl_str_mv	Universidade do Minho
dc.contributor.author.fl_str_mv	Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo
dc.subject.por.fl_str_mv	Distributed file system; Distributed machine learning; Hadoop; Machine learning
topic	Distributed file system; Distributed machine learning; Hadoop; Machine learning
description	As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.
publishDate	2023
dc.date.none.fl_str_mv	2023-06 2023-06-01T00:00:00Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://hdl.handle.net/1822/89520
url	https://hdl.handle.net/1822/89520
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	1744-5760 1744-5779 10.1080/17445760.2023.2225854 https://www.tandfonline.com/doi/full/10.1080/17445760.2023.2225854
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Taylor & Francis
publisher.none.fl_str_mv	Taylor & Francis
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv	mluisa.alvim@gmail.com
_version_	1817544585443803136
