Block size, parallelism and predictive performance: finding the sweet spot in distributed learning

Bibliographic details
Main author: Oliveira, Filipe
Publication date: 2023
Other authors: Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: https://hdl.handle.net/1822/89520
Abstract: As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.
id RCAP_09e4aa942e885fc128acd34f05f4618d
oai_identifier_str oai:repositorium.sdum.uminho.pt:1822/89520
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
Funding: This work was supported by FCT – Fundação para a Ciência e Tecnologia within projects UIDB/04728/2020, EXPL/CCI-COM/0706/2021 and CPCA/IAC/AV/475278/2022.
dc.contributor.none.fl_str_mv Universidade do Minho
dc.subject.por.fl_str_mv Distributed file system
Distributed machine learning
Hadoop
Machine learning
publishDate 2023
dc.date.none.fl_str_mv 2023-06
2023-06-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.relation.none.fl_str_mv 1744-5760
1744-5779
10.1080/17445760.2023.2225854
https://www.tandfonline.com/doi/full/10.1080/17445760.2023.2225854
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Taylor & Francis
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP