Block size, parallelism and predictive performance: finding the sweet spot in distributed learning
Main author: | Oliveira, Filipe |
---|---|
Publication date: | 2023 |
Other authors: | Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | https://hdl.handle.net/1822/89520 |
Abstract: | As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size. |
id |
RCAP_09e4aa942e885fc128acd34f05f4618d |
---|---|
oai_identifier_str |
oai:repositorium.sdum.uminho.pt:1822/89520 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Block size, parallelism and predictive performance: finding the sweet spot in distributed learning; Distributed file system; Distributed machine learning; Hadoop; Machine learning; As distributed and multi-organization Machine Learning emerges, new challenges must be solved, such as diverse and low-quality data or real-time delivery. In this paper, we use a distributed learning environment to analyze the relationship between block size, parallelism, and predictor quality. Specifically, the goal is to find the optimum block size and the best heuristic to create distributed Ensembles. We evaluated three different heuristics and five block sizes on four publicly available datasets. Results show that using fewer but better base models matches or outperforms a standard Random Forest, and that 32 MB is the best block size.; This work was supported by FCT – Fundação para a Ciência e Tecnologia within projects UIDB/04728/2020, EXPL/CCI-COM/0706/2021 and CPCA/IAC/AV/475278/2022; Taylor & Francis; Universidade do Minho; Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo; 2023-06; 2023-06-01T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/article; application/pdf; https://hdl.handle.net/1822/89520; eng; 1744-5760; 1744-5779; 10.1080/17445760.2023.2225854; https://www.tandfonline.com/doi/full/10.1080/17445760.2023.2225854; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-05-11T05:19:37Z; oai:repositorium.sdum.uminho.pt:1822/89520; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; mluisa.alvim@gmail.com; opendoar:7160; 2024-05-11T05:19:37; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Block size, parallelism and predictive performance: finding the sweet spot in distributed learning |
author |
Oliveira, Filipe |
author_facet |
Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo |
author_role |
author |
author2 |
Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo |
author2_role |
author author author author |
dc.contributor.none.fl_str_mv |
Universidade do Minho |
dc.contributor.author.fl_str_mv |
Oliveira, Filipe; Carneiro, Davide Rua; Guimarães, Miguel; Oliveira, Óscar; Novais, Paulo |
dc.subject.por.fl_str_mv |
Distributed file system; Distributed machine learning; Hadoop; Machine learning |
topic |
Distributed file system; Distributed machine learning; Hadoop; Machine learning |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-06; 2023-06-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/article |
format |
article |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/1822/89520 |
url |
https://hdl.handle.net/1822/89520 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
1744-5760; 1744-5779; 10.1080/17445760.2023.2225854; https://www.tandfonline.com/doi/full/10.1080/17445760.2023.2225854 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Taylor & Francis |
publisher.none.fl_str_mv |
Taylor & Francis |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
mluisa.alvim@gmail.com |
_version_ |
1817544585443803136 |