Improving data locality of bagging ensembles for data streams through mini-batching
Main author: | Cassales, Guilherme |
---|---|
Publication date: | 2021 |
Document type: | Doctoral thesis |
Language: | eng |
Source: | Repositório Institucional da UFSCAR |
Full text: | https://repositorio.ufscar.br/handle/ufscar/15176 |
Abstract: | Machine Learning techniques have been employed in virtually all domains in the past few years. In many applications, learning algorithms must cope with dynamic environments, under both memory and time constraints, to provide a (near) real-time answer. In this scenario, ensemble learning comprises a class of stream mining algorithms that achieves remarkable predictive performance. Ensembles are implemented as a set of (several) individual learners whose predictions are aggregated to classify new incoming instances. Although ensembles can be computationally more expensive, they are naturally amenable to task-parallelism. However, the incremental learning and dynamic data structures used to capture concept drift increase cache misses and hinder the benefit of parallelism. In this thesis, we devise a method that reduces the execution time and increases the energy efficiency of several bagging ensembles for data streams. The method is based on a task-parallel model that leverages the natural independence of the underlying learners in this class of ensembles (bagging). The parallel model is combined with a mini-batching technique that improves the memory access locality of the ensembles. We consistently achieve speedups of 4X to 5X with 8 cores, and even a superlinear speedup of 12X in one case. We demonstrate that mini-batching can significantly decrease the reuse distance and the number of cache misses. We provide data on the trade-off between the reduction in execution time and the loss in predictive performance (ranging from less than 1% up to 12%). We conclude that the loss in predictive performance depends on the dataset characteristics and the mini-batch size used. We present evidence that small mini-batch sizes (e.g., up to 50 examples) provide a good compromise between execution time and predictive performance. We demonstrate that energy efficiency can be improved under three different workloads.
Although the biggest reduction in energy consumption occurs at the smallest workload, it comes at the cost of a large delay in response time, which may conflict with the goal of real-time processing. At higher workloads, however, the proposed method outperforms the baseline version in both energy consumption and response-time delay. We evaluate our method on a total of six different hardware platforms across all experimental frameworks, using up to six different algorithms and up to five different datasets. By providing data on the execution of the proposed method in such a wide range of setups, we believe the proposed method is a viable solution for improving the performance of online bagging ensembles. |
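The core idea described in the abstract — grouping stream instances into mini-batches and giving each ensemble member the whole batch as one parallel task, so that each learner's state stays cache-resident — can be illustrated with a minimal sketch. This is not the thesis's implementation (which targets existing stream-mining ensembles); it is an illustrative Python approximation assuming the classic Oza–Russell online-bagging scheme (Poisson(1) resampling, test-then-train) with a hypothetical toy learner. All names (`MajorityClassLearner`, `process_mini_batch`, `run`) are invented for this example.

```python
# Hedged sketch: mini-batched, task-parallel online bagging.
import math
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def train(self, x, y, weight=1):
        self.counts[y] += weight

def poisson1(rng, lam=1.0):
    # Knuth's method for a Poisson(lam) draw, as in online bagging.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def process_mini_batch(learner, batch, rng):
    """One task touches ONE learner's state for the WHOLE mini-batch,
    which is the memory-locality gain mini-batching provides."""
    preds = []
    for x, y in batch:
        preds.append(learner.predict(x))   # test-then-train order
        k = poisson1(rng)                  # online-bagging resampling weight
        if k > 0:
            learner.train(x, y, weight=k)
    return preds

def run(stream, n_learners=4, batch_size=10):
    learners = [MajorityClassLearner() for _ in range(n_learners)]
    rngs = [random.Random(i) for i in range(n_learners)]
    results = []
    with ThreadPoolExecutor(max_workers=n_learners) as pool:
        batch = []
        for item in stream:
            batch.append(item)
            if len(batch) == batch_size:
                # One parallel task per learner over the full mini-batch.
                per_learner = list(pool.map(
                    lambda args: process_mini_batch(*args),
                    zip(learners, [batch] * n_learners, rngs)))
                # Majority vote across learners, per instance.
                for votes in zip(*per_learner):
                    valid = [v for v in votes if v is not None]
                    results.append(Counter(valid).most_common(1)[0][0]
                                   if valid else None)
                batch = []
    # Note: a trailing partial batch is dropped in this sketch; a real
    # implementation would flush it, trading latency for locality.
    return results
```

The batch size is exactly the latency/locality knob the thesis discusses: larger batches improve cache behavior but delay answers, which is why small sizes (up to ~50 examples) are reported as a good compromise.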
id |
SCAR_8d049b3ce14e42860c5d412e1497757e |
oai_identifier_str |
oai:repositorio.ufscar.br:ufscar/15176 |
network_acronym_str |
SCAR |
network_name_str |
Repositório Institucional da UFSCAR |
spelling |
Cassales, Guilherme. Advisor: Senger, Hermes. Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001; Programa Institucional de Internacionalização CAPES-PrInt UFSCar (Contract 88887.373234/2019-00); Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP - contract number 2018/22979-2). |
dc.title.eng.fl_str_mv |
Improving data locality of bagging ensembles for data streams through mini-batching |
dc.title.alternative.por.fl_str_mv |
Melhorando a localidade de dados dos comitês classificadores do tipo bagging para fluxos contínuos de dados através da técnica de mini-batching |
author |
Cassales, Guilherme |
dc.contributor.authorlattes.por.fl_str_mv |
http://lattes.cnpq.br/6191125593821481 |
dc.contributor.author.fl_str_mv |
Cassales, Guilherme |
dc.contributor.advisor1.fl_str_mv |
Senger, Hermes |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/3691742159298316 |
dc.subject.por.fl_str_mv |
Aprendizagem de fluxo de dados Paralelismo de tarefas multicore Comitês de aprendizagem Algoritmos de bagging Múltiplas plataformas Consumo de energia |
dc.subject.eng.fl_str_mv |
Data stream learning Multicore task-parallelism Ensemble learners Bagging algorithms Multiple platforms Energy consumption |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
dc.date.accessioned.fl_str_mv |
2021-11-26T14:46:36Z |
dc.date.available.fl_str_mv |
2021-11-26T14:46:36Z |
dc.date.issued.fl_str_mv |
2021-08-27 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
dc.identifier.citation.fl_str_mv |
CASSALES, Guilherme. Improving data locality of bagging ensembles for data streams through mini-batching. 2021. Tese (Doutorado em Ciência da Computação) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/15176. |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufscar.br/handle/ufscar/15176 |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de São Carlos Câmpus São Carlos |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação - PPGCC |
dc.publisher.initials.fl_str_mv |
UFSCar |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR |
bitstream.url.fl_str_mv |
https://repositorio.ufscar.br/bitstream/ufscar/15176/1/Tese_FA_Guilherme_Cassales.pdf https://repositorio.ufscar.br/bitstream/ufscar/15176/2/Carta-Tese-Guilherme.pdf https://repositorio.ufscar.br/bitstream/ufscar/15176/3/license_rdf https://repositorio.ufscar.br/bitstream/ufscar/15176/4/Tese_FA_Guilherme_Cassales.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15176/6/Carta-Tese-Guilherme.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15176/5/Tese_FA_Guilherme_Cassales.pdf.jpg https://repositorio.ufscar.br/bitstream/ufscar/15176/7/Carta-Tese-Guilherme.pdf.jpg |
repository.name.fl_str_mv |
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR) |