Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark
Main author: | Donato, Mauricio Matter |
---|---|
Publication date: | 2020 |
Document type: | Master’s thesis |
Language: | por |
Source title: | Biblioteca Digital de Teses e Dissertações do UFSM |
Full text: | http://repositorio.ufsm.br/handle/1/22687 |
Abstract: | Apache Spark is a framework that processes massive quantities of data in memory through its primary abstraction, the Resilient Distributed Dataset (RDD). An RDD is an immutable collection of objects that can be processed in parallel across the cluster. Once processed, an RDD can be stored in the cache so that it can be reused without being recomputed. As the application’s computation proceeds, memory tends to become overloaded, and RDD partitions must be evicted according to the Least Recently Used (LRU) algorithm. This algorithm is based on the idea that partitions frequently used in the recent past will be accessed again soon; it therefore evicts the partition whose last access is the oldest. However, LRU can degrade Spark’s performance in some situations, such as when memory is accessed cyclically and the available space is smaller than the dataset. In those situations, LRU always evicts a block that will be accessed soon. Given these issues, this work proposes a model for Dynamic Memory Management in Applications with Data Reuse on Apache Spark. The model extracts metrics from the running application and uses them to decide which data to evict from the cache. It consists of two main components: (1) an algorithm that manages the RDD partitions stored in memory and (2) a monitoring agent responsible for gathering information about the application’s execution. The Dynamic Management model was validated through experiments on the Grid’5000 platform with the PageRank, K-Means, and Logistic Regression benchmarks. The results show that the model made better use of the available memory, reducing the execution time of the Logistic Regression benchmark by 23.94% compared with LRU. Furthermore, the proposed model made Spark’s execution more stable, reducing the frequency of errors while processing the benchmarks. As a consequence, the time spent processing the PageRank benchmark fell by 34.14%. These results allow the conclusion that dynamic strategies, such as the one proposed in this work, can improve Spark’s performance in applications that reuse data. |
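
The cyclic-access problem described in the abstract can be illustrated with a minimal sketch in plain Python. This is not Spark's block manager and not the thesis's algorithm; the function `lru_hits`, the partition labels, and the workload sizes are hypothetical and chosen only to show why LRU evicts exactly the block that is needed next when the cache is smaller than a dataset scanned in a fixed order.

```python
from collections import OrderedDict


def lru_hits(capacity, accesses):
    """Count cache hits for an LRU cache that holds `capacity` blocks."""
    cache = OrderedDict()  # insertion order doubles as recency order
    hits = 0
    for block in accesses:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits


# Hypothetical workload: 4 partitions scanned in the same order for 5 iterations.
partitions = ["p0", "p1", "p2", "p3"]
cyclic = partitions * 5

hits_small = lru_hits(3, cyclic)  # cache one block smaller than the dataset
hits_full = lru_hits(4, cyclic)   # cache large enough for the whole dataset

print(f"capacity 3: {hits_small} hits out of {len(cyclic)} accesses")
print(f"capacity 4: {hits_full} hits out of {len(cyclic)} accesses")
```

In this toy workload a cache only one block short of the dataset gets no hits at all, while a cache that fits the dataset misses only on the first pass. That gap is the behavior a metrics-driven eviction policy would aim to avoid.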
id |
UFSM_aac95071b03a1e1fd510f3c04a1c8cf2 |
---|---|
oai_identifier_str |
oai:repositorio.ufsm.br:1/22687 |
network_acronym_str |
UFSM |
network_name_str |
Biblioteca Digital de Teses e Dissertações do UFSM |
repository_id_str |
|
dc.title.por.fl_str_mv |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
dc.title.alternative.eng.fl_str_mv |
Dynamic memory management in applications with data reuse on Apache Spark |
title |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
spellingShingle |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark Donato, Mauricio Matter Spark Gerenciamento Memória Reuso Dinâmico Management Memory Reuse Dynamic CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
title_short |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
title_full |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
title_fullStr |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
title_full_unstemmed |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
title_sort |
Gerenciamento dinâmico de memória em aplicações com reuso de dados no Apache Spark |
author |
Donato, Mauricio Matter |
author_facet |
Donato, Mauricio Matter |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Barcelos, Patrícia Pitthan de Araújo |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/6069105173950277 |
dc.contributor.referee1.fl_str_mv |
Lima, João Vicente Ferreira |
dc.contributor.referee2.fl_str_mv |
Wives, Leandro Krug |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/2203534256446729 |
dc.contributor.author.fl_str_mv |
Donato, Mauricio Matter |
contributor_str_mv |
Barcelos, Patrícia Pitthan de Araújo Lima, João Vicente Ferreira Wives, Leandro Krug |
dc.subject.por.fl_str_mv |
Spark Gerenciamento Memória Reuso Dinâmico |
topic |
Spark Gerenciamento Memória Reuso Dinâmico Management Memory Reuse Dynamic CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
dc.subject.eng.fl_str_mv |
Management Memory Reuse Dynamic |
dc.subject.cnpq.fl_str_mv |
CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO |
publishDate |
2020 |
dc.date.issued.fl_str_mv |
2020-05-25 |
dc.date.accessioned.fl_str_mv |
2021-11-03T17:42:24Z |
dc.date.available.fl_str_mv |
2021-11-03T17:42:24Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://repositorio.ufsm.br/handle/1/22687 |
url |
http://repositorio.ufsm.br/handle/1/22687 |
dc.language.iso.fl_str_mv |
por |
language |
por |
dc.relation.cnpq.fl_str_mv |
100300000007 |
dc.relation.confidence.fl_str_mv |
600 600 600 600 600 |
dc.relation.authority.fl_str_mv |
a421e37e-ac28-4452-9943-e257a19343ca 78570edf-feab-4fee-ad24-b0909e34c3b3 aec00059-8729-40bd-b989-bf07fcb3bdbd 871918b5-3824-4183-ba59-f4e2e7ef4ec3 |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivatives 4.0 International http://creativecommons.org/licenses/by-nc-nd/4.0/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Santa Maria Centro de Tecnologia |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação |
dc.publisher.initials.fl_str_mv |
UFSM |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
Ciência da Computação |
publisher.none.fl_str_mv |
Universidade Federal de Santa Maria Centro de Tecnologia |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações do UFSM instname:Universidade Federal de Santa Maria (UFSM) instacron:UFSM |
instname_str |
Universidade Federal de Santa Maria (UFSM) |
instacron_str |
UFSM |
institution |
UFSM |
reponame_str |
Biblioteca Digital de Teses e Dissertações do UFSM |
collection |
Biblioteca Digital de Teses e Dissertações do UFSM |
bitstream.url.fl_str_mv |
http://repositorio.ufsm.br/bitstream/1/22687/1/DIS_PPGCC_2020_DONATO_MAUR%c3%8dCIO.pdf http://repositorio.ufsm.br/bitstream/1/22687/2/license_rdf http://repositorio.ufsm.br/bitstream/1/22687/3/license.txt http://repositorio.ufsm.br/bitstream/1/22687/4/DIS_PPGCC_2020_DONATO_MAUR%c3%8dCIO.pdf.txt http://repositorio.ufsm.br/bitstream/1/22687/5/DIS_PPGCC_2020_DONATO_MAUR%c3%8dCIO.pdf.jpg |
bitstream.checksum.fl_str_mv |
9e486533ac5f2d5a8a47856c27cd9335 4460e5956bc1d1639be9ae6146a50347 f8fcb28efb1c8cf0dc096bec902bf4c4 3e1b2a372414e72ad8f3ed1e3879f40f 0de8601e91f8a79dc6ecacea7eed82b3 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações do UFSM - Universidade Federal de Santa Maria (UFSM) |
repository.mail.fl_str_mv |
atendimento.sib@ufsm.br||tedebc@gmail.com |
_version_ |
1801485297237622784 |