Cache based global memory orchestration for data intensive stream processing pipelines

Bibliographic details
Main author: Matteussi, Kassiano José
Publication date: 2022
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da UFRGS
Full text: http://hdl.handle.net/10183/259649
Abstract: A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several in-memory, data-intensive big data processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. These streaming systems rely on the Java Virtual Machine (JVM) as the underlying processing environment for platform independence. Although the JVM provides a high-level hardware abstraction, it cannot efficiently manage applications that intensively cache data in the JVM heap. Consequently, it may incur data loss, throughput degradation, and high latency due to processing overheads induced by data deserialization, object scattering in main memory, garbage collection (GC) operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time data analysis, since it is a critical aspect of stream processing performance. Proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable.
The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches, provided by the French grid testbed Grid'5000. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and can reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
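The heap-pressure problem the abstract describes can be made concrete with a minimal sketch: a threshold-based policy that inspects overall JVM heap occupancy (via the standard `MemoryMXBean`) and decides whether cached data should be evicted. The class name, threshold, and eviction message are illustrative assumptions for exposition, not the orchestration model actually proposed in the thesis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of a heap-utilization policy of the kind a cache-aware
// memory orchestrator might apply; names and thresholds are illustrative only.
public class HeapPolicy {

    // True when heap occupancy reaches or exceeds the eviction threshold.
    public static boolean shouldEvict(long usedBytes, long maxBytes, double threshold) {
        return maxBytes > 0 && (double) usedBytes / maxBytes >= threshold;
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        // getMax() may be undefined (-1); fall back to the committed size.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        double occupancy = (double) heap.getUsed() / max;
        System.out.printf("heap occupancy: %.1f%%%n", occupancy * 100);
        if (shouldEvict(heap.getUsed(), max, 0.8)) {
            System.out.println("policy: evict cached blocks to relieve GC pressure");
        } else {
            System.out.println("policy: keep caching");
        }
    }
}
```

In a real orchestrator this check would run periodically and feed a global view of processing and storage utilization, rather than a single local threshold.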
id URGS_7884042b0990d7b127f03a608c22bf9d
oai_identifier_str oai:www.lume.ufrgs.br:10183/259649
network_acronym_str URGS
network_name_str Biblioteca Digital de Teses e Dissertações da UFRGS
repository_id_str 1853
spelling Matteussi, Kassiano José | Geyer, Claudio Fernando Resin (advisor) | 2023-06-30T03:30:04Z | 2022 | http://hdl.handle.net/10183/259649 | 001156184 | application/pdf | eng | Processamento de dados | Big data | Memória | Data-intensive stream processing | Streaming applications | Real-time big data analytics | Apache spark | Memory management | Data orchestration | Cache based global memory orchestration for data intensive stream processing pipelines | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/doctoralThesis | Universidade Federal do Rio Grande do Sul | Instituto de Informática | Programa de Pós-Graduação em Computação | Porto Alegre, BR-RS | 2022 | doutorado | info:eu-repo/semantics/openAccess | reponame:Biblioteca Digital de Teses e Dissertações da UFRGS | instname:Universidade Federal do Rio Grande do Sul (UFRGS) | instacron:UFRGS
dc.title.pt_BR.fl_str_mv Cache based global memory orchestration for data intensive stream processing pipelines
title Cache based global memory orchestration for data intensive stream processing pipelines
spellingShingle Cache based global memory orchestration for data intensive stream processing pipelines
Matteussi, Kassiano José
Processamento de dados
Big data
Memória
Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
title_short Cache based global memory orchestration for data intensive stream processing pipelines
title_full Cache based global memory orchestration for data intensive stream processing pipelines
title_fullStr Cache based global memory orchestration for data intensive stream processing pipelines
title_full_unstemmed Cache based global memory orchestration for data intensive stream processing pipelines
title_sort Cache based global memory orchestration for data intensive stream processing pipelines
author Matteussi, Kassiano José
author_facet Matteussi, Kassiano José
author_role author
dc.contributor.author.fl_str_mv Matteussi, Kassiano José
dc.contributor.advisor1.fl_str_mv Geyer, Claudio Fernando Resin
contributor_str_mv Geyer, Claudio Fernando Resin
dc.subject.por.fl_str_mv Processamento de dados
Big data
Memória
topic Processamento de dados
Big data
Memória
Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
dc.subject.eng.fl_str_mv Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
description A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several in-memory, data-intensive big data processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. These streaming systems rely on the Java Virtual Machine (JVM) as the underlying processing environment for platform independence. Although the JVM provides a high-level hardware abstraction, it cannot efficiently manage applications that intensively cache data in the JVM heap. Consequently, it may incur data loss, throughput degradation, and high latency due to processing overheads induced by data deserialization, object scattering in main memory, garbage collection (GC) operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time data analysis, since it is a critical aspect of stream processing performance. Proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable.
The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches, provided by the French grid testbed Grid'5000. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and can reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
publishDate 2022
dc.date.issued.fl_str_mv 2022
dc.date.accessioned.fl_str_mv 2023-06-30T03:30:04Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10183/259649
dc.identifier.nrb.pt_BR.fl_str_mv 001156184
url http://hdl.handle.net/10183/259649
identifier_str_mv 001156184
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFRGS
instname:Universidade Federal do Rio Grande do Sul (UFRGS)
instacron:UFRGS
instname_str Universidade Federal do Rio Grande do Sul (UFRGS)
instacron_str UFRGS
institution UFRGS
reponame_str Biblioteca Digital de Teses e Dissertações da UFRGS
collection Biblioteca Digital de Teses e Dissertações da UFRGS
bitstream.url.fl_str_mv http://www.lume.ufrgs.br/bitstream/10183/259649/2/001156184.pdf.txt
http://www.lume.ufrgs.br/bitstream/10183/259649/1/001156184.pdf
bitstream.checksum.fl_str_mv b84a25f3f0ce64af0270462aca3604a0
14a58891c4afc9a763cca22982b7c3b7
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
repository.mail.fl_str_mv lume@ufrgs.br||lume@ufrgs.br
_version_ 1810085621350268928