Cache based global memory orchestration for data intensive stream processing pipelines
Main author: | Matteussi, Kassiano José |
---|---|
Publication date: | 2022 |
Document type: | Doctoral thesis |
Language: | eng |
Source: | Biblioteca Digital de Teses e Dissertações da UFRGS |
Full text: | http://hdl.handle.net/10183/259649 |
Abstract: | A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several big data in-memory, data-intensive processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks, among others. These streaming systems rely on the Java Virtual Machine (JVM) as an underlying processing environment for platform independence. Although it provides high-level hardware abstraction, the JVM cannot efficiently manage data-intensive applications that cache data heavily into the JVM heap. Consequently, it may lead to data loss, throughput degradation, and high latency due to several processing overheads induced by data deserialization, object scattering in main memory, garbage collection operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time stream processing, since it represents a critical aspect of performance. The proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying on solutions unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of the overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve the memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable. 
In addition, the proposed evaluation comprises real experiments in small and medium-sized data center infrastructures with fast network switches, provided by the French Grid'5000 testbed. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and may reduce global JVM heap memory utilization by up to 50% in the evaluated cases. |
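The abstract's cache-based orchestration model reacts to overall JVM heap utilization through memory utilization policies. As a minimal, hypothetical sketch of that idea (not the thesis's actual model), the standard `java.lang.management.MemoryMXBean` API can observe heap pressure and a simple threshold policy can decide when a cache orchestrator might trigger eviction; the class name and the 75% threshold below are illustrative assumptions:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Illustrative sketch of a threshold-based heap utilization policy.
// The thesis's orchestrator also accounts for processing vs. storage
// utilization; this example shows only the basic heap-pressure signal.
public class HeapPolicySketch {

    // Fraction of the maximum heap currently in use.
    static double utilizationRatio(long used, long max) {
        return max > 0 ? (double) used / (double) max : 0.0;
    }

    // True when heap pressure exceeds the (assumed) eviction threshold.
    static boolean shouldEvict(long used, long max, double threshold) {
        return utilizationRatio(used, max) > threshold;
    }

    public static void main(String[] args) {
        MemoryMXBean bean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = bean.getHeapMemoryUsage();
        double ratio = utilizationRatio(heap.getUsed(), heap.getMax());
        System.out.printf("heap utilization: %.1f%%%n", 100 * ratio);
        if (shouldEvict(heap.getUsed(), heap.getMax(), 0.75)) {
            System.out.println("pressure above 75% - an orchestrator could evict cached blocks here");
        }
    }
}
```

In a real deployment, such a signal would be sampled per executor and fed into a global policy rather than evaluated locally as shown here.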
id: | URGS_7884042b0990d7b127f03a608c22bf9d |
---|---|
OAI identifier: | oai:www.lume.ufrgs.br:10183/259649 |
Repository: | Biblioteca Digital de Teses e Dissertações da UFRGS (id 1853) - Universidade Federal do Rio Grande do Sul (UFRGS) |
Author: | Matteussi, Kassiano José |
Advisor: | Geyer, Claudio Fernando Resin |
Subjects (por): | Processamento de dados; Big data; Memória |
Subjects (eng): | Data-intensive stream processing; Streaming applications; Real-time big data analytics; Apache Spark; Memory management; Data orchestration |
Institution: | Universidade Federal do Rio Grande do Sul (UFRGS), Instituto de Informática, Programa de Pós-Graduação em Computação, Porto Alegre, BR-RS |
Degree: | Doctorate (info:eu-repo/semantics/doctoralThesis, publishedVersion) |
Date issued: | 2022 |
Date accessioned: | 2023-06-30T03:30:04Z |
Local identifier: | 001156184 |
URL: | http://hdl.handle.net/10183/259649 |
Language: | eng |
Format: | application/pdf |
Rights: | info:eu-repo/semantics/openAccess |
Files: | http://www.lume.ufrgs.br/bitstream/10183/259649/1/001156184.pdf (full text, English; MD5 14a58891c4afc9a763cca22982b7c3b7); http://www.lume.ufrgs.br/bitstream/10183/259649/2/001156184.pdf.txt (extracted text; MD5 b84a25f3f0ce64af0270462aca3604a0) |
Contact: | lume@ufrgs.br |