Cache based global memory orchestration for data intensive stream processing pipelines

Bibliographic details
Main author: Matteussi, Kassiano José
Publication date: 2022
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da UFRGS
Full text: http://hdl.handle.net/10183/259649
Abstract: A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several in-memory, data-intensive big data processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. These streaming systems rely on the Java Virtual Machine (JVM) as the underlying processing environment for platform independence. Although the JVM provides a high-level hardware abstraction, it cannot efficiently manage applications that intensively cache data in the JVM heap. Consequently, it may incur data loss, throughput degradation, and high latency due to processing overheads induced by data deserialization, object scattering in main memory, garbage collection (GC) operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time data analysis, since it is a critical aspect of stream processing performance. Proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable.
The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches, provided by the French grid testbed Grid'5000. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and can reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
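The heap-pressure problem the abstract describes can be made concrete with a minimal sketch: a threshold-based policy that inspects overall JVM heap occupancy (via the standard `MemoryMXBean`) and decides whether cached data should be evicted. The class name, threshold, and eviction message are illustrative assumptions for exposition, not the orchestration model actually proposed in the thesis.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical sketch of a heap-utilization policy of the kind a cache-aware
// memory orchestrator might apply; names and thresholds are illustrative only.
public class HeapPolicy {

    // True when heap occupancy reaches or exceeds the eviction threshold.
    public static boolean shouldEvict(long usedBytes, long maxBytes, double threshold) {
        return maxBytes > 0 && (double) usedBytes / maxBytes >= threshold;
    }

    public static void main(String[] args) {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = mem.getHeapMemoryUsage();
        // getMax() may be undefined (-1); fall back to the committed size.
        long max = heap.getMax() > 0 ? heap.getMax() : heap.getCommitted();
        double occupancy = (double) heap.getUsed() / max;
        System.out.printf("heap occupancy: %.1f%%%n", occupancy * 100);
        if (shouldEvict(heap.getUsed(), max, 0.8)) {
            System.out.println("policy: evict cached blocks to relieve GC pressure");
        } else {
            System.out.println("policy: keep caching");
        }
    }
}
```

In a real orchestrator this check would run periodically and feed a global view of processing and storage utilization, rather than a single local threshold.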
id URGS_7884042b0990d7b127f03a608c22bf9d
oai_identifier_str oai:www.lume.ufrgs.br:10183/259649
network_acronym_str URGS
network_name_str Biblioteca Digital de Teses e Dissertações da UFRGS
repository_id_str 1853
spelling Matteussi, Kassiano José | Geyer, Claudio Fernando Resin (advisor) | 2023-06-30T03:30:04Z | 2022 | http://hdl.handle.net/10183/259649 | 001156184 | application/pdf | eng | Processamento de dados | Big data | Memória | Data-intensive stream processing | Streaming applications | Real-time big data analytics | Apache spark | Memory management | Data orchestration | Cache based global memory orchestration for data intensive stream processing pipelines | info:eu-repo/semantics/publishedVersion | info:eu-repo/semantics/doctoralThesis | Universidade Federal do Rio Grande do Sul | Instituto de Informática | Programa de Pós-Graduação em Computação | Porto Alegre, BR-RS | 2022 | doutorado | info:eu-repo/semantics/openAccess | reponame:Biblioteca Digital de Teses e Dissertações da UFRGS | instname:Universidade Federal do Rio Grande do Sul (UFRGS) | instacron:UFRGS
dc.title.pt_BR.fl_str_mv Cache based global memory orchestration for data intensive stream processing pipelines
title Cache based global memory orchestration for data intensive stream processing pipelines
spellingShingle Cache based global memory orchestration for data intensive stream processing pipelines
Matteussi, Kassiano José
Processamento de dados
Big data
Memória
Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
title_short Cache based global memory orchestration for data intensive stream processing pipelines
title_full Cache based global memory orchestration for data intensive stream processing pipelines
title_fullStr Cache based global memory orchestration for data intensive stream processing pipelines
title_full_unstemmed Cache based global memory orchestration for data intensive stream processing pipelines
title_sort Cache based global memory orchestration for data intensive stream processing pipelines
author Matteussi, Kassiano José
author_facet Matteussi, Kassiano José
author_role author
dc.contributor.author.fl_str_mv Matteussi, Kassiano José
dc.contributor.advisor1.fl_str_mv Geyer, Claudio Fernando Resin
contributor_str_mv Geyer, Claudio Fernando Resin
dc.subject.por.fl_str_mv Processamento de dados
Big data
Memória
topic Processamento de dados
Big data
Memória
Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
dc.subject.eng.fl_str_mv Data-intensive stream processing
Streaming applications
Real-time big data analytics
Apache spark
Memory management
Data orchestration
description A significant rise in the adoption of streaming applications has changed decision-making processes in the last decade. This movement led to the emergence of several in-memory, data-intensive big data processing technologies, such as Apache Storm, Spark, Heron, Samza, and Flink, across varied areas and domains including financial services, healthcare, education, manufacturing, retail, social media, and sensor networks. These streaming systems rely on the Java Virtual Machine (JVM) as the underlying processing environment for platform independence. Although the JVM provides a high-level hardware abstraction, it cannot efficiently manage applications that intensively cache data in the JVM heap. Consequently, it may incur data loss, throughput degradation, and high latency due to processing overheads induced by data deserialization, object scattering in main memory, garbage collection (GC) operations, and others. The state of the art reinforces that efficient memory management plays a prominent role in real-time data analysis, since it is a critical aspect of stream processing performance. Proposed solutions have provided strategies for optimizing the shuffle-driven eviction process, job-level caching, and GC-performance-based cache allocation models on top of Apache Spark and Flink. However, these studies do not present mechanisms for controlling the JVM state, relying instead on solutions unaware of streaming systems' processing and storage utilization. This thesis tackles this issue by considering the impact of overall JVM utilization for processing and storage operations, using a cache-based global memory orchestration model with well-defined memory utilization policies. It aims to improve memory management of data-intensive stream processing (SP) pipelines, avoid memory-based performance issues, and keep application throughput stable.
The proposed evaluation comprises real experiments on small and medium-sized data center infrastructures with fast network switches, provided by the French grid testbed Grid'5000. The experiments use Spark Streaming and real-world streaming applications with representative in-memory execution and storage utilization (e.g., data cache operations, stateful processing, and checkpointing). The results revealed that the proposed solution kept throughput stable at a high rate (e.g., ~1 GB/s for small and medium-sized clusters) and can reduce global JVM heap memory utilization by up to 50% in the evaluated cases.
publishDate 2022
dc.date.issued.fl_str_mv 2022
dc.date.accessioned.fl_str_mv 2023-06-30T03:30:04Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10183/259649
dc.identifier.nrb.pt_BR.fl_str_mv 001156184
url http://hdl.handle.net/10183/259649
identifier_str_mv 001156184
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Biblioteca Digital de Teses e Dissertações da UFRGS
instname:Universidade Federal do Rio Grande do Sul (UFRGS)
instacron:UFRGS
instname_str Universidade Federal do Rio Grande do Sul (UFRGS)
instacron_str UFRGS
institution UFRGS
reponame_str Biblioteca Digital de Teses e Dissertações da UFRGS
collection Biblioteca Digital de Teses e Dissertações da UFRGS
bitstream.url.fl_str_mv http://www.lume.ufrgs.br/bitstream/10183/259649/2/001156184.pdf.txt
http://www.lume.ufrgs.br/bitstream/10183/259649/1/001156184.pdf
bitstream.checksum.fl_str_mv b84a25f3f0ce64af0270462aca3604a0
14a58891c4afc9a763cca22982b7c3b7
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da UFRGS - Universidade Federal do Rio Grande do Sul (UFRGS)
repository.mail.fl_str_mv lume@ufrgs.br||lume@ufrgs.br
_version_ 1810085621350268928