Hive on spark and MapReduce : a methodology for parameter tuning

Bibliographic details
Main author: Forster, Rodrigo Richard
Publication date: 2018
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/52854
Summary: Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies Management
id RCAP_ca2dad9351d49acf1ad2b7b94630f2a4
oai_identifier_str oai:run.unl.pt:10362/52854
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
abstract With the arrival of the era of “big data”, more and more companies are using distributed file systems, such as the Hadoop Distributed File System (HDFS), to manage and process their data streams. HDFS stores large files across multiple machines, and large data sets are processed using Hadoop's native programming model, MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce that claims a performance boost of up to 10 times for certain applications while maintaining automatic fault tolerance. Apache Hive was introduced to leverage the data warehouse capabilities of Hadoop. It is a Big Data analytics layer that works on top of Hadoop, provides data analysis tools and, most importantly, translates queries into MapReduce and Spark jobs. It thereby exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to exploit the full potential of the Apache Spark execution engine, which results in very long execution times. This project work therefore gives researchers and companies a tuning methodology that can significantly improve the execution time of queries. Applied to a real-world batch-processing query, the methodology improved execution time by a factor of 5. Moreover, it gives insights into the underlying reasons for this large improvement by using Apache Spark monitoring tools. The results can be helpful for practitioners and researchers who would like to optimize the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster.
author Forster, Rodrigo Richard
author_facet Forster, Rodrigo Richard
author_role author
dc.contributor.none.fl_str_mv Santos, Vitor Manuel Pereira Duarte dos
RUN
dc.contributor.author.fl_str_mv Forster, Rodrigo Richard
dc.subject.por.fl_str_mv Tuning
Hive on Spark
MapReduce
Apache Spark
Big Data
HDFS
Hadoop
Data Warehouse
publishDate 2018
dc.date.none.fl_str_mv 2018-11-26T14:59:01Z
2018-10-29
2018-10-29T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/52854
TID:202028755
url http://hdl.handle.net/10362/52854
identifier_str_mv TID:202028755
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP