ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS

Veloso, Lays Helena Lopes

Bibliographic details
Main author:	Veloso, Lays Helena Lopes
Publication date:	2015
Document type:	Master's thesis
Language:	Portuguese (por)
Source title:	Biblioteca Digital de Teses e Dissertações da UEPG
Full text:	http://tede2.uepg.br/jspui/handle/prefix/127
Abstract:	This study investigated the use of a parallel K-Means clustering algorithm, based on the MapReduce parallel model, to improve the response time of data mining. The parallel K-Means was implemented in three phases, performed in each iteration: assignment of samples to the group with the nearest centroid by the Mappers, in parallel; local grouping of the samples assigned to the same group by each Mapper, using a Combiner; and update of the centroids by the Reducer. The performance of the algorithm was evaluated with respect to SpeedUp and ScaleUp. To this end, experiments were run in single-node mode and on a Hadoop cluster of six off-the-shelf computers. The clustered data are flux tower measurements from agricultural regions and belong to Ameriflux. The results showed performance gains as the number of machines increased; the best time was obtained with six machines, reaching a SpeedUp of 3.25. To support the results, an ANOVA table was built from repeated runs using 3, 4 and 6 machines in the cluster. The ANOVA shows low variance among the execution times obtained with the same number of machines and a significant difference between the means for each number of machines. The ScaleUp analysis shows that the application scales well under an equivalent increase in data size and number of machines, achieving similar performance. With results as expected, this work presents a parallel and scalable implementation of K-Means that runs on a Hadoop cluster and improves the response time of clustering large databases.
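The three phases named in the abstract can be illustrated with a minimal single-process sketch in Python. This is not the thesis code, which targets Hadoop; the function names (`map_phase`, `combine_phase`, `reduce_phase`, `kmeans_iteration`), the use of Euclidean distance, and the rule that an empty group keeps its previous centroid are all assumptions made for illustration.

```python
from math import dist  # Euclidean distance (Python 3.8+); the metric is an assumption


def map_phase(samples, centroids):
    """Mapper: emit (index of nearest centroid, (sample, 1)) for each sample."""
    for sample in samples:
        nearest = min(range(len(centroids)),
                      key=lambda i: dist(sample, centroids[i]))
        yield nearest, (sample, 1)


def combine_phase(pairs):
    """Combiner: sum the vectors and counts emitted for each group."""
    partial = {}
    for key, (vector, count) in pairs:
        if key in partial:
            acc, n = partial[key]
            partial[key] = ([a + v for a, v in zip(acc, vector)], n + count)
        else:
            partial[key] = (list(vector), count)
    return list(partial.items())


def reduce_phase(partials, centroids):
    """Reducer: merge the partial sums and compute the updated centroids."""
    merged = combine_phase(partials)  # the same associative sum merges partials
    updated = list(centroids)         # assumption: empty groups keep their centroid
    for key, (total, n) in merged:
        updated[key] = tuple(t / n for t in total)
    return updated


def kmeans_iteration(splits, centroids):
    """One iteration; each split stands in for the input of one Mapper."""
    partials = []
    for split in splits:  # on a cluster these would run in parallel
        partials.extend(combine_phase(map_phase(split, centroids)))
    return reduce_phase(partials, centroids)
```

Because the Combiner forwards only one partial sum and one count per group from each split, the data sent to the Reducer grows with the number of groups rather than with the number of samples. A full run would repeat `kmeans_iteration` until the centroids stop changing or an iteration limit is reached.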

Item metadata

id	UEPG_26ffa70635c10512bdcb5e28bb36def0
oai_identifier_str	oai:tede2.uepg.br:prefix/127
network_acronym_str	UEPG
network_name_str	Biblioteca Digital de Teses e Dissertações da UEPG
repository_id_str
spelling	Made available in DSpace on 2017-07-21T14:19:24Z (GMT). No. of bitstreams: 1. Lays Veloso.pdf: 1140015 bytes, checksum: c544c69a03612a2909b7011c936788ee (MD5). Previous issue date: 2015-04-29; Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
dc.title.por.fl_str_mv	ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS
author	Veloso, Lays Helena Lopes
author_facet	Veloso, Lays Helena Lopes
author_role	author
dc.contributor.advisor1.fl_str_mv	Senger, Luciano José
dc.contributor.advisor1ID.fl_str_mv	CPF:93591187968
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/6880696447532558
dc.contributor.referee1.fl_str_mv	Vaz, Maria Salete Marcon Gomes
dc.contributor.referee1ID.fl_str_mv	CPF:44311931972
dc.contributor.referee1Lattes.fl_str_mv	http://lattes.cnpq.br/2266103198034845
dc.contributor.referee2.fl_str_mv	Góis, Lourival Aparecido de
dc.contributor.referee2ID.fl_str_mv	CPF:04534444826
dc.contributor.referee2Lattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4775580P1
dc.contributor.authorID.fl_str_mv	CPF:07585468903
dc.contributor.authorLattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4404351D3
dc.contributor.author.fl_str_mv	Veloso, Lays Helena Lopes
contributor_str_mv	Senger, Luciano José; Vaz, Maria Salete Marcon Gomes; Góis, Lourival Aparecido de
dc.subject.por.fl_str_mv	K-Means Paralelo; MapReduce; Hadoop; dados de fluxo; mineração de dados
topic	K-Means Paralelo; MapReduce; Hadoop; dados de fluxo; mineração de dados; Parallel K-Means; MapReduce; Hadoop; flux data; data mining; CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
dc.subject.eng.fl_str_mv	Parallel K-Means; MapReduce; Hadoop; flux data; data mining
dc.subject.cnpq.fl_str_mv	CNPQ::CIENCIAS EXATAS E DA TERRA::CIENCIA DA COMPUTACAO
publishDate	2015
dc.date.available.fl_str_mv	2015-07-06 2017-07-21T14:19:24Z
dc.date.issued.fl_str_mv	2015-04-29
dc.date.accessioned.fl_str_mv	2017-07-21T14:19:24Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	VELOSO, Lays Helena Lopes. ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS. 2015. 62 f. Dissertação (Mestrado em Computação para Tecnologias em Agricultura) - UNIVERSIDADE ESTADUAL DE PONTA GROSSA, Ponta Grossa, 2015.
dc.identifier.uri.fl_str_mv	http://tede2.uepg.br/jspui/handle/prefix/127
identifier_str_mv	VELOSO, Lays Helena Lopes. ALGORITMO K-MEANS PARALELO BASEADO EM HADOOP-MAPREDUCE PARA MINERAÇÃO DE DADOS AGRÍCOLAS. 2015. 62 f. Dissertação (Mestrado em Computação para Tecnologias em Agricultura) - UNIVERSIDADE ESTADUAL DE PONTA GROSSA, Ponta Grossa, 2015.
url	http://tede2.uepg.br/jspui/handle/prefix/127
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	UNIVERSIDADE ESTADUAL DE PONTA GROSSA
dc.publisher.program.fl_str_mv	Programa de Pós Graduação Computação Aplicada
dc.publisher.initials.fl_str_mv	UEPG
dc.publisher.country.fl_str_mv	BR
dc.publisher.department.fl_str_mv	Computação para Tecnologias em Agricultura
publisher.none.fl_str_mv	UNIVERSIDADE ESTADUAL DE PONTA GROSSA
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da UEPG instname:Universidade Estadual de Ponta Grossa (UEPG) instacron:UEPG
instname_str	Universidade Estadual de Ponta Grossa (UEPG)
instacron_str	UEPG
institution	UEPG
reponame_str	Biblioteca Digital de Teses e Dissertações da UEPG
collection	Biblioteca Digital de Teses e Dissertações da UEPG
bitstream.url.fl_str_mv	http://tede2.uepg.br/jspui/bitstream/prefix/127/1/Lays%20Veloso.pdf
bitstream.checksum.fl_str_mv	c544c69a03612a2909b7011c936788ee
bitstream.checksumAlgorithm.fl_str_mv	MD5
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da UEPG - Universidade Estadual de Ponta Grossa (UEPG)
repository.mail.fl_str_mv	bicen@uepg.br; mv_fidelis@yahoo.com.br
_version_	1809460446268227584
