G2P-DBSCAN: Data Partitioning Strategy and Distributed Processing of DBSCAN with MapReduce.

Antônio Cavalcante Araújo Neto

Bibliographic details
Main author:	Antônio Cavalcante Araújo Neto
Publication date:	2015
Document type:	Master's thesis
Language:	Portuguese
Source title:	Biblioteca Digital de Teses e Dissertações da UFC
Full text:	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=15592
Abstract:	Clustering is a data mining technique that groups the elements of a data set so that elements of the same group are more similar to each other than to elements of other groups. This thesis studies the problem of running the density-based clustering algorithm DBSCAN in a distributed manner using the MapReduce paradigm. In distributed processing it is important that the data partitions to be processed have approximately equal sizes, since the total processing time is bounded by the time that the node holding the largest amount of data takes to finish the computation assigned to it. For this reason we also propose a data partitioning strategy, called G2P, which seeks to distribute the data set evenly across partitions while taking the characteristics of the DBSCAN algorithm into account. More specifically, the G2P strategy uses grid and graph structures to help divide the space in low-density regions. The distributed processing of DBSCAN consists of two MapReduce phases and an intermediate phase that identifies clusters that may have been split across more than one partition, called merge candidates. The first MapReduce phase applies DBSCAN to each data partition individually; the second verifies the merge candidate clusters and corrects them where necessary. Experiments on real data sets show that the G2P-DBSCAN strategy outperforms the baseline in all scenarios considered, both in execution time and in the quality of the partitions obtained.
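
The three-step flow described in the abstract (local DBSCAN per partition, detection of merge candidates, merging) can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the function names are hypothetical, a single vertical cut stands in for the G2P grid-and-graph partitioning, plain Python loops stand in for the MapReduce jobs, and two local clusters are merged whenever any of their points lie within eps of each other, which is a simplification of the verification step that the abstract does not detail.

```python
from itertools import combinations
from math import dist


def dbscan(points, eps, min_pts):
    """Naive O(n^2) DBSCAN over distinct point tuples; returns clusters as sets (noise dropped)."""
    neighbors = {p: [q for q in points if dist(p, q) <= eps] for p in points}
    core = {p for p in points if len(neighbors[p]) >= min_pts}
    seen, clusters = set(), []
    for p in points:
        if p not in core or p in seen:
            continue
        cluster, stack = set(), [p]
        while stack:
            q = stack.pop()
            if q in cluster:
                continue
            cluster.add(q)
            if q in core:  # only core points expand the cluster
                stack.extend(neighbors[q])
        seen |= cluster
        clusters.append(cluster)
    return clusters


def partition_by_x(points, cut):
    """Hypothetical stand-in for G2P: one vertical cut at x = cut."""
    return [[p for p in points if p[0] < cut], [p for p in points if p[0] >= cut]]


def merge_candidates(local_clusters, eps):
    """Intermediate phase (simplified): pairs of clusters from different partitions within eps."""
    flat = [(pid, c) for pid, cs in enumerate(local_clusters) for c in cs]
    pairs = []
    for (i, (pi, a)), (j, (pj, b)) in combinations(enumerate(flat), 2):
        if pi != pj and any(dist(p, q) <= eps for p in a for q in b):
            pairs.append((i, j))
    return [c for _, c in flat], pairs


def merge(clusters, pairs):
    """Second phase (simplified): union-find over the candidate pairs."""
    parent = list(range(len(clusters)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in pairs:
        parent[find(i)] = find(j)
    merged = {}
    for i, c in enumerate(clusters):
        merged.setdefault(find(i), set()).update(c)
    return list(merged.values())


def distributed_dbscan(points, cut, eps, min_pts):
    parts = partition_by_x(points, cut)
    local = [dbscan(part, eps, min_pts) for part in parts]  # phase 1: per partition
    clusters, pairs = merge_candidates(local, eps)          # intermediate phase
    return merge(clusters, pairs)                           # phase 2: merge
```

In this sketch a cluster that straddles the cut is first found as two local clusters and then rejoined in the merge step. Points near a cut can lose neighbors to the other partition and so be misclassified locally; the sketch does not address that case.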

Item metadata

id	UFC_9a7d6d26818357a7a8be616834a33812
oai_identifier_str	oai:www.teses.ufc.br:10119
network_acronym_str	UFC
network_name_str	Biblioteca Digital de Teses e Dissertações da UFC
dc.title.en.fl_str_mv	G2P-DBSCAN: Data Partitioning Strategy and Distributed Processing of DBSCAN with MapReduce.
dc.title.alternative.pt.fl_str_mv	G2P-DBSCAN: Estratégia de Particionamento de Dados e de Processamento Distribuído do DBSCAN com MapReduce.
title	G2P-DBSCAN: Data Partitioning Strategy and Distributed Processing of DBSCAN with MapReduce.
author	Antônio Cavalcante Araújo Neto
author_facet	Antônio Cavalcante Araújo Neto
author_role	author
dc.contributor.advisor1.fl_str_mv	Javam de Castro Machado
dc.contributor.advisor1ID.fl_str_mv	19177526368
dc.contributor.advisor1Lattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.jsp?id=K4723088A5
dc.contributor.referee1.fl_str_mv	Altigran Soares da Silva
dc.contributor.referee1ID.fl_str_mv	24303925268
dc.contributor.referee1Lattes.fl_str_mv	http://lattes.cnpq.br/3405503472010994
dc.contributor.referee2.fl_str_mv	José Antônio Fernandes de Macedo
dc.contributor.referee2ID.fl_str_mv	00028098700
dc.contributor.referee2Lattes.fl_str_mv	http://buscatextual.cnpq.br/buscatextual/visualizacv.do?id=K4737328P5
dc.contributor.referee3.fl_str_mv	José Neuman de Souza
dc.contributor.referee3ID.fl_str_mv	09779604391
dc.contributor.referee3Lattes.fl_str_mv	http://lattes.cnpq.br/3614256141054800
dc.contributor.authorID.fl_str_mv	04237771300
dc.contributor.authorLattes.fl_str_mv	http://lattes.cnpq.br/6547909874085176
dc.contributor.author.fl_str_mv	Antônio Cavalcante Araújo Neto
contributor_str_mv	Javam de Castro Machado Altigran Soares da Silva José Antônio Fernandes de Macedo José Neuman de Souza
dc.subject.por.fl_str_mv	DBSCAN MapReduce Particionamento de dados Clusterização
topic	DBSCAN MapReduce Particionamento de dados Clusterização CIENCIA DA COMPUTACAO
dc.subject.cnpq.fl_str_mv	CIENCIA DA COMPUTACAO
dc.description.sponsorship.fl_txt_mv	Coordenação de Aperfeiçoamento de Pessoal de Nível Superior
publishDate	2015
dc.date.issued.fl_str_mv	2015-08-17
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
status_str	publishedVersion
format	masterThesis
dc.identifier.uri.fl_str_mv	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=15592
url	http://www.teses.ufc.br/tde_busca/arquivo.php?codArquivo=15592
dc.language.iso.fl_str_mv	por
language	por
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Universidade Federal do Ceará
dc.publisher.program.fl_str_mv	Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv	UFC
dc.publisher.country.fl_str_mv	BR
publisher.none.fl_str_mv	Universidade Federal do Ceará
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da UFC instname:Universidade Federal do Ceará instacron:UFC
reponame_str	Biblioteca Digital de Teses e Dissertações da UFC
collection	Biblioteca Digital de Teses e Dissertações da UFC
instname_str	Universidade Federal do Ceará
instacron_str	UFC
institution	UFC
repository.name.fl_str_mv	-
repository.mail.fl_str_mv	mail@mail.com
_version_	1643295214189674496
