Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets

Oliveira, Jadson Jose Monteiro

Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets

Detalhes bibliográficos
Autor(a) principal:	Oliveira, Jadson Jose Monteiro
Data de Publicação:	2020
Tipo de documento:	Dissertação
Idioma:	eng
Título da fonte:	Biblioteca Digital de Teses e Dissertações da USP
Texto Completo:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/
Resumo:	The volume and complexity of data generated in scientific and commercial applications have been growing exponentially in many areas. Nowadays, it is common the need for finding patterns in Terabytes or even Petabytes of complex data, such as image collections, climate measurements, fingerprints and large graphs extracted from the Web or from Social Networks. For example, how to analyze Terabytes of data from decades of frequent climate measurements comprised of dozens of climatic features, such as temperatures, rainfall and air humidity, so to identify patterns that precede extreme weather events for use in alert systems? A well-known fact in complex data analysis is that the search for patterns requires preprocessing by means of dimensionality reduction, due to a problem known as the curse of high-dimensionality. Nowadays, few techniques have been able to effectively reduce the dimensionality of such data in the scale of Terabytes or even Petabytes, which are referred to in this monograph as Big Data. In this context, massively parallel processing, linear scalability to the number of objects, and the ability to detect the most diverse types of correlations among the attributes are exceptionally desirable. This MSc work presents an in-depth study comparing two distinct approaches for dimensionality reduction in Big Data: ( a ) a standard approach based on data variance preservation, and; ( b ) an alternative, Fractal-based solution that is rarely explored, for which we propose a fast and scalable algorithm based on MapReduce and concepts from Resilient Distributed Datasets, using a new attribute-set-partitioning strategy that enables us to process datasets of high dimensionality. We evaluated both strategies by inserting into 11 real-world datasets, redundant attributes formed by correlations of various types, such as linear, quadratic, logarithmic and exponential, and verifying the ability of these approaches to detect such redundancies. The results indicate that, at least for large datasets with up to 1;000 attributes, our fractal-based technique is the best option. It removed redundant attributes in nearly all cases with high precision, as opposed to the standard variance-preservation approaches that presented considerably worse results even when applying the KPCA technique that is made to detect nonlinear correlations.

Metadados do item

id	USP_cb7dc2e688f83e4fa3984442b90076de
oai_identifier_str	oai:teses.usp.br:tde-20012021-125711
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
spelling	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed DatasetsRedução de Dimensionalidade Não-Supervisionada em Big Data utilizando Processamento Paralelo com MapReduce e Resilient Distributed DatasetsBig dataBig dataDescriptive data miningFractal theoryMineração de dados descritivaRedução de Dimensionalidade Não-SupervisionadaTeoria de fractaisUnsupervised dimensionality reductionThe volume and complexity of data generated in scientific and commercial applications have been growing exponentially in many areas. Nowadays, it is common the need for finding patterns in Terabytes or even Petabytes of complex data, such as image collections, climate measurements, fingerprints and large graphs extracted from the Web or from Social Networks. For example, how to analyze Terabytes of data from decades of frequent climate measurements comprised of dozens of climatic features, such as temperatures, rainfall and air humidity, so to identify patterns that precede extreme weather events for use in alert systems? A well-known fact in complex data analysis is that the search for patterns requires preprocessing by means of dimensionality reduction, due to a problem known as the curse of high-dimensionality. Nowadays, few techniques have been able to effectively reduce the dimensionality of such data in the scale of Terabytes or even Petabytes, which are referred to in this monograph as Big Data. In this context, massively parallel processing, linear scalability to the number of objects, and the ability to detect the most diverse types of correlations among the attributes are exceptionally desirable. This MSc work presents an in-depth study comparing two distinct approaches for dimensionality reduction in Big Data: ( a ) a standard approach based on data variance preservation, and; ( b ) an alternative, Fractal-based solution that is rarely explored, for which we propose a fast and scalable algorithm based on MapReduce and concepts from Resilient Distributed Datasets, using a new attribute-set-partitioning strategy that enables us to process datasets of high dimensionality. We evaluated both strategies by inserting into 11 real-world datasets, redundant attributes formed by correlations of various types, such as linear, quadratic, logarithmic and exponential, and verifying the ability of these approaches to detect such redundancies. The results indicate that, at least for large datasets with up to 1;000 attributes, our fractal-based technique is the best option. It removed redundant attributes in nearly all cases with high precision, as opposed to the standard variance-preservation approaches that presented considerably worse results even when applying the KPCA technique that is made to detect nonlinear correlations.O volume e a complexidade dos dados gerados em aplicações científicas e comerciais vêm crescendo exponencialmente em diversas áreas. Hoje, é comum a necessidade de encontrar padrões em Terabytes ou até mesmo em Petabytes de dados complexos, como em coleções de imagens, medições climáticas, impressões digitais e grandes grafos extraídos da Web ou de Redes Sociais. Por exemplo, como analisar Terabytes de dados oriundos de décadas de medições climáticas frequentes, compostos por dezenas de atributos climáticos como temperaturas, precipitação de chuva e umidade do ar, a fim de identificar padrões que antecedam eventos climáticos extremos para uso em sistemas de alerta? Um fato bem conhecido em análise de dados complexos é que a busca por padrões requer pré-processamento por redução de dimensionalidade, devido a um problema conhecido como maldição da alta dimensionalidade. Hoje, poucos trabalhos permitem reduzir, de forma eficaz, a dimensionalidade de tais dados em escala de Terabytes e Petabytes referenciados nesta monografia como Big Data visto que é extremamente desejável processamento paralelo em massa, escalabilidade linear em relação ao número de objetos, e capacidade para detectar os mais diversos tipos de correlações entre os atributos do conjunto de dados. Este trabalho de mestrado apresenta um estudo aprofundado, comparando duas abordagens distintas para redução de dimensionalidade em Big Data: ( a ) uma abordagem padrão, baseada na preservação da variância dos dados, e; ( b ) uma alternativa, baseada na Teoria de Fractais, que é raramente explorada na literatura. Para esta última nós propomos um algoritmo rápido e escalável baseado no modelo MapReduce e na estrutura de Resilient Distributed Datasets, utilizando uma nova estratégia de particionamento no conjunto de atributos que nos habilita a processar dados de alta dimensionalidade. Ambas as estratégias foram avaliadas a partir da inserção de atributos redundantes formados por correlações de diversos tipos, tais como linear, quadrática, logarítmica e exponencial, em 11 conjuntos de dados reais, e verificando a habilidade dessas abordagens em detectar tais redundâncias. Os resultados indicam que, pelo menos para grandes conjuntos de dados com dimensionalidade de até 1:000 atributos, nossa técnica baseada em fractais é a melhor opção, visto que ela removeu com alta precisão os atributos redundantes em quase todos os casos, ao contrário das abordagens baseadas em variância, mesmo quando utilizada a técnica KPCA que é feita para detectar correlações não lineares.Biblioteca Digitais de Teses e Dissertações da USPCordeiro, Robson Leonardo FerreiraOliveira, Jadson Jose Monteiro2020-10-30info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2021-01-20T18:06:02Zoai:teses.usp.br:tde-20012021-125711Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.bropendoar:27212021-01-20T18:06:02Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets Redução de Dimensionalidade Não-Supervisionada em Big Data utilizando Processamento Paralelo com MapReduce e Resilient Distributed Datasets
title	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
spellingShingle	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets Oliveira, Jadson Jose Monteiro Big data Big data Descriptive data mining Fractal theory Mineração de dados descritiva Redução de Dimensionalidade Não-Supervisionada Teoria de fractais Unsupervised dimensionality reduction
title_short	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
title_full	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
title_fullStr	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
title_full_unstemmed	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
title_sort	Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
author	Oliveira, Jadson Jose Monteiro
author_facet	Oliveira, Jadson Jose Monteiro
author_role	author
dc.contributor.none.fl_str_mv	Cordeiro, Robson Leonardo Ferreira
dc.contributor.author.fl_str_mv	Oliveira, Jadson Jose Monteiro
dc.subject.por.fl_str_mv	Big data Big data Descriptive data mining Fractal theory Mineração de dados descritiva Redução de Dimensionalidade Não-Supervisionada Teoria de fractais Unsupervised dimensionality reduction
topic	Big data Big data Descriptive data mining Fractal theory Mineração de dados descritiva Redução de Dimensionalidade Não-Supervisionada Teoria de fractais Unsupervised dimensionality reduction
description	The volume and complexity of data generated in scientific and commercial applications have been growing exponentially in many areas. Nowadays, it is common the need for finding patterns in Terabytes or even Petabytes of complex data, such as image collections, climate measurements, fingerprints and large graphs extracted from the Web or from Social Networks. For example, how to analyze Terabytes of data from decades of frequent climate measurements comprised of dozens of climatic features, such as temperatures, rainfall and air humidity, so to identify patterns that precede extreme weather events for use in alert systems? A well-known fact in complex data analysis is that the search for patterns requires preprocessing by means of dimensionality reduction, due to a problem known as the curse of high-dimensionality. Nowadays, few techniques have been able to effectively reduce the dimensionality of such data in the scale of Terabytes or even Petabytes, which are referred to in this monograph as Big Data. In this context, massively parallel processing, linear scalability to the number of objects, and the ability to detect the most diverse types of correlations among the attributes are exceptionally desirable. This MSc work presents an in-depth study comparing two distinct approaches for dimensionality reduction in Big Data: ( a ) a standard approach based on data variance preservation, and; ( b ) an alternative, Fractal-based solution that is rarely explored, for which we propose a fast and scalable algorithm based on MapReduce and concepts from Resilient Distributed Datasets, using a new attribute-set-partitioning strategy that enables us to process datasets of high dimensionality. We evaluated both strategies by inserting into 11 real-world datasets, redundant attributes formed by correlations of various types, such as linear, quadratic, logarithmic and exponential, and verifying the ability of these approaches to detect such redundancies. The results indicate that, at least for large datasets with up to 1;000 attributes, our fractal-based technique is the best option. It removed redundant attributes in nearly all cases with high precision, as opposed to the standard variance-preservation approaches that presented considerably worse results even when applying the KPCA technique that is made to detect nonlinear correlations.
publishDate	2020
dc.date.none.fl_str_mv	2020-10-30
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Liberar o conteúdo para acesso público.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.br
_version_	1815257142871982080

Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets

Registros relacionados