Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
Main author: | Oliveira, Jadson Jose Monteiro |
---|---|
Publication date: | 2020 |
Document type: | Master's thesis (Dissertação) |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/ |
Abstract: | The volume and complexity of data generated in scientific and commercial applications have been growing exponentially in many areas. Nowadays, it is common to need to find patterns in Terabytes or even Petabytes of complex data, such as image collections, climate measurements, fingerprints, and large graphs extracted from the Web or from social networks. For example, how can one analyze Terabytes of data from decades of frequent climate measurements, comprising dozens of climatic features such as temperature, rainfall, and air humidity, so as to identify patterns that precede extreme weather events for use in alert systems? A well-known fact in complex data analysis is that the search for patterns requires preprocessing by means of dimensionality reduction, due to a problem known as the curse of high dimensionality. Few techniques have been able to effectively reduce the dimensionality of such data at the scale of Terabytes or even Petabytes, which are referred to in this monograph as Big Data. In this context, massively parallel processing, linear scalability in the number of objects, and the ability to detect the most diverse types of correlations among attributes are exceptionally desirable. This MSc work presents an in-depth study comparing two distinct approaches for dimensionality reduction in Big Data: (a) a standard approach based on preserving data variance; and (b) a rarely explored alternative based on Fractal Theory, for which we propose a fast and scalable algorithm built on MapReduce and on concepts from Resilient Distributed Datasets, using a new attribute-set-partitioning strategy that enables us to process datasets of high dimensionality. We evaluated both strategies by inserting redundant attributes formed by correlations of various types, such as linear, quadratic, logarithmic, and exponential, into 11 real-world datasets, and verifying the ability of each approach to detect such redundancies. The results indicate that, at least for large datasets with up to 1,000 attributes, our fractal-based technique is the best option: it removed the redundant attributes with high precision in nearly all cases, as opposed to the standard variance-preservation approaches, which presented considerably worse results even when applying KPCA, a technique designed to detect nonlinear correlations. |
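The fractal-based approach summarized in the abstract relies on the fact that redundant attributes, being functions of other attributes, do not increase a dataset's intrinsic (fractal) dimension. As a rough single-machine illustration of that principle (not the distributed MapReduce algorithm proposed in the thesis), the correlation fractal dimension D2 can be estimated by box-counting. The NumPy sketch below is minimal; the function name and grid resolutions are our own choices:

```python
import numpy as np

def correlation_dimension(points, levels=range(2, 8)):
    """Estimate the correlation fractal dimension D2 by box-counting.

    For each grid of cell side r, S(r) = sum of squared cell occupancies;
    D2 is the slope of log S(r) versus log r. A minimal single-machine
    sketch of the idea, not the distributed algorithm from the thesis.
    """
    # Normalize every attribute to [0, 1) so a single grid covers the data.
    mins = points.min(axis=0)
    spans = np.ptp(points, axis=0)
    spans = np.where(spans > 0, spans, 1.0)
    norm = (points - mins) / spans * 0.999999

    log_r, log_s = [], []
    for level in levels:
        r = 1.0 / (2 ** level)                  # cell side at this resolution
        cell_ids = np.floor(norm / r).astype(np.int64)
        _, counts = np.unique(cell_ids, axis=0, return_counts=True)
        log_r.append(np.log(r))
        log_s.append(np.log((counts.astype(np.float64) ** 2).sum()))
    slope, _ = np.polyfit(log_r, log_s, 1)      # D2 = slope of the log-log plot
    return slope

# A 1-D manifold embedded in 2-D: the second attribute is redundant,
# so the estimated intrinsic dimension stays close to 1, not 2.
x = np.random.rand(100_000)
print(correlation_dimension(np.column_stack([x, 2.0 * x + 1.0])))  # ~1.0
```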
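The evaluation protocol described in the abstract, appending attributes that are linear, quadratic, logarithmic, or exponential functions of existing columns and then checking whether a technique flags them, can be sketched in the same spirit. This is a simplified, hypothetical generator; the coefficients and source-column choices used in the thesis are not specified here:

```python
import numpy as np

def inject_redundancy(data, rng=None):
    """Append four attributes that are deterministic functions of existing
    columns (linear, quadratic, logarithmic, exponential), mirroring the
    correlation types listed in the abstract. Simplified sketch only; the
    thesis's actual generators and coefficients may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    src = data[:, rng.integers(data.shape[1], size=4)]  # random source columns
    redundant = np.column_stack([
        3.0 * src[:, 0] - 5.0,                  # linear
        src[:, 1] ** 2,                         # quadratic
        np.log1p(np.abs(src[:, 2])),            # logarithmic (kept defined)
        np.exp(np.clip(src[:, 3], -10, 10)),    # exponential (clipped for safety)
    ])
    return np.hstack([data, redundant])

# A reduction technique "passes" if it flags exactly the four appended columns.
augmented = inject_redundancy(np.random.rand(1000, 8))
print(augmented.shape)  # (1000, 12)
```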
id |
USP_cb7dc2e688f83e4fa3984442b90076de |
oai_identifier_str |
oai:teses.usp.br:tde-20012021-125711 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
Unsupervised Dimensionality Reduction in Big Data via Massive Parallel Processing with MapReduce and Resilient Distributed Datasets
Redução de Dimensionalidade Não-Supervisionada em Big Data utilizando Processamento Paralelo com MapReduce e Resilient Distributed Datasets |
author |
Oliveira, Jadson Jose Monteiro |
author_role |
author |
dc.contributor.none.fl_str_mv |
Cordeiro, Robson Leonardo Ferreira |
dc.contributor.author.fl_str_mv |
Oliveira, Jadson Jose Monteiro |
dc.subject.por.fl_str_mv |
Big data; Descriptive data mining; Fractal theory; Unsupervised dimensionality reduction |
publishDate |
2020 |
dc.date.none.fl_str_mv |
2020-10-30 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-20012021-125711/ |
dc.language.iso.fl_str_mv |
eng |
dc.rights.driver.fl_str_mv |
Release the content for public access.
info:eu-repo/semantics/openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame: Biblioteca Digital de Teses e Dissertações da USP
instname: Universidade de São Paulo (USP)
instacron: USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br || atendimento@aguia.usp.br |
_version_ |
1815257142871982080 |