Enriching data analytics with incremental data cleaning and attribute domain management

Bibliographic details
Main author: Oliveira, Paulo Henrique de
Publication date: 2021
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/
Abstract: In the present Big Data era, many businesses have become more data-driven, seeking to improve their decision-making processes based on solid Data Analytics practices. Several steps constitute the Data Analytics pipeline, and all of them involve specific approaches and technologies that are constantly evolving. To accommodate new needs and trends, there is always room for improvement in these steps. In this context, this PhD research has focused on improving two of them: (i) data cleaning and (ii) data analysis. Regarding the first step, we addressed the problem of performing data cleaning incrementally, considering dynamic scenarios with incoming data batches, as well as holistically, that is, jointly taking into account multiple error detection criteria. As a result, we have developed an incremental data cleaning framework that significantly outperforms competitors, achieving higher efficiency while compromising little on repair quality, and that addresses the problem in an innovative way, hence filling a gap in the literature. Regarding the second step, we addressed the problem of handling queries over an Attribute Domain, which consists of the set of stored values within a domain of attributes, usually across multiple relations. As a result, we have proposed three contributions: (a) the Domain Index, an access method for efficiently performing queries over Attribute Domains, which we refer to as Domain Queries; (b) a comprehensive case study of Domain Indexes applied to the medical domain, focusing on content-based Domain Queries for supporting physicians in decision-making; and (c) an approach for supporting Attribute Domains as first-class citizens in a Relational Database Management System (RDBMS). Together, those contributions target a distinct category of queries which, until this PhD research was carried out, had not been addressed elsewhere in the literature. Experimental results highlight the superior performance enabled by the Domain Index compared to existing techniques of modern RDBMSs, which are not only inefficient in several scenarios but also not always applicable. Ultimately, those contributions enrich downstream data analyses. Hence, this PhD research advances the state of the art in the field of Data Analytics and opens several directions for future work.
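To make the notion of an Attribute Domain more concrete, the following is a minimal Python sketch, not the thesis's Domain Index. It assumes two hypothetical relations (`patient`, `hospital`) whose attributes `home_city` and `city` draw values from one shared domain, and it uses a plain in-memory dictionary as a stand-in access structure. All names, columns, and data here are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical toy relations; names and columns are illustrative assumptions.
relations = {
    "patient": [
        {"id": 1, "home_city": "Sao Carlos"},
        {"id": 2, "home_city": "Campinas"},
    ],
    "hospital": [
        {"id": 10, "city": "Sao Carlos"},
        {"id": 11, "city": "Ribeirao Preto"},
    ],
}

# The attribute domain: attributes, possibly in different relations,
# that store values of the same kind (here, city names).
city_domain = [("patient", "home_city"), ("hospital", "city")]


def build_domain_map(relations, domain):
    """Map each stored value to the (relation, attribute, row id) places holding it."""
    value_map = defaultdict(list)
    for rel_name, attr in domain:
        for row in relations[rel_name]:
            value_map[row[attr]].append((rel_name, attr, row["id"]))
    return value_map


def domain_values(value_map):
    """Return the set of distinct values stored anywhere in the domain."""
    return set(value_map)


def domain_lookup(value_map, value):
    """Return every occurrence of `value` across all relations of the domain."""
    return list(value_map.get(value, []))


city_map = build_domain_map(relations, city_domain)
print(sorted(domain_values(city_map)))
print(domain_lookup(city_map, "Sao Carlos"))
```

A single lookup on the shared map answers a question that would otherwise need one query per relation and attribute. The access method proposed in the thesis is described in the full text; this dictionary only illustrates the kind of query being targeted.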
id USP_5861b0aee484fab4641606e125e73c19
oai_identifier_str oai:teses.usp.br:tde-16072021-120503
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Enriching data analytics with incremental data cleaning and attribute domain management
Enriquecendo a análise de dados com limpeza incremental dos dados e gerenciamento dos domínios de atributos
title Enriching data analytics with incremental data cleaning and attribute domain management
author Oliveira, Paulo Henrique de
author_facet Oliveira, Paulo Henrique de
author_role author
dc.contributor.none.fl_str_mv Traina Junior, Caetano
dc.contributor.author.fl_str_mv Oliveira, Paulo Henrique de
dc.subject.por.fl_str_mv Análise de dados
Attribute domain
Consulta de domínio
Data analytics
Data quality
Domain index
Domain query
Domínio de atributos
Incremental data cleaning
Índice de domínio
Limpeza de dados Incremental
Qualidade de dados
publishDate 2021
dc.date.none.fl_str_mv 2021-04-30
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090773239463936