Enriching data analytics with incremental data cleaning and attribute domain management
Main author: | Oliveira, Paulo Henrique de |
---|---|
Publication date: | 2021 |
Document type: | Thesis |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/ |
Abstract: | In the current Big Data era, many businesses have become more data-driven, seeking to improve their decision-making through solid Data Analytics practices. The Data Analytics pipeline comprises several steps, each involving specific approaches and technologies that are constantly evolving, so there is always room to improve those steps as new needs and trends emerge. This PhD research focused on improving two of them: (i) data cleaning and (ii) data analysis. For the first step, we addressed the problem of performing data cleaning incrementally, considering dynamic scenarios with incoming data batches, and holistically, that is, jointly taking multiple error detection criteria into account. As a result, we developed an incremental data cleaning framework that significantly outperforms competitors, achieving higher efficiency while compromising little on repair quality, and that addresses the problem in an innovative way, filling a gap in the literature. For the second step, we addressed the problem of handling queries over an Attribute Domain, which consists of the set of stored values within a domain of attributes, usually across multiple relations. As a result, we proposed three contributions: (a) the Domain Index, an access method for efficiently performing queries over Attribute Domains, which we refer to as Domain Queries; (b) a comprehensive case study of Domain Indexes applied to the medical domain, focusing on content-based Domain Queries that support physicians in decision-making; and (c) an approach for supporting Attribute Domains as first-class citizens in a Relational Database Management System (RDBMS). Together, these contributions target a distinct category of queries that, until this PhD research, had not been addressed in the literature.
Experimental results highlight the superior performance enabled by the Domain Index compared with the existing techniques of modern RDBMSs, which are inefficient in several scenarios and not always applicable. Ultimately, these contributions enrich subsequent data analyses. Hence, this PhD research advances the state of the art in Data Analytics and opens several directions for future work. |
id |
USP_5861b0aee484fab4641606e125e73c19 |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-16072021-120503 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
Enriching data analytics with incremental data cleaning and attribute domain management; Enriquecendo a análise de dados com limpeza incremental dos dados e gerenciamento dos domínios de atributos |
title |
Enriching data analytics with incremental data cleaning and attribute domain management |
author |
Oliveira, Paulo Henrique de |
author_facet |
Oliveira, Paulo Henrique de |
author_role |
author |
dc.contributor.none.fl_str_mv |
Traina Junior, Caetano |
dc.contributor.author.fl_str_mv |
Oliveira, Paulo Henrique de |
dc.subject.por.fl_str_mv |
Análise de dados; Attribute domain; Consulta de domínio; Data analytics; Data quality; Domain index; Domain query; Domínio de atributos; Incremental data cleaning; Índice de domínio; Limpeza de dados Incremental; Qualidade de dados |
publishDate |
2021 |
dc.date.none.fl_str_mv |
2021-04-30 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/ |
url |
https://www.teses.usp.br/teses/disponiveis/55/55134/tde-16072021-120503/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP; instname:Universidade de São Paulo (USP); instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
_version_ |
1809090773239463936 |