Robust outlier labeling rules for light-tailed and heavy-tailed Data
Main author: | Silva, Kelly Cristina Ramos da |
---|---|
Publication date: | 2019 |
Document type: | Thesis |
Language: | eng |
Source title: | Biblioteca Digital de Teses e Dissertações da USP |
Full text: | http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042019-145141/ |
Abstract: | Outlier rules are used to detect outliers in univariate data. A commonly used outlier rule is based on the boxplot, a graphical tool for univariate data analysis. However, it is well known that the boxplot performs significantly worse for skewed distributions than for symmetric ones. To overcome this deficiency, an outlier rule known as the adjusted boxplot has been proposed in the literature. The adjusted boxplot modifies the classical boxplot by incorporating a skewness measure into it. Although this modification has resulted in a state-of-the-art version of the classical boxplot, its drawback is that the resulting rule is not flexible enough to make it easy to pre-specify a nominal outside rate. Furthermore, in some situations the adjusted boxplot can have a significantly higher computational cost than the classical boxplot, since its computational complexity is O(n log n), whereas that of the classical boxplot is O(n). To address these issues, this thesis proposes a more formal approach to deriving outlier rules, which produces rules that exhibit better overall performance than the adjusted boxplot, especially as the contamination level increases. Moreover, the proposed rules are more flexible and have a lower computational cost than the adjusted boxplot. It is also shown that the classical boxplot and many of its modifications or variations are unified by a single concept introduced in this thesis: quartile contrast. The problem with the outlier rules based on quartile contrast, as well as with the adjusted boxplot, is that they are better suited to light-tailed data than to heavy-tailed data. For heavy-tailed data, an outlier rule known as the generalized boxplot has been proposed in the literature. The main problem with the generalized boxplot is that it is very unstable, since a single outlier can dramatically affect its performance. To address this issue, the thesis uses the quartile contrast approach to derive an outlier rule that is sensitive to tail heaviness. The experimental analysis shows that the proposed tail-heaviness-sensitive outlier rule indeed performs more stably than the generalized boxplot. The performance evaluation of outlier rules is a problem in its own right. Therefore, to measure the performance of outlier rules, the thesis introduces the GME, a measure that has proved more effective for assessing outlier rules than traditional measures involving only the false positive rate and the false negative rate. |
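
As a rough illustration of the classical boxplot rule mentioned in the abstract, the sketch below labels a point as an outlier when it falls outside the conventional fences Q1 - k*IQR and Q3 + k*IQR, with k = 1.5. The function names, the sample data, and the choice of quartile estimator are assumptions made for this example and are not taken from the thesis. This version also sorts the data internally, so it does not reach the O(n) cost the abstract attributes to the classical boxplot; that would presumably require a selection-based quartile computation.

```python
import statistics


def boxplot_fences(values, k=1.5):
    """Return the (lower, upper) fences of the classical boxplot rule.

    The quartile estimator (statistics.quantiles, inclusive method) is an
    assumption of this sketch; other estimators give slightly different fences.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - k * iqr, q3 + k * iqr


def label_outliers(values, k=1.5):
    """Return one boolean per observation: True if it lies outside the fences."""
    lower, upper = boxplot_fences(values, k)
    return [x < lower or x > upper for x in values]


def outside_rate(values, k=1.5):
    """Fraction of observations labeled as outliers by the rule."""
    flags = label_outliers(values, k)
    return sum(flags) / len(flags)


# Hypothetical sample: nine similar readings and one far-off value.
sample = [2.1, 2.4, 2.5, 2.7, 2.9, 3.0, 3.2, 3.3, 3.6, 9.8]
flags = label_outliers(sample)
print([x for x, is_out in zip(sample, flags) if is_out])
print(outside_rate(sample))
```

On this sample only the value 9.8 falls outside the fences. Changing `k` moves both fences symmetrically, which is one reason a purely IQR-based rule of this kind takes no account of skewness.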
id |
USP_3c00cfbd4f91545ab7ab602df688bc0e |
---|---|
oai_identifier_str |
oai:teses.usp.br:tde-29042019-145141 |
network_acronym_str |
USP |
network_name_str |
Biblioteca Digital de Teses e Dissertações da USP |
repository_id_str |
2721 |
dc.title.none.fl_str_mv |
Robust outlier labeling rules for light-tailed and heavy-tailed Data Regras robustas para rotular outliers em dados de caudas leves e caudas pesadas. |
title |
Robust outlier labeling rules for light-tailed and heavy-tailed Data |
author |
Silva, Kelly Cristina Ramos da |
author_role |
author |
dc.contributor.none.fl_str_mv |
Carvalho, André Carlos Ponce de Leon Ferreira de |
dc.subject.por.fl_str_mv |
Assimetria ou peso da cauda; Erro de rotulação; Evaluation measure; Medida de avaliação; Métodos robustos; Outlier rules; Outside rate; Regras robustas; Robust methods; Skewness or tail heaviness |
publishDate |
2019 |
dc.date.none.fl_str_mv |
2019-02-01 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042019-145141/ |
url |
http://www.teses.usp.br/teses/disponiveis/55/55134/tde-29042019-145141/ |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
Release the content for public access. info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Release the content for public access. |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
publisher.none.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP |
dc.source.none.fl_str_mv |
reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP |
instname_str |
Universidade de São Paulo (USP) |
instacron_str |
USP |
institution |
USP |
reponame_str |
Biblioteca Digital de Teses e Dissertações da USP |
collection |
Biblioteca Digital de Teses e Dissertações da USP |
repository.name.fl_str_mv |
Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP) |
repository.mail.fl_str_mv |
virginia@if.usp.br; atendimento@aguia.usp.br |
_version_ |
1815257442700754944 |