Evaluation and model selection for unsupervised outlier detection and one-class classification

Bibliographic details
Main author: Marques, Henrique Oliveira
Publication date: 2019
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-07012020-105601/
Abstract: Outlier detection (or anomaly detection) plays an important role in discovering patterns in data that can be considered exceptional in some sense. An important distinction is between supervised, semi-supervised, and unsupervised techniques. In this work, we focus on semi-supervised and unsupervised techniques. It has been shown that unsupervised outlier detection techniques can be adapted to work in the semi-supervised setting as well. Therefore, we conduct a comparative study between semi-supervised techniques and unsupervised techniques adapted to the semi-supervised context. The main focus of this work, however, is on the unsupervised evaluation of outlier detection. Although there is a large and growing literature that tackles the outlier detection problem, the unsupervised evaluation of outlier detection results is still virtually untouched, especially in the context of unsupervised detection. The so-called internal evaluation, based solely on the data and the assessed solutions themselves, is required if one wants to statistically validate (in absolute terms) or just compare (in relative terms) the solutions provided by different algorithms, or by different parameterizations of a given algorithm, in the absence of labeled data. However, in contrast to cluster analysis, where indexes for internal evaluation and validation of clustering solutions have been conceived and shown to be very useful, this problem has been notably overlooked in the outlier detection domain. Here we discuss this problem and provide solutions for the internal evaluation of outlier detection results. In the scenario of semi-supervised detection, we propose a (relative) internal evaluation measure based on data perturbation and compare it with the main measures in the literature, giving the reader clear recommendations on the scenario best suited to each one. In the scenario of unsupervised detection, the pioneering measure for internal evaluation of binary outlier solutions, proposed by the author of this thesis in his master's work, is extended to the more general scenario of non-binary outlier solutions, that is, the evaluation of outlier detection scorings, the type of result produced by most widely used database-oriented algorithms in the literature. We extensively evaluate both measures in several experiments involving different collections of synthetic and real datasets collected from public repositories.
id USP_1df5ba71e718c4730105ebe2cba37cf6
oai_identifier_str oai:teses.usp.br:tde-07012020-105601
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Evaluation and model selection for unsupervised outlier detection and one-class classification
Avaliação e seleção de modelos em detecção não supervisionada de outliers e classificação de classe única
title Evaluation and model selection for unsupervised outlier detection and one-class classification
author Marques, Henrique Oliveira
author_facet Marques, Henrique Oliveira
author_role author
dc.contributor.none.fl_str_mv Campello, Ricardo José Gabrielli Barreto
dc.contributor.author.fl_str_mv Marques, Henrique Oliveira
dc.subject.por.fl_str_mv Internal evaluation
Model selection
Outlier detection
Semi-supervised learning
Unsupervised learning
publishDate 2019
dc.date.none.fl_str_mv 2019-11-27
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-07012020-105601/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-07012020-105601/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090467918249984