Detecting outliers and annotating their types with indexing structures

Silva, Guilherme Domingos Faria

Detecting outliers and annotating their types with indexing structures

Detalhes bibliográficos
Autor(a) principal:	Silva, Guilherme Domingos Faria
Data de Publicação:	2022
Tipo de documento:	Dissertação
Idioma:	eng
Título da fonte:	Biblioteca Digital de Teses e Dissertações da USP
Texto Completo:	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-15092022-141353/
Resumo:	The constant increase in the amount of data available on the internet is accentuated with the popularization of technologies such as 5G and Internet of Things. In datasets of large volume there is usually a strong presence of outliers that are not detected or that are just discarded. The outlier detection literature demonstrates that the investigation of these singular instances can provide new insights into the behavior of systems and people. This inspection allows diseases to be identified early, financial market trends to be better interpreted and cybersecurity attacks to be prevented. However, outlier detection techniques carry limitations, being: (1) dependent on the availability of the instances features, which can generate privacy issues; (2) poorly scalable and; (3) capable of providing only a binary separation that allows detecting outliers, but not classifying them so that they are better understood. Starting from an unlabeled dataset for which only the distances between the instances are available, how to detect outliers and categorize them by type efficiently? In the vast literature on outlier detection, there is no work, as far as we know, that deals with the problem of annotating outliers. Outliers can be classified into three large groups: (a) global outliers, instances that are severely different from others in the dataset, such as errors during insertion of information into a database; (b) local outliers, instances that, despite being similar to the others in the dataset as a whole, have minimal variations that make them different in a smaller scope, for instance, a football player who makes many mistakes while playing in a strong team and; (c) collective outliers, small groups of instances that are, simultaneously, quite different from the rest, such as a denial-of-service cyberattack, with few machines having similar harmful behavior. In this project we introduce C-ALLOUT: a new method for detecting outliers that is also able to categorize them by type. C-ALLOUT is able to maintain itself in terms of equality, or even superiority, when compared to state-of-the-art algorithms, still contributing with the annotation of outliers, a task that competitors are not able to perform. C-ALLOUT is based on Slim-tree, an indexing structure that makes it scalable, with O(nlogn) complexity of time and space. Our proposed method deals with both scenarios: having the features available or limited to distances only. Finally, C-ALLOUT works without depending on any interaction with the user, being parameter-free by default, the ideal for unsupervised tasks like outlier analysis.

Metadados do item

id	USP_d4a0352de0732b7caead8b5b5a0298dd
oai_identifier_str	oai:teses.usp.br:tde-15092022-141353
network_acronym_str	USP
network_name_str	Biblioteca Digital de Teses e Dissertações da USP
repository_id_str	2721
spelling	Detecting outliers and annotating their types with indexing structuresDetectando anomalias e anotando seus tipos com estruturas de indexaçãoAnomaliesAnotação de anomaliasCasos de exceçãoDetecção de anomaliasEstruturas de indexaçãoIndexing structuresOutlier annotationOutlier detectionSlim-treeSlim-treeThe constant increase in the amount of data available on the internet is accentuated with the popularization of technologies such as 5G and Internet of Things. In datasets of large volume there is usually a strong presence of outliers that are not detected or that are just discarded. The outlier detection literature demonstrates that the investigation of these singular instances can provide new insights into the behavior of systems and people. This inspection allows diseases to be identified early, financial market trends to be better interpreted and cybersecurity attacks to be prevented. However, outlier detection techniques carry limitations, being: (1) dependent on the availability of the instances features, which can generate privacy issues; (2) poorly scalable and; (3) capable of providing only a binary separation that allows detecting outliers, but not classifying them so that they are better understood. Starting from an unlabeled dataset for which only the distances between the instances are available, how to detect outliers and categorize them by type efficiently? In the vast literature on outlier detection, there is no work, as far as we know, that deals with the problem of annotating outliers. Outliers can be classified into three large groups: (a) global outliers, instances that are severely different from others in the dataset, such as errors during insertion of information into a database; (b) local outliers, instances that, despite being similar to the others in the dataset as a whole, have minimal variations that make them different in a smaller scope, for instance, a football player who makes many mistakes while playing in a strong team and; (c) collective outliers, small groups of instances that are, simultaneously, quite different from the rest, such as a denial-of-service cyberattack, with few machines having similar harmful behavior. In this project we introduce C-ALLOUT: a new method for detecting outliers that is also able to categorize them by type. C-ALLOUT is able to maintain itself in terms of equality, or even superiority, when compared to state-of-the-art algorithms, still contributing with the annotation of outliers, a task that competitors are not able to perform. C-ALLOUT is based on Slim-tree, an indexing structure that makes it scalable, with O(nlogn) complexity of time and space. Our proposed method deals with both scenarios: having the features available or limited to distances only. Finally, C-ALLOUT works without depending on any interaction with the user, being parameter-free by default, the ideal for unsupervised tasks like outlier analysis.O aumento na quantidade de dados disponíveis na internet se acentua com a popularização de tecnologias como 5G e Internet das Coisas. Em grandes bases de dados costuma haver forte presença de anomalias não detectadas ou apenas descartadas. A literatura de detecção de anomalias demonstra que a investigação dessas instâncias singulares pode fornecer novas perspectivas sobre os comportamentos de sistemas e pessoas. Essa inspeção permite que doenças sejam identificadas precocemente, tendências do mercado financeiro sejam melhor interpretadas e que ataques de segurança digital sejam impedidos. Contudo, as técnicas de detecção de anomalias carregam limitações, sendo: (1) dependentes da disponibilidade dos atributos das instâncias, o que pode gerar problemas de privacidade; (2) pouco escaláveis e; (3) capazes de prover apenas uma separação binária que permite detectar anomalias, mas não classificá-las para que sejam melhor entendidas. Partindo de um conjunto de dados não rotulados para o qual apenas distâncias entre as instâncias estão disponíveis, como detectar anomalias e categorizá-las por tipo de forma eficiente? Na vasta literatura de detecção de anomalias não há trabalhos, até onde sabemos, que lidam com o problema de anotação de anomalias. Anomalias podem ser classificadas em três grandes grupos: (a) anomalias globais, instâncias severamente diferentes das outras, como erros de inserção de informações em uma base de dados; (b) anomalias locais, instâncias parecidas com algumas outras da base de dados como um todo, mas com variações mínimas que as tornam diferentes em um escopo menor, como um jogador de futebol que acerta poucas jogadas jogando em um time forte e; (c) anomalias coletivas, pequenos grupos de instâncias que são, em conjunto, bastante diferentes das restantes, como um ataque cibernético por negação de serviço, com poucas máquinas tendo um comportamento nocivo semelhante. Neste projeto é apresentado o C-ALLOUT: um novo método para detecção e anotação de anomalias. C-ALLOUT é capaz de se manter em nível de igualdade, ou até superioridade, quando comparado aos algoritmos do estado da arte, ainda contribuindo com a anotação das anomalias, uma tarefa que os competidores não são capazes de realizar. C-ALLOUT toma proveito da Slim-tree, uma estrutura de indexação que o torna escalável, atingindo complexidade de tempo e espaço O(nlogn). O método funciona tendo acesso aos atributos das instâncias ou limitado a distâncias. Por fim, C-ALLOUT não depende de nenhuma interação com o usuário, sendo livre de parâmetros por padrão, o ideal para tarefas não-supervisionadas como análise de anomalias.Biblioteca Digitais de Teses e Dissertações da USPCordeiro, Robson Leonardo FerreiraSilva, Guilherme Domingos Faria2022-07-15info:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisapplication/pdfhttps://www.teses.usp.br/teses/disponiveis/55/55134/tde-15092022-141353/reponame:Biblioteca Digital de Teses e Dissertações da USPinstname:Universidade de São Paulo (USP)instacron:USPLiberar o conteúdo para acesso público.info:eu-repo/semantics/openAccesseng2022-09-15T17:20:00Zoai:teses.usp.br:tde-15092022-141353Biblioteca Digital de Teses e Dissertaçõeshttp://www.teses.usp.br/PUBhttp://www.teses.usp.br/cgi-bin/mtd2br.plvirginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.bropendoar:27212022-09-15T17:20Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)false
dc.title.none.fl_str_mv	Detecting outliers and annotating their types with indexing structures Detectando anomalias e anotando seus tipos com estruturas de indexação
title	Detecting outliers and annotating their types with indexing structures
spellingShingle	Detecting outliers and annotating their types with indexing structures Silva, Guilherme Domingos Faria Anomalies Anotação de anomalias Casos de exceção Detecção de anomalias Estruturas de indexação Indexing structures Outlier annotation Outlier detection Slim-tree Slim-tree
title_short	Detecting outliers and annotating their types with indexing structures
title_full	Detecting outliers and annotating their types with indexing structures
title_fullStr	Detecting outliers and annotating their types with indexing structures
title_full_unstemmed	Detecting outliers and annotating their types with indexing structures
title_sort	Detecting outliers and annotating their types with indexing structures
author	Silva, Guilherme Domingos Faria
author_facet	Silva, Guilherme Domingos Faria
author_role	author
dc.contributor.none.fl_str_mv	Cordeiro, Robson Leonardo Ferreira
dc.contributor.author.fl_str_mv	Silva, Guilherme Domingos Faria
dc.subject.por.fl_str_mv	Anomalies Anotação de anomalias Casos de exceção Detecção de anomalias Estruturas de indexação Indexing structures Outlier annotation Outlier detection Slim-tree Slim-tree
topic	Anomalies Anotação de anomalias Casos de exceção Detecção de anomalias Estruturas de indexação Indexing structures Outlier annotation Outlier detection Slim-tree Slim-tree
description	The constant increase in the amount of data available on the internet is accentuated with the popularization of technologies such as 5G and Internet of Things. In datasets of large volume there is usually a strong presence of outliers that are not detected or that are just discarded. The outlier detection literature demonstrates that the investigation of these singular instances can provide new insights into the behavior of systems and people. This inspection allows diseases to be identified early, financial market trends to be better interpreted and cybersecurity attacks to be prevented. However, outlier detection techniques carry limitations, being: (1) dependent on the availability of the instances features, which can generate privacy issues; (2) poorly scalable and; (3) capable of providing only a binary separation that allows detecting outliers, but not classifying them so that they are better understood. Starting from an unlabeled dataset for which only the distances between the instances are available, how to detect outliers and categorize them by type efficiently? In the vast literature on outlier detection, there is no work, as far as we know, that deals with the problem of annotating outliers. Outliers can be classified into three large groups: (a) global outliers, instances that are severely different from others in the dataset, such as errors during insertion of information into a database; (b) local outliers, instances that, despite being similar to the others in the dataset as a whole, have minimal variations that make them different in a smaller scope, for instance, a football player who makes many mistakes while playing in a strong team and; (c) collective outliers, small groups of instances that are, simultaneously, quite different from the rest, such as a denial-of-service cyberattack, with few machines having similar harmful behavior. In this project we introduce C-ALLOUT: a new method for detecting outliers that is also able to categorize them by type. C-ALLOUT is able to maintain itself in terms of equality, or even superiority, when compared to state-of-the-art algorithms, still contributing with the annotation of outliers, a task that competitors are not able to perform. C-ALLOUT is based on Slim-tree, an indexing structure that makes it scalable, with O(nlogn) complexity of time and space. Our proposed method deals with both scenarios: having the features available or limited to distances only. Finally, C-ALLOUT works without depending on any interaction with the user, being parameter-free by default, the ideal for unsupervised tasks like outlier analysis.
publishDate	2022
dc.date.none.fl_str_mv	2022-07-15
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-15092022-141353/
url	https://www.teses.usp.br/teses/disponiveis/55/55134/tde-15092022-141353/
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv	Liberar o conteúdo para acesso público. info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Liberar o conteúdo para acesso público.
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
publisher.none.fl_str_mv	Biblioteca Digitais de Teses e Dissertações da USP
dc.source.none.fl_str_mv	reponame:Biblioteca Digital de Teses e Dissertações da USP instname:Universidade de São Paulo (USP) instacron:USP
instname_str	Universidade de São Paulo (USP)
instacron_str	USP
institution	USP
reponame_str	Biblioteca Digital de Teses e Dissertações da USP
collection	Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv	Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv	virginia@if.usp.br\|\| atendimento@aguia.usp.br\|\|virginia@if.usp.br
_version_	1815257031474413568

Detecting outliers and annotating their types with indexing structures

Registros relacionados