Outlier detection for improved clustering : empirical research for unsupervised data mining

Bibliographic details
Main author: Madsen, Jacob Hastrup
Publication date: 2018
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10362/34464
Summary: Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
id RCAP_1b8bae2f544ef84986e3e0e358191534
oai_identifier_str oai:run.unl.pt:10362/34464
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
abstract Many clustering algorithms are sensitive to noise, which disturbs the results when trying to identify and characterize clusters in data. Because clustering is multidimensional in nature, outlier detection is a complex task, as statistical approaches are not adequate. In this research work, we contend that, for clustering, outliers should be perceived as observations with deviating characteristics that worsen the ratio of intra-cluster to inter-cluster distance. We present a research question that deals with improving clustering results, specifically for the two clustering algorithms k-means and hierarchical clustering, by means of outlier detection. To improve clustering results, we review and discuss the outlier detection literature, and apply 11 algorithms and 2 statistical tests to treat data prior to clustering. Six evaluation metrics are used to assess the clustering results, one of which is introduced in this study. Using real-world datasets, we demonstrate that outlier detection does improve clustering results with respect to clustering objectives, but only to the extent that the data allows it. That is, if the data contains 'real' clusters and actual outliers, proper use of outlier algorithms improves clustering significantly. Advantages and disadvantages of the outlier algorithms when dealing with different types of data are discussed, along with the different properties of the evaluation metrics that describe how well clustering objectives are fulfilled.
Finally, it is demonstrated that the main challenge for users in improving clustering results through outlier detection is the lack of tools to understand data structures prior to clustering. Future research should focus on tools such as dimension reduction, to help users avoid applying every tool in the toolbox.
dc.title.none.fl_str_mv Outlier detection for improved clustering : empirical research for unsupervised data mining
author Madsen, Jacob Hastrup
author_facet Madsen, Jacob Hastrup
author_role author
dc.contributor.none.fl_str_mv Costa, Ana Cristina Marinho da
RUN
dc.contributor.author.fl_str_mv Madsen, Jacob Hastrup
dc.subject.por.fl_str_mv Outlier Detection
Unsupervised Learning
Clustering
Data Mining
publishDate 2018
dc.date.none.fl_str_mv 2018-04-13T15:52:25Z
2018-04-12
2018-04-12T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/34464
TID:201898608
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP