Outlier detection for improved clustering : empirical research for unsupervised data mining
Main author: | Madsen, Jacob Hastrup |
---|---|
Publication date: | 2018 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/34464 |
Summary: | Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence |
id |
RCAP_1b8bae2f544ef84986e3e0e358191534 |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/34464 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
spelling |
Outlier detection for improved clustering : empirical research for unsupervised data mining
Outlier Detection; Unsupervised Learning; Clustering; Data Mining
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
Many clustering algorithms are sensitive to noise, which disturbs the results when trying to identify and characterize clusters in data. Due to the multidimensional nature of clustering, outlier detection is a complex task, as statistical approaches are not adequate. In this research work, we contend that for clustering, outliers should be perceived as observations with deviating characteristics that worsen the ratio of intra-cluster to inter-cluster distance. We present a research question that deals with improving clustering results, specifically for the two clustering algorithms k-means and hierarchical clustering, by means of outlier detection. To improve clustering results, we identify and discuss the literature on outlier detection, and apply 11 algorithms and 2 statistical tests to the process of treating data prior to clustering. To evaluate the results of the applied clustering, six evaluation metrics are used, one of which is introduced in this study. Using real-world datasets, we demonstrate that outlier detection does improve clustering results with respect to clustering objectives, but only to the extent that the data allows it. That is, if the data contains 'real' clusters and actual outliers, proper use of outlier algorithms improves clustering significantly. Advantages and disadvantages of outlier algorithms, when dealing with different types of data, are discussed along with the different properties of the evaluation metrics describing the fulfillment of clustering objectives.
Finally, it is demonstrated that the main challenge of improving clustering results for users, with regard to outlier detection, is the lack of tools to understand data structures prior to clustering. Future research is encouraged on tools such as dimension reduction, to help users avoid applying every tool in the toolbox.
Costa, Ana Cristina Marinho da; RUN; Madsen, Jacob Hastrup; 2018-04-13T15:52:25Z; 2018-04-12; 2018-04-12T00:00:00Z; info:eu-repo/semantics/publishedVersion; info:eu-repo/semantics/masterThesis; application/pdf; http://hdl.handle.net/10362/34464; TID:201898608; eng; info:eu-repo/semantics/openAccess; reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP; 2024-03-11T04:18:54Z; oai:run.unl.pt:10362/34464; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; opendoar:7160; 2024-03-20T03:30:11.072060; Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; false |
dc.title.none.fl_str_mv |
Outlier detection for improved clustering : empirical research for unsupervised data mining |
title |
Outlier detection for improved clustering : empirical research for unsupervised data mining |
spellingShingle |
Outlier detection for improved clustering : empirical research for unsupervised data mining; Madsen, Jacob Hastrup; Outlier Detection; Unsupervised Learning; Clustering; Data Mining |
author |
Madsen, Jacob Hastrup |
author_facet |
Madsen, Jacob Hastrup |
author_role |
author |
dc.contributor.none.fl_str_mv |
Costa, Ana Cristina Marinho da; RUN |
dc.contributor.author.fl_str_mv |
Madsen, Jacob Hastrup |
dc.subject.por.fl_str_mv |
Outlier Detection; Unsupervised Learning; Clustering; Data Mining |
topic |
Outlier Detection; Unsupervised Learning; Clustering; Data Mining |
description |
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence |
publishDate |
2018 |
dc.date.none.fl_str_mv |
2018-04-13T15:52:25Z; 2018-04-12; 2018-04-12T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/34464; TID:201898608 |
url |
http://hdl.handle.net/10362/34464 |
identifier_str_mv |
TID:201898608 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799137926108413952 |