On fuzzy cluster validity indices for soft subspace clustering of high-dimensional datasets

Eustáquio, Fernanda Silva

Bibliographic details
Main author:	Eustáquio, Fernanda Silva
Publication date:	2020
Document type:	Master's thesis
Language:	eng
Source title:	Repositório Institucional da UFBA
Full text:	http://repositorio.ufba.br/ri/handle/ri/33507
Abstract:	Most well-known and widely used conventional clustering algorithms, such as k-Means and Fuzzy c-Means (FCM), were designed on the assumption that the number of objects in a dataset is usually greater than its number of dimensions (features). This assumption fails when a dataset consists of text documents or DNA microarrays, in which the number of dimensions is much larger than the number of objects. Most studies have found that FCM and fuzzy cluster validity indices (CVIs) perform poorly on high-dimensional data, even when a similarity or dissimilarity measure suited to this type of data is used. The problems caused by high dimensionality are known as the curse of dimensionality, and approaches such as feature transformation, feature selection, feature weighting, and subspace clustering were defined to deal with thousands of dimensions. This work uses soft subspace clustering, on the premise that all dimensions should be kept in order to learn as much as possible from each object and that a single subset of features might not suit all clusters. Besides FCM, three soft subspace algorithms, Simultaneous Clustering and Attribute Discrimination (SCAD), Maximum-entropy-regularized Weighted Fuzzy c-Means (EWFCM), and Enhanced Soft Subspace Clustering (ESSC), were applied to three types of high-dimensional data (Gaussian mixture, text, microarray). The results were evaluated with fuzzy CVIs rather than with external measures such as Clustering Accuracy, Rand Index, and Normalized Mutual Information, which use class-label information and are the usual choice in research studies. In an overall evaluation of the experimental results, all the clustering algorithms performed similarly; ESSC gave the best result, and FCM outperformed the remaining soft subspace algorithms.
In addition to applying the soft subspace technique, and in search of the cause of the poor performance of conventional techniques on high-dimensional data, the work investigated which distance measure or value of the fuzzy weighting exponent (m) produced the best clustering result. Furthermore, the performance of nineteen fuzzy CVIs was evaluated by checking whether tendencies and problems reported in previous studies persist when validating soft subspace clustering results. The analysis made clear that the type of data was a determining factor in the performance of both the clustering algorithms and the fuzzy CVIs.
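The abstract refers to the fuzzy weighting exponent (m) and to fuzzy CVIs without showing how they interact. The following is a minimal sketch, not taken from the thesis, of the standard FCM membership update with Euclidean distance, together with Bezdek's partition coefficient as one well-known fuzzy CVI. Whether that index is among the nineteen evaluated in the thesis is not stated in this record, so its use here is an assumption; the function names and the toy data are hypothetical.

```python
import math


def fcm_memberships(points, centers, m=2.0):
    """Standard FCM membership update for fixed centers.

    u_ik = 1 / sum_j (d_ik / d_jk) ** (2 / (m - 1)), with Euclidean distance.
    Returns one row of memberships per point; each row sums to 1.
    """
    if m <= 1.0:
        raise ValueError("the fuzzy exponent m must be greater than 1")
    power = 2.0 / (m - 1.0)
    memberships = []
    for x in points:
        d = [math.dist(x, c) for c in centers]
        if any(di == 0.0 for di in d):
            # The point coincides with one or more centers:
            # share the full membership among those centers only.
            hits = [1.0 if di == 0.0 else 0.0 for di in d]
            total = sum(hits)
            memberships.append([h / total for h in hits])
            continue
        memberships.append(
            [1.0 / sum((di / dj) ** power for dj in d) for di in d]
        )
    return memberships


def partition_coefficient(memberships):
    """Bezdek's partition coefficient: mean of the squared memberships.

    Ranges from 1/c (completely fuzzy) to 1 (crisp partition).
    """
    n = len(memberships)
    return sum(u * u for row in memberships for u in row) / n


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups on a line.
    pts = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
    ctrs = [(0.5, 0.0), (4.5, 0.0)]
    for m in (1.5, 2.0, 3.0):
        pc = partition_coefficient(fcm_memberships(pts, ctrs, m=m))
        print(f"m={m}: partition coefficient = {pc:.4f}")
```

On this toy data the partition coefficient falls as m grows, because larger exponents spread each object's membership more evenly across clusters. That sensitivity is one reason the choice of m matters when fuzzy CVIs are used to compare clustering results.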

Item metadata

id	UFBA-2_644dbe034270b895312fc35d7f998e3a
oai_identifier_str	oai:repositorio.ufba.br:ri/33507
network_acronym_str	UFBA-2
network_name_str	Repositório Institucional da UFBA
repository_id_str	1932
dc.title.pt_BR.fl_str_mv	On fuzzy cluster validity indices for soft subspace clustering of high-dimensional datasets
title	On fuzzy cluster validity indices for soft subspace clustering of high-dimensional datasets
author	Eustáquio, Fernanda Silva
author_facet	Eustáquio, Fernanda Silva
author_role	author
dc.contributor.author.fl_str_mv	Eustáquio, Fernanda Silva
dc.contributor.advisor1.fl_str_mv	Rios, Tatiane Nogueira
dc.contributor.referee1.fl_str_mv	Camargo, Heloísa de Arruda; Marcacini, Ricardo Marcondes
contributor_str_mv	Rios, Tatiane Nogueira; Camargo, Heloísa de Arruda; Marcacini, Ricardo Marcondes
dc.subject.cnpq.fl_str_mv	Ciências Exatas e da Terra; Ciência da Computação
topic	Ciências Exatas e da Terra; Ciência da Computação; Fuzzy cluster validity indices; Soft subspace clustering; Fuzzy clustering; Fuzzy c-Means model; High-dimensional data; Algoritmo de agrupamento
dc.subject.por.fl_str_mv	Fuzzy cluster validity indices; Soft subspace clustering; Fuzzy clustering; Fuzzy c-Means model; High-dimensional data; Algoritmo de agrupamento
publishDate	2020
dc.date.submitted.none.fl_str_mv	2020-04-16
dc.date.accessioned.fl_str_mv	2021-05-27T21:10:35Z
dc.date.issued.fl_str_mv	2021-05-27
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/masterThesis
format	masterThesis
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://repositorio.ufba.br/ri/handle/ri/33507
url	http://repositorio.ufba.br/ri/handle/ri/33507
dc.language.iso.fl_str_mv	eng
language	eng
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal da Bahia; Instituto de Matemática e Estatística; Departamento de Ciência da Computação
dc.publisher.program.fl_str_mv	em Ciência da Computação
dc.publisher.initials.fl_str_mv	UFBA
dc.publisher.country.fl_str_mv	brasil
publisher.none.fl_str_mv	Universidade Federal da Bahia; Instituto de Matemática e Estatística; Departamento de Ciência da Computação
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFBA instname:Universidade Federal da Bahia (UFBA) instacron:UFBA
instname_str	Universidade Federal da Bahia (UFBA)
instacron_str	UFBA
institution	UFBA
reponame_str	Repositório Institucional da UFBA
collection	Repositório Institucional da UFBA
bitstream.url.fl_str_mv	https://repositorio.ufba.br/bitstream/ri/33507/1/On_fuzzy_cluster_validity_indices_for_soft_subspace_clustering_of_high_dimensional_datasets.pdf https://repositorio.ufba.br/bitstream/ri/33507/2/license.txt
bitstream.checksum.fl_str_mv	20a815fc083d5f23d5d22e66c06c5568 817035eff4c4c7dda1d546e170ee2a1a
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFBA - Universidade Federal da Bahia (UFBA)
repository.mail.fl_str_mv
_version_	1808459628272418816
