Feature selection for clustering categorical data with an embedded modeling approach

Silvestre, C.; Cardoso, M. G. M. S.; Figueiredo, M.

Bibliographic details
Main author:	Silvestre, C.
Publication date:	2015
Other authors:	Cardoso, M. G. M. S., Figueiredo, M.
Document type:	Article
Language:	eng
Source title:	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text:	http://hdl.handle.net/10071/9550
Abstract:	Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
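
The abstract gives no formulas, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes a feature-saliency mixture in which each categorical feature l is drawn from a cluster-specific distribution with probability rho[l] and from a distribution common to all clusters otherwise. The function name em_saliency_mixture, the parameter names, and the NumPy-based layout are hypothetical. The sketch uses plain maximum-likelihood EM updates and omits the minimum message length penalty that the abstract describes as the actual selection criterion, so it estimates saliencies but does not prune features.

```python
import numpy as np


def em_saliency_mixture(Y, n_clusters, n_iter=30, seed=0):
    """EM for a mixture of categorical distributions with feature saliencies.

    Y is an (n, d) integer array; column l holds category codes 0..C_l-1.
    Assumed model (not taken from the paper):
      p(y) = sum_k alpha[k] * prod_l (rho[l]*theta[l][k, y_l] + (1-rho[l])*q[l][y_l])
    """
    rng = np.random.default_rng(seed)
    n, d = Y.shape
    n_cats = [int(Y[:, l].max()) + 1 for l in range(d)]

    alpha = np.full(n_clusters, 1.0 / n_clusters)   # mixing proportions
    rho = np.full(d, 0.5)                           # feature saliencies
    # theta[l][k, c]: cluster-specific probabilities; q[l][c]: common ones
    theta = [rng.dirichlet(np.ones(C), size=n_clusters) for C in n_cats]
    q = [np.bincount(Y[:, l], minlength=C) / n for l, C in enumerate(n_cats)]
    loglik = []

    for _ in range(n_iter):
        # E-step: "relevant" (a) and "irrelevant" (b) parts of each feature term
        a = np.stack([rho[l] * theta[l][:, Y[:, l]].T for l in range(d)], axis=2)
        b = np.stack([(1 - rho[l]) * q[l][Y[:, l]] for l in range(d)], axis=1)
        c = a + b[:, None, :]                        # shape (n, K, d)
        joint = alpha * np.prod(c, axis=2)           # shape (n, K)
        norm = joint.sum(axis=1, keepdims=True)
        loglik.append(float(np.log(norm).sum()))     # log-likelihood before update
        w = joint / norm                             # cluster responsibilities
        ratio = np.divide(a, c, out=np.zeros_like(a), where=c > 0)
        u = w[:, :, None] * ratio                    # responsibility, feature relevant
        v = w[:, :, None] - u                        # responsibility, feature irrelevant

        # M-step: plain maximum-likelihood updates (no MML penalty)
        alpha = w.mean(axis=0)
        rho = u.sum(axis=(0, 1)) / n
        for l, C in enumerate(n_cats):
            onehot = np.eye(C)[Y[:, l]]              # shape (n, C)
            num_t = np.einsum("ik,ic->kc", u[:, :, l], onehot)
            theta[l] = num_t / np.maximum(num_t.sum(axis=1, keepdims=True), 1e-300)
            num_q = v[:, :, l].sum(axis=1) @ onehot
            q[l] = num_q / max(num_q.sum(), 1e-300)

    return alpha, rho, theta, q, w, loglik
```

Calling em_saliency_mixture(Y, n_clusters=2) on an integer-coded data matrix returns the mixing proportions, one saliency per feature, the estimated distributions, the responsibilities, and the log-likelihood trace. Without a penalty term the saliency of an uninformative feature is only weakly identified, which is one reason a criterion such as minimum message length is needed to drive irrelevant saliencies to zero.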

Item metadata

id	RCAP_87c45ed3ac9df19bc053f414f886afc3
oai_identifier_str	oai:repositorio.iscte-iul.pt:10071/9550
network_acronym_str	RCAP
network_name_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str	7160
dc.title.none.fl_str_mv	Feature selection for clustering categorical data with an embedded modeling approach
author	Silvestre, C.
author_facet	Silvestre, C. Cardoso, M. G. M. S. Figueiredo, M.
author_role	author
author2	Cardoso, M. G. M. S. Figueiredo, M.
author2_role	author author
dc.contributor.author.fl_str_mv	Silvestre, C. Cardoso, M. G. M. S. Figueiredo, M.
dc.subject.por.fl_str_mv	Cluster analysis Finite mixtures models EM algorithm Feature selection Categorical features
topic	Cluster analysis Finite mixtures models EM algorithm Feature selection Categorical features
publishDate	2015
dc.date.none.fl_str_mv	2015-08-04T14:38:18Z 2015-01-01T00:00:00Z 2015 2019-05-07T13:05:46Z
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/article
format	article
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10071/9550
url	http://hdl.handle.net/10071/9550
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	0266-4720 10.1111/exsy.12082
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv	embargoedAccess
dc.format.none.fl_str_mv	application/pdf
dc.publisher.none.fl_str_mv	Wiley
publisher.none.fl_str_mv	Wiley
dc.source.none.fl_str_mv	reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação instacron:RCAAP
instname_str	Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
collection	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository.name.fl_str_mv	Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_	1799134818300067840
