Feature selection for clustering categorical data with an embedded modeling approach

Bibliographic details
Main author: Silvestre, C.
Publication date: 2015
Other authors: Cardoso, M. G. M. S., Figueiredo, M.
Document type: Article
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
Full text: http://hdl.handle.net/10071/9550
Abstract: Research on the problem of feature selection for clustering continues to develop. This is a challenging task, mainly due to the absence of class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature, with most of the proposed approaches having focused on numerical data. In this work, we propose an approach to simultaneously cluster categorical data and select a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), where a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. The results obtained on synthetic data illustrate the ability of the proposed expectation-maximization method to recover ground truth. An application to real data from official statistics shows its usefulness.
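The model outlined in the abstract can be illustrated with a minimal sketch. It assumes a feature-saliency formulation in which each feature l is drawn from a cluster-specific categorical distribution with probability rho[l] and from a distribution common to all clusters otherwise. This is an assumption made for illustration, not the authors' code: the minimum message length penalty and the pruning of features or components mentioned in the abstract are omitted, so the updates below are plain maximum-likelihood EM. The function name, its parameters, and the smoothing constant eps are hypothetical.

```python
import numpy as np

def em_saliency_mixture(X, n_clusters, n_categories, n_iter=100, seed=0, eps=1e-6):
    """EM for a mixture of categorical distributions with feature saliencies.

    X is an (n, d) integer array with X[i, l] in range(n_categories[l]).
    Returns mixing weights alpha, saliencies rho, cluster-specific
    distributions theta, common distributions q, and responsibilities w.
    Illustrative only: no MML penalty, no pruning.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.full(n_clusters, 1.0 / n_clusters)
    rho = np.full(d, 0.5)  # assumed starting saliency for every feature
    # theta[l] has shape (n_clusters, n_categories[l]); random starting point.
    theta = [rng.dirichlet(np.ones(c), size=n_clusters) for c in n_categories]
    # q[l] starts at the smoothed empirical category frequencies.
    q = []
    for l, c in enumerate(n_categories):
        counts = np.bincount(X[:, l], minlength=c) + eps
        q.append(counts / counts.sum())

    for _ in range(n_iter):
        # E-step: a[i, k, l] is the "relevant" term, b[i, l] the "irrelevant" one.
        a = np.stack([rho[l] * theta[l][:, X[:, l]].T for l in range(d)], axis=2)
        b = np.stack([(1.0 - rho[l]) * q[l][X[:, l]] for l in range(d)], axis=1)
        mix = a + b[:, None, :]
        log_w = np.log(np.maximum(alpha, 1e-300)) + np.log(mix).sum(axis=2)
        log_w -= log_w.max(axis=1, keepdims=True)  # stabilise before exponentiating
        w = np.exp(log_w)
        w /= w.sum(axis=1, keepdims=True)          # cluster responsibilities
        u = w[:, :, None] * a / mix                # feature relevant, given cluster
        v = w[:, :, None] - u                      # feature irrelevant, given cluster

        # M-step: weighted relative frequencies, lightly smoothed by eps.
        alpha = w.mean(axis=0)
        rho = u.sum(axis=(0, 1)) / n
        for l, c in enumerate(n_categories):
            onehot = np.eye(c)[X[:, l]]            # (n, c) indicator matrix
            t = u[:, :, l].T @ onehot + eps        # (n_clusters, c)
            theta[l] = t / t.sum(axis=1, keepdims=True)
            s = v[:, :, l].sum(axis=1) @ onehot + eps
            q[l] = s / s.sum()
    return alpha, rho, theta, q, w
```

A call such as em_saliency_mixture(X, n_clusters=2, n_categories=[2, 2, 3]) returns one saliency per feature in rho. Without the minimum message length criterion these saliencies are not driven to exactly 0 or 1, so this sketch estimates feature relevance but does not select a subset of features by itself.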
id RCAP_87c45ed3ac9df19bc053f414f886afc3
oai_identifier_str oai:repositorio.iscte-iul.pt:10071/9550
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
repository_id_str 7160
dc.contributor.author.fl_str_mv Silvestre, C.
Cardoso, M. G. M. S.
Figueiredo, M.
dc.subject.por.fl_str_mv Cluster analysis
Finite mixtures models
EM algorithm
Feature selection
Categorical features
publishDate 2015
dc.date.none.fl_str_mv 2015-08-04T14:38:18Z
2015-01-01T00:00:00Z
2015
2019-05-07T13:05:46Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10071/9550
url http://hdl.handle.net/10071/9550
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 0266-4720
10.1111/exsy.12082
dc.rights.driver.fl_str_mv info:eu-repo/semantics/embargoedAccess
eu_rights_str_mv embargoedAccess
dc.format.none.fl_str_mv application/pdf
dc.publisher.none.fl_str_mv Wiley
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
_version_ 1799134818300067840