Feature selection for clustering categorical data with an embedded modeling approach
Main author: | Silvestre, C. |
---|---|
Publication date: | 2015 |
Other authors: | Cardoso, M. G. M. S.; Figueiredo, M. |
Document type: | Article |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10071/9550 |
Abstract: | Research on feature selection for clustering continues to develop. The task is challenging, mainly because there are no class labels to guide the search for relevant features. Categorical feature selection for clustering has rarely been addressed in the literature; most proposed approaches focus on numerical data. In this work, we propose an approach that simultaneously clusters categorical data and selects a subset of relevant features. Our approach is based on a modification of a finite mixture model (of multinomial distributions), in which a set of latent variables indicates the relevance of each feature. To estimate the model parameters, we implement a variant of the expectation-maximization algorithm that simultaneously selects the subset of relevant features, using a minimum message length criterion. The proposed approach compares favourably with two baseline methods: a filter based on an entropy measure and a wrapper based on mutual information. Results on synthetic data illustrate the ability of the proposed expectation-maximization method to recover the ground truth. An application to real data from official statistics shows its usefulness. |
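
The abstract describes a finite mixture of multinomial distributions in which latent variables indicate the relevance of each feature, fitted with a variant of expectation-maximization. The sketch below is one possible reading of that idea, not the authors' implementation: it assumes a feature-saliency formulation in which each feature follows a component-specific distribution with probability `rho` and a common distribution otherwise, and it uses plain maximum-likelihood updates. The minimum message length criterion mentioned in the abstract is omitted, so saliencies of irrelevant features are not forced to zero here. All names (`em_feature_saliency`, `rho`, `theta`, `q`, `eps`) and the synthetic data are hypothetical.

```python
import numpy as np


def em_feature_saliency(Y, n_components, n_categories, n_iter=50, seed=0, eps=1e-10):
    """EM sketch for a categorical mixture with per-feature saliencies.

    Y: (n, D) array of integer-coded categories (0 .. C_l - 1 for feature l).
    n_categories: list with the number of categories C_l of each feature.
    Assumption: plain maximum-likelihood updates, no MML penalty.
    """
    rng = np.random.default_rng(seed)
    n, D = Y.shape
    K = n_components
    alpha = np.full(K, 1.0 / K)   # mixing weights
    rho = np.full(D, 0.5)         # saliency: P(feature l is relevant)
    # theta[l]: (K, C_l) component-specific distributions (random start)
    theta = [rng.dirichlet(np.ones(C), size=K) for C in n_categories]
    # q[l]: (C_l,) common distribution, started at the empirical marginal
    q = [np.bincount(Y[:, l], minlength=C) / n for l, C in enumerate(n_categories)]
    log_lik = []
    for _ in range(n_iter):
        # E-step: a = "relevant" term, b = "irrelevant" term, per feature
        a = np.stack([rho[l] * theta[l][:, Y[:, l]].T for l in range(D)], axis=2)  # (n, K, D)
        b = np.stack([(1 - rho[l]) * q[l][Y[:, l]] for l in range(D)], axis=1)     # (n, D)
        c = a + b[:, None, :]
        log_w = np.log(alpha) + np.log(c).sum(axis=2)                              # (n, K)
        m = log_w.max(axis=1, keepdims=True)
        log_norm = m + np.log(np.exp(log_w - m).sum(axis=1, keepdims=True))
        log_lik.append(log_norm.sum())       # log-likelihood at current parameters
        w = np.exp(log_w - log_norm)         # posterior component memberships
        u = w[:, :, None] * a / c            # member of k AND feature l relevant
        v = w[:, :, None] - u                # member of k AND feature l irrelevant
        # M-step: weighted relative frequencies
        alpha = w.sum(axis=0) / n
        rho = u.sum(axis=(0, 1)) / n
        for l, C in enumerate(n_categories):
            onehot = np.eye(C)[Y[:, l]]                    # (n, C_l)
            num_t = u[:, :, l].T @ onehot + eps            # (K, C_l)
            theta[l] = num_t / num_t.sum(axis=1, keepdims=True)
            num_q = v[:, :, l].sum(axis=1) @ onehot + eps  # (C_l,)
            q[l] = num_q / num_q.sum()
    return alpha, rho, theta, q, np.array(log_lik)


# Hypothetical synthetic data: two clusters, two informative features, two noise features.
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=300)
informative = np.where(
    labels[:, None] == 0,
    rng.choice(3, size=(300, 2), p=[0.8, 0.1, 0.1]),
    rng.choice(3, size=(300, 2), p=[0.1, 0.1, 0.8]),
)
noise = rng.integers(0, 3, size=(300, 2))
Y = np.hstack([informative, noise])

alpha, rho, theta, q, log_lik = em_feature_saliency(Y, 2, [3, 3, 3, 3])
print("estimated saliencies:", np.round(rho, 2))
```

In this sketch the saliency of a feature that does not discriminate between components is only weakly identified, because its component-specific and common distributions can coincide; a penalized criterion such as the minimum message length criterion named in the abstract is what would push such saliencies toward zero.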
id | RCAP_87c45ed3ac9df19bc053f414f886afc3 |
---|---|
oai_identifier_str | oai:repositorio.iscte-iul.pt:10071/9550 |
network_acronym_str | RCAP |
network_name_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str | 7160 |
dc.title.none.fl_str_mv | Feature selection for clustering categorical data with an embedded modeling approach |
title | Feature selection for clustering categorical data with an embedded modeling approach |
spellingShingle | Feature selection for clustering categorical data with an embedded modeling approach; Silvestre, C.; Cluster analysis; Finite mixtures models; EM algorithm; Feature selection; Categorical features |
title_short | Feature selection for clustering categorical data with an embedded modeling approach |
title_full | Feature selection for clustering categorical data with an embedded modeling approach |
title_fullStr | Feature selection for clustering categorical data with an embedded modeling approach |
title_full_unstemmed | Feature selection for clustering categorical data with an embedded modeling approach |
title_sort | Feature selection for clustering categorical data with an embedded modeling approach |
author | Silvestre, C. |
author_facet | Silvestre, C.; Cardoso, M. G. M. S.; Figueiredo, M. |
author_role | author |
author2 | Cardoso, M. G. M. S.; Figueiredo, M. |
author2_role | author; author |
dc.contributor.author.fl_str_mv | Silvestre, C.; Cardoso, M. G. M. S.; Figueiredo, M. |
dc.subject.por.fl_str_mv | Cluster analysis; Finite mixtures models; EM algorithm; Feature selection; Categorical features |
topic | Cluster analysis; Finite mixtures models; EM algorithm; Feature selection; Categorical features |
publishDate | 2015 |
dc.date.none.fl_str_mv | 2015-08-04T14:38:18Z; 2015-01-01T00:00:00Z; 2015; 2019-05-07T13:05:46Z |
dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
format | article |
status_str | publishedVersion |
dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10071/9550 |
url | http://hdl.handle.net/10071/9550 |
dc.language.iso.fl_str_mv | eng |
language | eng |
dc.relation.none.fl_str_mv | 0266-4720; 10.1111/exsy.12082 |
dc.rights.driver.fl_str_mv | info:eu-repo/semantics/embargoedAccess |
eu_rights_str_mv | embargoedAccess |
dc.format.none.fl_str_mv | application/pdf |
dc.publisher.none.fl_str_mv | Wiley |
publisher.none.fl_str_mv | Wiley |
dc.source.none.fl_str_mv | reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str | Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str | RCAAP |
institution | RCAAP |
reponame_str | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv | |
_version_ | 1799134818300067840 |