Product partition model for categorical features
Autor(a) principal: | |
---|---|
Data de Publicação: | 2019 |
Tipo de documento: | Dissertação |
Idioma: | eng |
Título da fonte: | Repositório Institucional da UFMG |
Texto Completo: | http://hdl.handle.net/1843/34069 |
Resumo: | A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result. |
id |
UFMG_c2aef2b154c1a6a306f707b171408a48 |
---|---|
oai_identifier_str |
oai:repositorio.ufmg.br:1843/34069 |
network_acronym_str |
UFMG |
network_name_str |
Repositório Institucional da UFMG |
repository_id_str |
|
spelling |
Wagner Meira Juniorhttp://lattes.cnpq.br/9092587237114334Renato Martins Assunçãohttp://lattes.cnpq.br/3575559872183767Rosangela Helena LoschiDenis Deratani MauáFabrício Murai Ferreirahttp://lattes.cnpq.br/3857763744325044Tulio Lima Criscuolo2020-08-28T19:10:39Z2020-08-28T19:10:39Z2019-10-25http://hdl.handle.net/1843/34069A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result.O tratamento de atributos categóricos com uma grande quantidade de categorias é um problema recorrente em análise de dados. Existem poucas propostas na literatura para lidar com este problema importante e recorrente. Introduzimos um modelo generativo, que simultaneamente estima os parâmetros e o agrupamento do atributo categórico em grupos. Nossa proposta é baseada em impor um grafo no qual os nós correspondem a categorias e criando uma distribuição de probabilidade sobre partições deste grafo. Sendo um modelo Bayesiano, somos capazes de fazer inferência a posteriori sobre os seus parâmetros e o particionamento do atributo categórico. Comparamos nosso mod- elo com métodos estado da arte e mostramos que obtemos uma capacidade preditiva igualmente boa e melhor interpretação dos resultados obtidos.CNPq - Conselho Nacional de Desenvolvimento Científico e TecnológicoengUniversidade Federal de Minas GeraisPrograma de Pós-Graduação em Ciência da ComputaçãoUFMGBrasilICEX - INSTITUTO DE CIÊNCIAS EXATASModelo HierárquicoRegressãoClusterizaçãoRedução de dimensãoHierarchical ModelRegressionClusteringDimensionality ReductionProduct partition model for categorical featuresModelo partição produto para atributos categóricosinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisinfo:eu-repo/semantics/openAccessreponame:Repositório Institucional da UFMGinstname:Universidade Federal de Minas Gerais (UFMG)instacron:UFMGORIGINALTulioLimaCriscuolo_substituicaofinal.pdfTulioLimaCriscuolo_substituicaofinal.pdfapplication/pdf10627356https://repositorio.ufmg.br/bitstream/1843/34069/3/TulioLimaCriscuolo_substituicaofinal.pdfbcbea5d3e5685e9c67197f984d02b46eMD53LICENSElicense.txtlicense.txttext/plain; charset=utf-82119https://repositorio.ufmg.br/bitstream/1843/34069/2/license.txt34badce4be7e31e3adb4575ae96af679MD521843/340692021-07-20 12:31:33.376oai:repositorio.ufmg.br:1843/34069TElDRU7Dh0EgREUgRElTVFJJQlVJw4fDg08gTsODTy1FWENMVVNJVkEgRE8gUkVQT1NJVMOTUklPIElOU1RJVFVDSU9OQUwgREEgVUZNRwoKQ29tIGEgYXByZXNlbnRhw6fDo28gZGVzdGEgbGljZW7Dp2EsIHZvY8OqIChvIGF1dG9yIChlcykgb3UgbyB0aXR1bGFyIGRvcyBkaXJlaXRvcyBkZSBhdXRvcikgY29uY2VkZSBhbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIChSSS1VRk1HKSBvIGRpcmVpdG8gbsOjbyBleGNsdXNpdm8gZSBpcnJldm9nw6F2ZWwgZGUgcmVwcm9kdXppciBlL291IGRpc3RyaWJ1aXIgYSBzdWEgcHVibGljYcOnw6NvIChpbmNsdWluZG8gbyByZXN1bW8pIHBvciB0b2RvIG8gbXVuZG8gbm8gZm9ybWF0byBpbXByZXNzbyBlIGVsZXRyw7RuaWNvIGUgZW0gcXVhbHF1ZXIgbWVpbywgaW5jbHVpbmRvIG9zIGZvcm1hdG9zIMOhdWRpbyBvdSB2w61kZW8uCgpWb2PDqiBkZWNsYXJhIHF1ZSBjb25oZWNlIGEgcG9sw610aWNhIGRlIGNvcHlyaWdodCBkYSBlZGl0b3JhIGRvIHNldSBkb2N1bWVudG8gZSBxdWUgY29uaGVjZSBlIGFjZWl0YSBhcyBEaXJldHJpemVzIGRvIFJJLVVGTUcuCgpWb2PDqiBjb25jb3JkYSBxdWUgbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIHBvZGUsIHNlbSBhbHRlcmFyIG8gY29udGXDumRvLCB0cmFuc3BvciBhIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBxdWFscXVlciBtZWlvIG91IGZvcm1hdG8gcGFyYSBmaW5zIGRlIHByZXNlcnZhw6fDo28uCgpWb2PDqiB0YW1iw6ltIGNvbmNvcmRhIHF1ZSBvIFJlcG9zaXTDs3JpbyBJbnN0aXR1Y2lvbmFsIGRhIFVGTUcgcG9kZSBtYW50ZXIgbWFpcyBkZSB1bWEgY8OzcGlhIGRlIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBmaW5zIGRlIHNlZ3VyYW7Dp2EsIGJhY2stdXAgZSBwcmVzZXJ2YcOnw6NvLgoKVm9jw6ogZGVjbGFyYSBxdWUgYSBzdWEgcHVibGljYcOnw6NvIMOpIG9yaWdpbmFsIGUgcXVlIHZvY8OqIHRlbSBvIHBvZGVyIGRlIGNvbmNlZGVyIG9zIGRpcmVpdG9zIGNvbnRpZG9zIG5lc3RhIGxpY2Vuw6dhLiBWb2PDqiB0YW1iw6ltIGRlY2xhcmEgcXVlIG8gZGVww7NzaXRvIGRlIHN1YSBwdWJsaWNhw6fDo28gbsOjbywgcXVlIHNlamEgZGUgc2V1IGNvbmhlY2ltZW50bywgaW5mcmluZ2UgZGlyZWl0b3MgYXV0b3JhaXMgZGUgbmluZ3XDqW0uCgpDYXNvIGEgc3VhIHB1YmxpY2HDp8OjbyBjb250ZW5oYSBtYXRlcmlhbCBxdWUgdm9jw6ogbsOjbyBwb3NzdWkgYSB0aXR1bGFyaWRhZGUgZG9zIGRpcmVpdG9zIGF1dG9yYWlzLCB2b2PDqiBkZWNsYXJhIHF1ZSBvYnRldmUgYSBwZXJtaXNzw6NvIGlycmVzdHJpdGEgZG8gZGV0ZW50b3IgZG9zIGRpcmVpdG9zIGF1dG9yYWlzIHBhcmEgY29uY2VkZXIgYW8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBvcyBkaXJlaXRvcyBhcHJlc2VudGFkb3MgbmVzdGEgbGljZW7Dp2EsIGUgcXVlIGVzc2UgbWF0ZXJpYWwgZGUgcHJvcHJpZWRhZGUgZGUgdGVyY2Vpcm9zIGVzdMOhIGNsYXJhbWVudGUgaWRlbnRpZmljYWRvIGUgcmVjb25oZWNpZG8gbm8gdGV4dG8gb3Ugbm8gY29udGXDumRvIGRhIHB1YmxpY2HDp8OjbyBvcmEgZGVwb3NpdGFkYS4KCkNBU08gQSBQVUJMSUNBw4fDg08gT1JBIERFUE9TSVRBREEgVEVOSEEgU0lETyBSRVNVTFRBRE8gREUgVU0gUEFUUk9Dw41OSU8gT1UgQVBPSU8gREUgVU1BIEFHw4pOQ0lBIERFIEZPTUVOVE8gT1UgT1VUUk8gT1JHQU5JU01PLCBWT0PDiiBERUNMQVJBIFFVRSBSRVNQRUlUT1UgVE9ET1MgRSBRVUFJU1FVRVIgRElSRUlUT1MgREUgUkVWSVPDg08gQ09NTyBUQU1Cw4lNIEFTIERFTUFJUyBPQlJJR0HDh8OVRVMgRVhJR0lEQVMgUE9SIENPTlRSQVRPIE9VIEFDT1JETy4KCk8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBzZSBjb21wcm9tZXRlIGEgaWRlbnRpZmljYXIgY2xhcmFtZW50ZSBvIHNldSBub21lKHMpIG91IG8ocykgbm9tZXMocykgZG8ocykgZGV0ZW50b3IoZXMpIGRvcyBkaXJlaXRvcyBhdXRvcmFpcyBkYSBwdWJsaWNhw6fDo28sIGUgbsOjbyBmYXLDoSBxdWFscXVlciBhbHRlcmHDp8OjbywgYWzDqW0gZGFxdWVsYXMgY29uY2VkaWRhcyBwb3IgZXN0YSBsaWNlbsOnYS4KCg==Repositório de PublicaçõesPUBhttps://repositorio.ufmg.br/oaiopendoar:2021-07-20T15:31:33Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)false |
dc.title.pt_BR.fl_str_mv |
Product partition model for categorical features |
dc.title.alternative.pt_BR.fl_str_mv |
Modelo partição produto para atributos categóricos |
title |
Product partition model for categorical features |
spellingShingle |
Product partition model for categorical features Tulio Lima Criscuolo Hierarchical Model Regression Clustering Dimensionality Reduction Modelo Hierárquico Regressão Clusterização Redução de dimensão |
title_short |
Product partition model for categorical features |
title_full |
Product partition model for categorical features |
title_fullStr |
Product partition model for categorical features |
title_full_unstemmed |
Product partition model for categorical features |
title_sort |
Product partition model for categorical features |
author |
Tulio Lima Criscuolo |
author_facet |
Tulio Lima Criscuolo |
author_role |
author |
dc.contributor.advisor1.fl_str_mv |
Wagner Meira Junior |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/9092587237114334 |
dc.contributor.advisor2.fl_str_mv |
Renato Martins Assunção |
dc.contributor.advisor2Lattes.fl_str_mv |
http://lattes.cnpq.br/3575559872183767 |
dc.contributor.referee1.fl_str_mv |
Rosangela Helena Loschi |
dc.contributor.referee2.fl_str_mv |
Denis Deratani Mauá |
dc.contributor.referee3.fl_str_mv |
Fabrício Murai Ferreira |
dc.contributor.authorLattes.fl_str_mv |
http://lattes.cnpq.br/3857763744325044 |
dc.contributor.author.fl_str_mv |
Tulio Lima Criscuolo |
contributor_str_mv |
Wagner Meira Junior Renato Martins Assunção Rosangela Helena Loschi Denis Deratani Mauá Fabrício Murai Ferreira |
dc.subject.por.fl_str_mv |
Hierarchical Model Regression Clustering Dimensionality Reduction |
topic |
Hierarchical Model Regression Clustering Dimensionality Reduction Modelo Hierárquico Regressão Clusterização Redução de dimensão |
dc.subject.other.pt_BR.fl_str_mv |
Modelo Hierárquico Regressão Clusterização Redução de dimensão |
description |
A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result. |
publishDate |
2019 |
dc.date.issued.fl_str_mv |
2019-10-25 |
dc.date.accessioned.fl_str_mv |
2020-08-28T19:10:39Z |
dc.date.available.fl_str_mv |
2020-08-28T19:10:39Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/1843/34069 |
url |
http://hdl.handle.net/1843/34069 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.publisher.program.fl_str_mv |
Programa de Pós-Graduação em Ciência da Computação |
dc.publisher.initials.fl_str_mv |
UFMG |
dc.publisher.country.fl_str_mv |
Brasil |
dc.publisher.department.fl_str_mv |
ICEX - INSTITUTO DE CIÊNCIAS EXATAS |
publisher.none.fl_str_mv |
Universidade Federal de Minas Gerais |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFMG instname:Universidade Federal de Minas Gerais (UFMG) instacron:UFMG |
instname_str |
Universidade Federal de Minas Gerais (UFMG) |
instacron_str |
UFMG |
institution |
UFMG |
reponame_str |
Repositório Institucional da UFMG |
collection |
Repositório Institucional da UFMG |
bitstream.url.fl_str_mv |
https://repositorio.ufmg.br/bitstream/1843/34069/3/TulioLimaCriscuolo_substituicaofinal.pdf https://repositorio.ufmg.br/bitstream/1843/34069/2/license.txt |
bitstream.checksum.fl_str_mv |
bcbea5d3e5685e9c67197f984d02b46e 34badce4be7e31e3adb4575ae96af679 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG) |
repository.mail.fl_str_mv |
|
_version_ |
1803589532260499456 |