Product partition model for categorical features

Detalhes bibliográficos
Autor(a) principal: Tulio Lima Criscuolo
Data de Publicação: 2019
Tipo de documento: Dissertação
Idioma: eng
Título da fonte: Repositório Institucional da UFMG
Texto Completo: http://hdl.handle.net/1843/34069
Resumo: A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result.
id UFMG_c2aef2b154c1a6a306f707b171408a48
oai_identifier_str oai:repositorio.ufmg.br:1843/34069
network_acronym_str UFMG
network_name_str Repositório Institucional da UFMG
repository_id_str
spelling Wagner Meira Juniorhttp://lattes.cnpq.br/9092587237114334Renato Martins Assunçãohttp://lattes.cnpq.br/3575559872183767Rosangela Helena LoschiDenis Deratani MauáFabrício Murai Ferreirahttp://lattes.cnpq.br/3857763744325044Tulio Lima Criscuolo2020-08-28T19:10:39Z2020-08-28T19:10:39Z2019-10-25http://hdl.handle.net/1843/34069A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result.O tratamento de atributos categóricos com uma grande quantidade de categorias é um problema recorrente em análise de dados. Existem poucas propostas na literatura para lidar com este problema importante e recorrente. Introduzimos um modelo generativo, que simultaneamente estima os parâmetros e o agrupamento do atributo categórico em grupos. Nossa proposta é baseada em impor um grafo no qual os nós correspondem a categorias e criando uma distribuição de probabilidade sobre partições deste grafo. Sendo um modelo Bayesiano, somos capazes de fazer inferência a posteriori sobre os seus parâmetros e o particionamento do atributo categórico. Comparamos nosso mod- elo com métodos estado da arte e mostramos que obtemos uma capacidade preditiva igualmente boa e melhor interpretação dos resultados obtidos.CNPq - Conselho Nacional de Desenvolvimento Científico e TecnológicoengUniversidade Federal de Minas GeraisPrograma de Pós-Graduação em Ciência da ComputaçãoUFMGBrasilICEX - INSTITUTO DE CIÊNCIAS EXATASModelo HierárquicoRegressãoClusterizaçãoRedução de dimensãoHierarchical ModelRegressionClusteringDimensionality ReductionProduct partition model for categorical featuresModelo partição produto para atributos categóricosinfo:eu-repo/semantics/publishedVersioninfo:eu-repo/semantics/masterThesisinfo:eu-repo/semantics/openAccessreponame:Repositório Institucional da UFMGinstname:Universidade Federal de Minas Gerais (UFMG)instacron:UFMGORIGINALTulioLimaCriscuolo_substituicaofinal.pdfTulioLimaCriscuolo_substituicaofinal.pdfapplication/pdf10627356https://repositorio.ufmg.br/bitstream/1843/34069/3/TulioLimaCriscuolo_substituicaofinal.pdfbcbea5d3e5685e9c67197f984d02b46eMD53LICENSElicense.txtlicense.txttext/plain; charset=utf-82119https://repositorio.ufmg.br/bitstream/1843/34069/2/license.txt34badce4be7e31e3adb4575ae96af679MD521843/340692021-07-20 12:31:33.376oai:repositorio.ufmg.br:1843/34069TElDRU7Dh0EgREUgRElTVFJJQlVJw4fDg08gTsODTy1FWENMVVNJVkEgRE8gUkVQT1NJVMOTUklPIElOU1RJVFVDSU9OQUwgREEgVUZNRwoKQ29tIGEgYXByZXNlbnRhw6fDo28gZGVzdGEgbGljZW7Dp2EsIHZvY8OqIChvIGF1dG9yIChlcykgb3UgbyB0aXR1bGFyIGRvcyBkaXJlaXRvcyBkZSBhdXRvcikgY29uY2VkZSBhbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIChSSS1VRk1HKSBvIGRpcmVpdG8gbsOjbyBleGNsdXNpdm8gZSBpcnJldm9nw6F2ZWwgZGUgcmVwcm9kdXppciBlL291IGRpc3RyaWJ1aXIgYSBzdWEgcHVibGljYcOnw6NvIChpbmNsdWluZG8gbyByZXN1bW8pIHBvciB0b2RvIG8gbXVuZG8gbm8gZm9ybWF0byBpbXByZXNzbyBlIGVsZXRyw7RuaWNvIGUgZW0gcXVhbHF1ZXIgbWVpbywgaW5jbHVpbmRvIG9zIGZvcm1hdG9zIMOhdWRpbyBvdSB2w61kZW8uCgpWb2PDqiBkZWNsYXJhIHF1ZSBjb25oZWNlIGEgcG9sw610aWNhIGRlIGNvcHlyaWdodCBkYSBlZGl0b3JhIGRvIHNldSBkb2N1bWVudG8gZSBxdWUgY29uaGVjZSBlIGFjZWl0YSBhcyBEaXJldHJpemVzIGRvIFJJLVVGTUcuCgpWb2PDqiBjb25jb3JkYSBxdWUgbyBSZXBvc2l0w7NyaW8gSW5zdGl0dWNpb25hbCBkYSBVRk1HIHBvZGUsIHNlbSBhbHRlcmFyIG8gY29udGXDumRvLCB0cmFuc3BvciBhIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBxdWFscXVlciBtZWlvIG91IGZvcm1hdG8gcGFyYSBmaW5zIGRlIHByZXNlcnZhw6fDo28uCgpWb2PDqiB0YW1iw6ltIGNvbmNvcmRhIHF1ZSBvIFJlcG9zaXTDs3JpbyBJbnN0aXR1Y2lvbmFsIGRhIFVGTUcgcG9kZSBtYW50ZXIgbWFpcyBkZSB1bWEgY8OzcGlhIGRlIHN1YSBwdWJsaWNhw6fDo28gcGFyYSBmaW5zIGRlIHNlZ3VyYW7Dp2EsIGJhY2stdXAgZSBwcmVzZXJ2YcOnw6NvLgoKVm9jw6ogZGVjbGFyYSBxdWUgYSBzdWEgcHVibGljYcOnw6NvIMOpIG9yaWdpbmFsIGUgcXVlIHZvY8OqIHRlbSBvIHBvZGVyIGRlIGNvbmNlZGVyIG9zIGRpcmVpdG9zIGNvbnRpZG9zIG5lc3RhIGxpY2Vuw6dhLiBWb2PDqiB0YW1iw6ltIGRlY2xhcmEgcXVlIG8gZGVww7NzaXRvIGRlIHN1YSBwdWJsaWNhw6fDo28gbsOjbywgcXVlIHNlamEgZGUgc2V1IGNvbmhlY2ltZW50bywgaW5mcmluZ2UgZGlyZWl0b3MgYXV0b3JhaXMgZGUgbmluZ3XDqW0uCgpDYXNvIGEgc3VhIHB1YmxpY2HDp8OjbyBjb250ZW5oYSBtYXRlcmlhbCBxdWUgdm9jw6ogbsOjbyBwb3NzdWkgYSB0aXR1bGFyaWRhZGUgZG9zIGRpcmVpdG9zIGF1dG9yYWlzLCB2b2PDqiBkZWNsYXJhIHF1ZSBvYnRldmUgYSBwZXJtaXNzw6NvIGlycmVzdHJpdGEgZG8gZGV0ZW50b3IgZG9zIGRpcmVpdG9zIGF1dG9yYWlzIHBhcmEgY29uY2VkZXIgYW8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBvcyBkaXJlaXRvcyBhcHJlc2VudGFkb3MgbmVzdGEgbGljZW7Dp2EsIGUgcXVlIGVzc2UgbWF0ZXJpYWwgZGUgcHJvcHJpZWRhZGUgZGUgdGVyY2Vpcm9zIGVzdMOhIGNsYXJhbWVudGUgaWRlbnRpZmljYWRvIGUgcmVjb25oZWNpZG8gbm8gdGV4dG8gb3Ugbm8gY29udGXDumRvIGRhIHB1YmxpY2HDp8OjbyBvcmEgZGVwb3NpdGFkYS4KCkNBU08gQSBQVUJMSUNBw4fDg08gT1JBIERFUE9TSVRBREEgVEVOSEEgU0lETyBSRVNVTFRBRE8gREUgVU0gUEFUUk9Dw41OSU8gT1UgQVBPSU8gREUgVU1BIEFHw4pOQ0lBIERFIEZPTUVOVE8gT1UgT1VUUk8gT1JHQU5JU01PLCBWT0PDiiBERUNMQVJBIFFVRSBSRVNQRUlUT1UgVE9ET1MgRSBRVUFJU1FVRVIgRElSRUlUT1MgREUgUkVWSVPDg08gQ09NTyBUQU1Cw4lNIEFTIERFTUFJUyBPQlJJR0HDh8OVRVMgRVhJR0lEQVMgUE9SIENPTlRSQVRPIE9VIEFDT1JETy4KCk8gUmVwb3NpdMOzcmlvIEluc3RpdHVjaW9uYWwgZGEgVUZNRyBzZSBjb21wcm9tZXRlIGEgaWRlbnRpZmljYXIgY2xhcmFtZW50ZSBvIHNldSBub21lKHMpIG91IG8ocykgbm9tZXMocykgZG8ocykgZGV0ZW50b3IoZXMpIGRvcyBkaXJlaXRvcyBhdXRvcmFpcyBkYSBwdWJsaWNhw6fDo28sIGUgbsOjbyBmYXLDoSBxdWFscXVlciBhbHRlcmHDp8OjbywgYWzDqW0gZGFxdWVsYXMgY29uY2VkaWRhcyBwb3IgZXN0YSBsaWNlbsOnYS4KCg==Repositório de PublicaçõesPUBhttps://repositorio.ufmg.br/oaiopendoar:2021-07-20T15:31:33Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)false
dc.title.pt_BR.fl_str_mv Product partition model for categorical features
dc.title.alternative.pt_BR.fl_str_mv Modelo partição produto para atributos categóricos
title Product partition model for categorical features
spellingShingle Product partition model for categorical features
Tulio Lima Criscuolo
Hierarchical Model
Regression
Clustering
Dimensionality Reduction
Modelo Hierárquico
Regressão
Clusterização
Redução de dimensão
title_short Product partition model for categorical features
title_full Product partition model for categorical features
title_fullStr Product partition model for categorical features
title_full_unstemmed Product partition model for categorical features
title_sort Product partition model for categorical features
author Tulio Lima Criscuolo
author_facet Tulio Lima Criscuolo
author_role author
dc.contributor.advisor1.fl_str_mv Wagner Meira Junior
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/9092587237114334
dc.contributor.advisor2.fl_str_mv Renato Martins Assunção
dc.contributor.advisor2Lattes.fl_str_mv http://lattes.cnpq.br/3575559872183767
dc.contributor.referee1.fl_str_mv Rosangela Helena Loschi
dc.contributor.referee2.fl_str_mv Denis Deratani Mauá
dc.contributor.referee3.fl_str_mv Fabrício Murai Ferreira
dc.contributor.authorLattes.fl_str_mv http://lattes.cnpq.br/3857763744325044
dc.contributor.author.fl_str_mv Tulio Lima Criscuolo
contributor_str_mv Wagner Meira Junior
Renato Martins Assunção
Rosangela Helena Loschi
Denis Deratani Mauá
Fabrício Murai Ferreira
dc.subject.por.fl_str_mv Hierarchical Model
Regression
Clustering
Dimensionality Reduction
topic Hierarchical Model
Regression
Clustering
Dimensionality Reduction
Modelo Hierárquico
Regressão
Clusterização
Redução de dimensão
dc.subject.other.pt_BR.fl_str_mv Modelo Hierárquico
Regressão
Clusterização
Redução de dimensão
description A common difficulty in data analysis is how to handle categorical predictors with a large number of levels or categories. There are few proposals developed in the literature to handle this important and frequent problem. We introduce a generative model that simultaneously carries out the model fitting and the aggregation of the categorical levels into larger groups. Our approach is based on imposing a graph where the nodes are categories and creating a probability distribution over meaningful partitions of this graph. Being a Bayesian model, it allows the posterior inference, including uncertainty measurement, on the estimated parameters and the categories partition. We compare our method with the state-of-art methods showing that it has equally good predictive performance and much better interpretation ability. Given the current concern on balancing accuracy versus interpretability, our proposal reaches an excellent result.
publishDate 2019
dc.date.issued.fl_str_mv 2019-10-25
dc.date.accessioned.fl_str_mv 2020-08-28T19:10:39Z
dc.date.available.fl_str_mv 2020-08-28T19:10:39Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/1843/34069
url http://hdl.handle.net/1843/34069
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.publisher.program.fl_str_mv Programa de Pós-Graduação em Ciência da Computação
dc.publisher.initials.fl_str_mv UFMG
dc.publisher.country.fl_str_mv Brasil
dc.publisher.department.fl_str_mv ICEX - INSTITUTO DE CIÊNCIAS EXATAS
publisher.none.fl_str_mv Universidade Federal de Minas Gerais
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFMG
instname:Universidade Federal de Minas Gerais (UFMG)
instacron:UFMG
instname_str Universidade Federal de Minas Gerais (UFMG)
instacron_str UFMG
institution UFMG
reponame_str Repositório Institucional da UFMG
collection Repositório Institucional da UFMG
bitstream.url.fl_str_mv https://repositorio.ufmg.br/bitstream/1843/34069/3/TulioLimaCriscuolo_substituicaofinal.pdf
https://repositorio.ufmg.br/bitstream/1843/34069/2/license.txt
bitstream.checksum.fl_str_mv bcbea5d3e5685e9c67197f984d02b46e
34badce4be7e31e3adb4575ae96af679
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFMG - Universidade Federal de Minas Gerais (UFMG)
repository.mail.fl_str_mv
_version_ 1803589532260499456