Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis
Main author: | Shimizu, Gilson Yuuji |
---|---|
Publication date: | 2021 |
Document type: | Thesis |
Language: | eng |
Source title: | Repositório Institucional da UFSCAR |
Full text: | https://repositorio.ufscar.br/handle/ufscar/15024 |
Abstract: | Machine learning methods are divided into two main groups: supervised and unsupervised methods. In the first part of this work, we develop a method for creating prediction bands that can be applied to supervised problems. Our approach is based on conformal methods, which are very appealing because they create prediction bands that control average coverage assuming solely i.i.d. data. It is also often desirable to control conditional coverage, that is, coverage for every new test point. However, without strong assumptions, conditional coverage is unachievable. Given this limitation, the literature has focused on methods with asymptotic conditional coverage. In order to obtain this property, these methods require strong conditions on the dependence between the target variable and the features. We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to interpret than general regions, CD-split obtains optimal-size regions, which are smaller than intervals. CD-split also obtains local coverage by creating prediction bands locally on a partition of the feature space. This partition is data-driven and scales to high-dimensional settings. In a wide variety of simulated scenarios, our methods have better control of conditional coverage and smaller length than previously proposed methods. In the second part, in the context of unsupervised methods, we develop a new version of the Latent Dirichlet Allocation (LDA) model. The LDA model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling unit (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters. |
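
The abstract's starting point, that conformal methods control average coverage under i.i.d. data alone, can be illustrated with a minimal sketch of generic split-conformal prediction. This is not the thesis's Dist-split or CD-split: the least-squares line, the absolute-residual score, the simulated data, and every name below (`split_conformal_band`, `simulate`, `alpha`) are assumptions chosen only for illustration.

```python
import numpy as np


def split_conformal_band(x_train, y_train, x_cal, y_cal, x_new, alpha=0.1):
    """Generic split-conformal band around a least-squares line (illustrative only)."""
    # Fit the point predictor on the training split only.
    coef = np.polyfit(x_train, y_train, deg=1)
    # Conformity scores: absolute residuals on the held-out calibration split.
    scores = np.sort(np.abs(y_cal - np.polyval(coef, x_cal)))
    n = scores.size
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    cutoff = scores[k - 1] if k <= n else np.inf
    center = np.polyval(coef, x_new)
    return center - cutoff, center + cutoff


rng = np.random.default_rng(0)


def simulate(n):
    # Hypothetical data: linear trend with noise that grows with |x|.
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 1.5 * x + rng.normal(scale=0.5 + np.abs(x), size=n)
    return x, y


x_train, y_train = simulate(1000)
x_cal, y_cal = simulate(1000)
x_test, y_test = simulate(5000)

lower, upper = split_conformal_band(x_train, y_train, x_cal, y_cal, x_test, alpha=0.1)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical average coverage: {coverage:.3f}")
```

Because this band has the same width at every `x`, it can only be expected to cover at the nominal rate on average; with noise that varies across `x`, as simulated here, it will tend to over-cover where the noise is small and under-cover where it is large. That is the conditional-coverage gap the abstract describes.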
id |
SCAR_918f72161ec3279a3063703e48b0d3b7 |
---|---|
oai_identifier_str |
oai:repositorio.ufscar.br:ufscar/15024 |
network_acronym_str |
SCAR |
network_name_str |
Repositório Institucional da UFSCAR |
repository_id_str |
4322 |
spelling |
Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); CAPES: Finance Code 001 (Código de Financiamento 001). Files: Tese_Gilson_Ufscar_211018.pdf (application/pdf, size 8675639); cartacomprovantepipges.pdf (application/pdf, size 156662); license_rdf (application/rdf+xml; charset=utf-8, size 811). |
dc.title.por.fl_str_mv |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
dc.title.alternative.eng.fl_str_mv |
Prediction bands using estimated conditional density and an LDA model with covariates |
title |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
spellingShingle |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis; Shimizu, Gilson Yuuji; Aprendizagem de máquina; Análise de texto; Alocação latente de Dirichlet (LDA); Bandas de predição; Predição conformal; Machine learning; Text analysis; Latent Dirichlet allocation (LDA); Prediction bands; Conformal prediction; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA |
title_short |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
title_full |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
title_fullStr |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
title_full_unstemmed |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
title_sort |
Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis |
author |
Shimizu, Gilson Yuuji |
author_facet |
Shimizu, Gilson Yuuji |
author_role |
author |
dc.contributor.authorlattes.por.fl_str_mv |
http://lattes.cnpq.br/7533681983634233 |
dc.contributor.author.fl_str_mv |
Shimizu, Gilson Yuuji |
dc.contributor.advisor1.fl_str_mv |
Izbicki, Rafael |
dc.contributor.advisor1Lattes.fl_str_mv |
http://lattes.cnpq.br/9991192137633896 |
dc.contributor.authorID.fl_str_mv |
55d0e00a-e38f-4c7c-8d48-6225abfdfe7f |
contributor_str_mv |
Izbicki, Rafael |
dc.subject.por.fl_str_mv |
Aprendizagem de máquina; Análise de texto; Alocação latente de Dirichlet (LDA); Bandas de predição; Predição conformal |
topic |
Aprendizagem de máquina; Análise de texto; Alocação latente de Dirichlet (LDA); Bandas de predição; Predição conformal; Machine learning; Text analysis; Latent Dirichlet allocation (LDA); Prediction bands; Conformal prediction; CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA |
dc.subject.eng.fl_str_mv |
Machine learning; Text analysis; Latent Dirichlet allocation (LDA); Prediction bands; Conformal prediction |
dc.subject.cnpq.fl_str_mv |
CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA |
description |
Machine learning methods are divided into two main groups: supervised and unsupervised methods. In the first part of this work, we develop a method for creating prediction bands that can be applied to supervised problems. Our approach is based on conformal methods, which are very appealing because they create prediction bands that control average coverage assuming solely i.i.d. data. It is also often desirable to control conditional coverage, that is, coverage for every new test point. However, without strong assumptions, conditional coverage is unachievable. Given this limitation, the literature has focused on methods with asymptotic conditional coverage. In order to obtain this property, these methods require strong conditions on the dependence between the target variable and the features. We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to interpret than general regions, CD-split obtains optimal-size regions, which are smaller than intervals. CD-split also obtains local coverage by creating prediction bands locally on a partition of the feature space. This partition is data-driven and scales to high-dimensional settings. In a wide variety of simulated scenarios, our methods have better control of conditional coverage and smaller length than previously proposed methods. In the second part, in the context of unsupervised methods, we develop a new version of the Latent Dirichlet Allocation (LDA) model. The LDA model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation for the LDA model which incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling unit (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show how our algorithm is able to successfully retrieve the true parameter values. The model is illustrated using real data sets from three different areas: text-mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters. |
publishDate |
2021 |
dc.date.accessioned.fl_str_mv |
2021-10-18T19:47:04Z |
dc.date.available.fl_str_mv |
2021-10-18T19:47:04Z |
dc.date.issued.fl_str_mv |
2021-10-15 |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/doctoralThesis |
format |
doctoralThesis |
status_str |
publishedVersion |
dc.identifier.citation.fl_str_mv |
SHIMIZU, Gilson Yuuji. Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis. 2021. Tese (Doutorado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/15024. |
dc.identifier.uri.fl_str_mv |
https://repositorio.ufscar.br/handle/ufscar/15024 |
identifier_str_mv |
SHIMIZU, Gilson Yuuji. Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis. 2021. Tese (Doutorado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/15024. |
url |
https://repositorio.ufscar.br/handle/ufscar/15024 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.confidence.fl_str_mv |
600 600 |
dc.relation.authority.fl_str_mv |
3e57f161-19fe-4345-9e87-bc60eb7be98f |
dc.rights.driver.fl_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess |
rights_invalid_str_mv |
Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ |
eu_rights_str_mv |
openAccess |
dc.publisher.none.fl_str_mv |
Universidade Federal de São Carlos Câmpus São Carlos |
dc.publisher.program.fl_str_mv |
Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs |
dc.publisher.initials.fl_str_mv |
UFSCar |
publisher.none.fl_str_mv |
Universidade Federal de São Carlos Câmpus São Carlos |
dc.source.none.fl_str_mv |
reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR |
instname_str |
Universidade Federal de São Carlos (UFSCAR) |
instacron_str |
UFSCAR |
institution |
UFSCAR |
reponame_str |
Repositório Institucional da UFSCAR |
collection |
Repositório Institucional da UFSCAR |
bitstream.url.fl_str_mv |
https://repositorio.ufscar.br/bitstream/ufscar/15024/1/Tese_Gilson_Ufscar_211018.pdf https://repositorio.ufscar.br/bitstream/ufscar/15024/3/cartacomprovantepipges.pdf https://repositorio.ufscar.br/bitstream/ufscar/15024/4/license_rdf https://repositorio.ufscar.br/bitstream/ufscar/15024/5/Tese_Gilson_Ufscar_211018.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15024/7/cartacomprovantepipges.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15024/6/Tese_Gilson_Ufscar_211018.pdf.jpg https://repositorio.ufscar.br/bitstream/ufscar/15024/8/cartacomprovantepipges.pdf.jpg |
bitstream.checksum.fl_str_mv |
dd820a7f248d1a8f3800a9f79b5b2405 8b6417109e89f0383b931afaf6c70c3f e39d27027a6cc9cb039ad269a5db8e34 47bb58537ed60f16d243145ee8bfb3c0 0d08371cc7c1bc5edfbba92324e8f648 df101586af22bb82ab9b0ee21fe475db ee4b8b019ea9fc371b3c23f1c3895f35 |
bitstream.checksumAlgorithm.fl_str_mv |
MD5 MD5 MD5 MD5 MD5 MD5 MD5 |
repository.name.fl_str_mv |
Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR) |
repository.mail.fl_str_mv |
|
_version_ |
1802136397110312960 |