Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis

Bibliographic details
Main author: Shimizu, Gilson Yuuji
Publication date: 2021
Document type: Thesis
Language: eng
Source title: Repositório Institucional da UFSCAR
Full text: https://repositorio.ufscar.br/handle/ufscar/15024
Abstract: Machine learning methods are divided into two main groups: supervised and unsupervised methods. In the first part of this work, we develop a method for creating prediction bands that can be applied to supervised problems. Our approach is based on conformal methods, which are very appealing because they create prediction bands that control average coverage assuming solely i.i.d. data. It is also often desirable to control conditional coverage, that is, coverage for every new test point. However, without strong assumptions, conditional coverage is unachievable. Given this limitation, the literature has focused on methods with asymptotic conditional coverage. In order to obtain this property, these methods require strong conditions on the dependence between the target variable and the features. We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to interpret than general regions, CD-split obtains optimal-size regions, which are smaller than intervals. CD-split also obtains local coverage by creating prediction bands locally on a partition of the feature space. This partition is data-driven and scales to high-dimensional settings. In a wide variety of simulated scenarios, our methods control conditional coverage better and have smaller length than previously proposed methods. In the second part, in the context of unsupervised methods, we develop a new version of the Latent Dirichlet Allocation (LDA) model. The LDA model is a popular method for creating mixed-membership clusters. Despite having been originally developed for text analysis, LDA has been used for a wide range of other applications. We propose a new formulation of the LDA model that incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and the analysis of the quantity of cluster-specific elements in each sampling unit (instead of the analysis being focused on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show that our algorithm successfully retrieves the true parameter values. The model is illustrated using real data sets from three different areas: text mining of Coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
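The abstract's statement that conformal methods control average coverage under i.i.d. data alone can be illustrated with a minimal sketch of standard split conformal prediction. This is not the thesis's Dist-split or CD-split: those build their scores from conditional density estimates, whereas this sketch swaps in the simpler absolute-residual score. The function names `fit_line` and `split_conformal_band`, the one-feature linear model, and the simulated data are all assumptions made for illustration.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (single feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def split_conformal_band(xs, ys, alpha=0.1, seed=0):
    """Return a function x -> (lower, upper) built by split conformal prediction.

    Assumption: absolute residuals are used as conformity scores, which is the
    textbook choice and not the density-based score described in the abstract.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    half = len(idx) // 2
    train, calib = idx[:half], idx[half:]

    # Fit the point predictor on the training half only.
    intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])

    def predict(x):
        return intercept + slope * x

    # Conformity scores on the held-out calibration half.
    scores = sorted(abs(ys[i] - predict(xs[i])) for i in calib)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    k = math.ceil((len(scores) + 1) * (1 - alpha))
    cutoff = scores[k - 1] if k <= len(scores) else math.inf

    def band(x):
        center = predict(x)
        return center - cutoff, center + cutoff

    return band


if __name__ == "__main__":
    rng = random.Random(1)
    xs = [rng.uniform(0, 10) for _ in range(2000)]
    ys = [2.0 * x + rng.gauss(0, 1) for x in xs]
    band = split_conformal_band(xs[:1000], ys[:1000], alpha=0.1)
    hits = sum(lo <= y <= hi
               for x, y in zip(xs[1000:], ys[1000:])
               for lo, hi in [band(x)])
    print(f"empirical coverage: {hits / 1000:.3f}")
```

The band has the same width at every x, so it controls coverage only on average over test points. That is the limitation the abstract refers to when it distinguishes average coverage from conditional coverage.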
id SCAR_918f72161ec3279a3063703e48b0d3b7
oai_identifier_str oai:repositorio.ufscar.br:ufscar/15024
network_acronym_str SCAR
network_name_str Repositório Institucional da UFSCAR
repository_id_str 4322
spelling Funding: Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), CAPES: Funding Code 001; Files: Tese_Gilson_Ufscar_211018.pdf (application/pdf, size 8675639), cartacomprovantepipges.pdf (application/pdf, size 156662); Record last updated: 2023-09-18T18:32:18; OAI endpoint: https://repositorio.ufscar.br/oai/request; opendoar:4322
dc.title.por.fl_str_mv Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis
dc.title.alternative.eng.fl_str_mv Prediction bands using estimated conditional density and an LDA model with covariates
author Shimizu, Gilson Yuuji
author_facet Shimizu, Gilson Yuuji
author_role author
dc.contributor.authorlattes.por.fl_str_mv http://lattes.cnpq.br/7533681983634233
dc.contributor.author.fl_str_mv Shimizu, Gilson Yuuji
dc.contributor.advisor1.fl_str_mv Izbicki, Rafael
dc.contributor.advisor1Lattes.fl_str_mv http://lattes.cnpq.br/9991192137633896
dc.contributor.authorID.fl_str_mv 55d0e00a-e38f-4c7c-8d48-6225abfdfe7f
contributor_str_mv Izbicki, Rafael
dc.subject.por.fl_str_mv Aprendizagem de máquina
Análise de texto
Alocação latente de Dirichlet (LDA)
Bandas de predição
Predição conformal
dc.subject.eng.fl_str_mv Machine learning
Text analysis
Latent Dirichlet allocation (LDA)
Prediction bands
Conformal prediction
dc.subject.cnpq.fl_str_mv CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA
publishDate 2021
dc.date.accessioned.fl_str_mv 2021-10-18T19:47:04Z
dc.date.available.fl_str_mv 2021-10-18T19:47:04Z
dc.date.issued.fl_str_mv 2021-10-15
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.citation.fl_str_mv SHIMIZU, Gilson Yuuji. Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis. 2021. Tese (Doutorado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/15024.
dc.identifier.uri.fl_str_mv https://repositorio.ufscar.br/handle/ufscar/15024
dc.language.iso.fl_str_mv eng
language eng
dc.relation.confidence.fl_str_mv 600
600
dc.relation.authority.fl_str_mv 3e57f161-19fe-4345-9e87-bc60eb7be98f
dc.rights.driver.fl_str_mv Attribution-NonCommercial-NoDerivs 3.0 Brazil
http://creativecommons.org/licenses/by-nc-nd/3.0/br/
info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.publisher.none.fl_str_mv Universidade Federal de São Carlos
Câmpus São Carlos
dc.publisher.program.fl_str_mv Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs
dc.publisher.initials.fl_str_mv UFSCar
dc.source.none.fl_str_mv reponame:Repositório Institucional da UFSCAR
instname:Universidade Federal de São Carlos (UFSCAR)
instacron:UFSCAR
instname_str Universidade Federal de São Carlos (UFSCAR)
instacron_str UFSCAR
institution UFSCAR
reponame_str Repositório Institucional da UFSCAR
collection Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv https://repositorio.ufscar.br/bitstream/ufscar/15024/1/Tese_Gilson_Ufscar_211018.pdf
https://repositorio.ufscar.br/bitstream/ufscar/15024/3/cartacomprovantepipges.pdf
https://repositorio.ufscar.br/bitstream/ufscar/15024/4/license_rdf
https://repositorio.ufscar.br/bitstream/ufscar/15024/5/Tese_Gilson_Ufscar_211018.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/15024/7/cartacomprovantepipges.pdf.txt
https://repositorio.ufscar.br/bitstream/ufscar/15024/6/Tese_Gilson_Ufscar_211018.pdf.jpg
https://repositorio.ufscar.br/bitstream/ufscar/15024/8/cartacomprovantepipges.pdf.jpg
bitstream.checksum.fl_str_mv dd820a7f248d1a8f3800a9f79b5b2405
8b6417109e89f0383b931afaf6c70c3f
e39d27027a6cc9cb039ad269a5db8e34
47bb58537ed60f16d243145ee8bfb3c0
0d08371cc7c1bc5edfbba92324e8f648
df101586af22bb82ab9b0ee21fe475db
ee4b8b019ea9fc371b3c23f1c3895f35
bitstream.checksumAlgorithm.fl_str_mv MD5
MD5
MD5
MD5
MD5
MD5
MD5
repository.name.fl_str_mv Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
_version_ 1802136397110312960