Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis

Shimizu, Gilson Yuuji

Bibliographic details
Main author:	Shimizu, Gilson Yuuji
Publication date:	2021
Document type:	Thesis
Language:	eng
Source title:	Repositório Institucional da UFSCAR
Full text:	https://repositorio.ufscar.br/handle/ufscar/15024
Abstract:	Machine learning methods are divided into two main groups: supervised and unsupervised methods. In the first part of this work, we develop a method for creating prediction bands that can be applied to supervised problems. Our approach is based on conformal methods, which are appealing because they create prediction bands that control average coverage assuming only i.i.d. data. It is also often desirable to control conditional coverage, that is, coverage for every new test point. However, without strong assumptions, conditional coverage is unachievable. Given this limitation, the literature has focused on methods with asymptotic conditional coverage. To obtain this property, these methods require strong conditions on the dependence between the target variable and the features. We introduce two conformal methods based on conditional density estimators that do not depend on this type of assumption to obtain asymptotic conditional coverage: Dist-split and CD-split. While Dist-split asymptotically obtains optimal intervals, which are easier to interpret than general regions, CD-split obtains optimal-size regions, which are smaller than intervals. CD-split also obtains local coverage by creating prediction bands locally on a partition of the feature space. This partition is data-driven and scales to high-dimensional settings. In a wide variety of simulated scenarios, our methods control conditional coverage better and produce shorter bands than previously proposed methods. In the second part, in the context of unsupervised methods, we develop a new version of the Latent Dirichlet Allocation (LDA) model. The LDA model is a popular method for creating mixed-membership clusters. Although it was originally developed for text analysis, LDA has been used in a wide range of other applications. We propose a new formulation of the LDA model that incorporates covariates. In this model, a negative binomial regression is embedded within LDA, enabling straightforward interpretation of the regression coefficients and analysis of the quantity of cluster-specific elements in each sampling unit (instead of focusing on modeling the proportion of each cluster, as in Structural Topic Models). We use slice sampling within a Gibbs sampling algorithm to estimate model parameters. We rely on simulations to show that our algorithm successfully retrieves the true parameter values. The model is illustrated using real data sets from three different areas: text mining of coronavirus articles, analysis of grocery shopping baskets, and ecology of tree species on Barro Colorado Island (Panama). This model allows the identification of mixed-membership clusters in discrete data and provides inference on the relationship between covariates and the abundance of these clusters.
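As a rough illustration of the average-coverage guarantee mentioned in the abstract, the sketch below implements the standard split-conformal band with absolute-residual scores around a least-squares line. It is not Dist-split or CD-split, which the abstract says rely on conditional density estimators. The function name, the simulated data, and the choice of a linear model are all assumptions made for this example.

```python
import numpy as np


def split_conformal_interval(x_train, y_train, x_cal, y_cal, x_new, alpha=0.1):
    """Standard split-conformal band (illustrative baseline, hypothetical helper)."""
    # Fit a simple linear model on the training split only.
    slope, intercept = np.polyfit(x_train, y_train, deg=1)

    def predict(x):
        return slope * np.asarray(x) + intercept

    # Conformity scores: absolute residuals on the held-out calibration split.
    scores = np.sort(np.abs(y_cal - predict(x_cal)))
    n = len(scores)
    # Finite-sample-corrected rank: ceil((n + 1) * (1 - alpha)).
    k = int(np.ceil((n + 1) * (1 - alpha)))
    half_width = scores[k - 1] if k <= n else np.inf
    center = predict(x_new)
    return center - half_width, center + half_width


# Simulated i.i.d. data (an assumption for this sketch): y = 2x + Gaussian noise.
rng = np.random.default_rng(0)


def simulate(n):
    x = rng.uniform(-1.0, 1.0, n)
    y = 2.0 * x + rng.normal(0.0, 0.5, n)
    return x, y


x_tr, y_tr = simulate(500)
x_cal, y_cal = simulate(500)
x_te, y_te = simulate(2000)

lower, upper = split_conformal_interval(x_tr, y_tr, x_cal, y_cal, x_te, alpha=0.1)
# Fraction of test responses that fall inside their band (average coverage).
coverage = np.mean((y_te >= lower) & (y_te <= upper))
print(f"empirical average coverage: {coverage:.3f}")
```

The band has the same width at every test point, so it targets coverage on average over the features only. That constant width is the limitation the abstract refers to when it discusses conditional coverage.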

Item metadata

id	SCAR_918f72161ec3279a3063703e48b0d3b7
oai_identifier_str	oai:repositorio.ufscar.br:ufscar/15024
network_acronym_str	SCAR
network_name_str	Repositório Institucional da UFSCAR
repository_id_str	4322
spelling	Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES); CAPES: Código de Financiamento 001
dc.title.por.fl_str_mv	Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis
dc.title.alternative.eng.fl_str_mv	Prediction bands using estimated conditional density and an LDA model with covariates
author	Shimizu, Gilson Yuuji
author_facet	Shimizu, Gilson Yuuji
author_role	author
dc.contributor.authorlattes.por.fl_str_mv	http://lattes.cnpq.br/7533681983634233
dc.contributor.author.fl_str_mv	Shimizu, Gilson Yuuji
dc.contributor.advisor1.fl_str_mv	Izbicki, Rafael
dc.contributor.advisor1Lattes.fl_str_mv	http://lattes.cnpq.br/9991192137633896
dc.contributor.authorID.fl_str_mv	55d0e00a-e38f-4c7c-8d48-6225abfdfe7f
contributor_str_mv	Izbicki, Rafael
dc.subject.por.fl_str_mv	Aprendizagem de máquina Análise de texto Alocação latente de Dirichlet (LDA) Bandas de predição Predição conformal
dc.subject.eng.fl_str_mv	Machine learning Text analysis Latent Dirichlet allocation (LDA) Prediction bands Conformal prediction
dc.subject.cnpq.fl_str_mv	CIENCIAS EXATAS E DA TERRA::PROBABILIDADE E ESTATISTICA::ESTATISTICA
publishDate	2021
dc.date.accessioned.fl_str_mv	2021-10-18T19:47:04Z
dc.date.available.fl_str_mv	2021-10-18T19:47:04Z
dc.date.issued.fl_str_mv	2021-10-15
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv	info:eu-repo/semantics/doctoralThesis
format	doctoralThesis
status_str	publishedVersion
dc.identifier.citation.fl_str_mv	SHIMIZU, Gilson Yuuji. Bandas de predição usando densidade condicional estimada e um modelo LDA com covariáveis. 2021. Tese (Doutorado em Estatística) – Universidade Federal de São Carlos, São Carlos, 2021. Disponível em: https://repositorio.ufscar.br/handle/ufscar/15024.
dc.identifier.uri.fl_str_mv	https://repositorio.ufscar.br/handle/ufscar/15024
url	https://repositorio.ufscar.br/handle/ufscar/15024
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.confidence.fl_str_mv	600 600
dc.relation.authority.fl_str_mv	3e57f161-19fe-4345-9e87-bc60eb7be98f
dc.rights.driver.fl_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/ info:eu-repo/semantics/openAccess
rights_invalid_str_mv	Attribution-NonCommercial-NoDerivs 3.0 Brazil http://creativecommons.org/licenses/by-nc-nd/3.0/br/
eu_rights_str_mv	openAccess
dc.publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.publisher.program.fl_str_mv	Programa Interinstitucional de Pós-Graduação em Estatística - PIPGEs
dc.publisher.initials.fl_str_mv	UFSCar
publisher.none.fl_str_mv	Universidade Federal de São Carlos Câmpus São Carlos
dc.source.none.fl_str_mv	reponame:Repositório Institucional da UFSCAR instname:Universidade Federal de São Carlos (UFSCAR) instacron:UFSCAR
instname_str	Universidade Federal de São Carlos (UFSCAR)
instacron_str	UFSCAR
institution	UFSCAR
reponame_str	Repositório Institucional da UFSCAR
collection	Repositório Institucional da UFSCAR
bitstream.url.fl_str_mv	https://repositorio.ufscar.br/bitstream/ufscar/15024/1/Tese_Gilson_Ufscar_211018.pdf https://repositorio.ufscar.br/bitstream/ufscar/15024/3/cartacomprovantepipges.pdf https://repositorio.ufscar.br/bitstream/ufscar/15024/4/license_rdf https://repositorio.ufscar.br/bitstream/ufscar/15024/5/Tese_Gilson_Ufscar_211018.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15024/7/cartacomprovantepipges.pdf.txt https://repositorio.ufscar.br/bitstream/ufscar/15024/6/Tese_Gilson_Ufscar_211018.pdf.jpg https://repositorio.ufscar.br/bitstream/ufscar/15024/8/cartacomprovantepipges.pdf.jpg
bitstream.checksum.fl_str_mv	dd820a7f248d1a8f3800a9f79b5b2405 8b6417109e89f0383b931afaf6c70c3f e39d27027a6cc9cb039ad269a5db8e34 47bb58537ed60f16d243145ee8bfb3c0 0d08371cc7c1bc5edfbba92324e8f648 df101586af22bb82ab9b0ee21fe475db ee4b8b019ea9fc371b3c23f1c3895f35
bitstream.checksumAlgorithm.fl_str_mv	MD5 MD5 MD5 MD5 MD5 MD5 MD5
repository.name.fl_str_mv	Repositório Institucional da UFSCAR - Universidade Federal de São Carlos (UFSCAR)
repository.mail.fl_str_mv
_version_	1802136397110312960
