Visual analytics for machine learning - computing and leveraging decision boundary maps

Bibliographic details
Main author: Rodrigues, Francisco Caio Maia
Publication date: 2020
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/45/45134/tde-27112020-071803/
Abstract: Machine learning classifiers construct decision boundaries that partition the data space into a set of regions to which labels are assigned. Understanding these decision boundaries can notably help the practical use of such classifiers (for example, by showing how a given model is expected to behave in an empty region), and can give insight into how to improve the training of a given model (for example, by indicating where more training data should be provided). In this thesis we propose and explore visual analytics methods for the explicit construction and use of decision zones of machine learning classifiers. Current methods for visualizing how a classifier behaves on a dataset mainly use color-coded sample scatterplots, which do not explicitly show the actual decision boundaries or confusion zones. We propose an image-based technique to improve such visualizations. The method samples the 2D space of a projection and color-codes relevant classifier outputs, such as the majority class label, the confusion, and the sample density, to create a dense visual depiction of the high-dimensional decision boundaries. Our technique is simple to implement, handles any classifier, and has only two simple-to-control free parameters. We demonstrate our proposal on several real-world high-dimensional datasets, classifiers, and direct and inverse projection techniques. To our knowledge, our work is the first that can create such explicit depictions of decision boundaries and decision zones for any dataset and any classifier, without explicit knowledge of the classifier's internals. Based on these visual depictions of decision boundaries, we developed a visual analytics workflow and associated tooling that allows users to perform two common techniques in machine learning: data augmentation and interactive labeling of unseen samples. We show that our approach can be used to perform guided data augmentation in order to shape the decision boundaries learned by a classifier according to the user's input. For interactive labeling, we show that our proposed visual depiction of decision boundaries helps produce improved labels in an active learning scenario.
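The image-based idea in the abstract (sample the 2D projection space, map each sample back to the data space, classify it, and record the majority label and the confusion per pixel) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy inverse projection, the toy classifier, and the choice of `resolution` and `samples_per_pixel` as the two knobs are all assumptions made for illustration, and confusion is defined here simply as the fraction of samples in a pixel that disagree with the majority label. Sample density and color-coding are left out.

```python
import random
from collections import Counter


def decision_boundary_map(classify, inverse_project, resolution=8,
                          samples_per_pixel=4, seed=0):
    """Build majority-label and confusion grids over the unit square.

    classify:        maps a high-dimensional point to a class label
    inverse_project: maps a 2D point (x, y) back to a high-dimensional point
    """
    rng = random.Random(seed)
    labels, confusion = [], []
    for row in range(resolution):
        label_row, confusion_row = [], []
        for col in range(resolution):
            votes = Counter()
            for _ in range(samples_per_pixel):
                # Draw a random 2D point inside the current pixel.
                x = (col + rng.random()) / resolution
                y = (row + rng.random()) / resolution
                votes[classify(inverse_project((x, y)))] += 1
            majority, count = votes.most_common(1)[0]
            label_row.append(majority)
            # 0.0 means all samples agree; larger values mean more disagreement.
            confusion_row.append(1.0 - count / samples_per_pixel)
        labels.append(label_row)
        confusion.append(confusion_row)
    return labels, confusion


def toy_inverse_projection(point_2d):
    # Hypothetical inverse projection: lift (x, y) to a 3D point.
    x, y = point_2d
    return (x, y, x + y)


def toy_classifier(point_nd):
    # Hypothetical black-box classifier with two classes, 0 and 1.
    return 1 if point_nd[2] > 1.0 else 0


if __name__ == "__main__":
    labels, confusion = decision_boundary_map(
        toy_classifier, toy_inverse_projection, resolution=8)
    for label_row, confusion_row in zip(labels, confusion):
        # Mark pixels whose samples disagree with '?', others with their label.
        print("".join("?" if c > 0 else str(l)
                      for l, c in zip(label_row, confusion_row)))
```

The classifier is only ever called as a black box, which mirrors the abstract's claim that no knowledge of the classifier's internals is required. In a real setting the toy functions would be replaced by a trained model and a learned inverse projection.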
id USP_c3f4103ca965ceb4f975a3c68b6d2e93
oai_identifier_str oai:teses.usp.br:tde-27112020-071803
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
Abstract (translated from the Portuguese version): Machine learning models known as classifiers construct decision boundaries that partition a data space into a set of regions, each associated with a label. Understanding the structure and shape of these decision boundaries can greatly help the practical use of such classifiers, for example by answering how a given model is expected to behave in an empty region of the space. Such understanding can also suggest ways to improve the training of a given model, for example by indicating where more training data could be collected. In this thesis, we propose and explore visualization methods for creating and using visual models of the decision boundaries inferred by machine learning classifiers. Currently, methods used to visualize the behavior of a classifier trained on a given dataset rely on scatterplots, coloring the points according to the class assigned by the model. We propose an image-based technique to improve such visualizations. Our method samples the 2D space of a projection and encodes in the pixel colors relevant aspects of a trained classifier, such as the majority label in that region, the degree of confusion, and the sample density, creating a dense image of the boundaries inferred in high-dimensional spaces. The proposed method is simple to implement, works with any classifier, and has only two intuitive parameters. We demonstrate the technique on different high-dimensional datasets, classifiers, and direct and inverse projections. To our knowledge, our work is the first able to create such explicit visualizations of classifier boundaries for any dataset and classifier, without requiring knowledge of the models' internals. Based on these visual descriptions of the decision boundaries, we developed a visual analytics workflow and a graphical tool that lets users label samples interactively. We also show that the proposed visualization method can help in labeling scenarios such as active learning.
dc.title.none.fl_str_mv Visual analytics for machine learning - computing and leveraging decision boundary maps
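Visual analytics para aprendizado de máquina - computando e analisando mapas da fronteira de decisão de classificadores [English: Visual analytics for machine learning - computing and analyzing decision boundary maps of classifiers]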
author Rodrigues, Francisco Caio Maia
author_facet Rodrigues, Francisco Caio Maia
author_role author
dc.contributor.none.fl_str_mv Hirata Junior, Roberto
dc.contributor.author.fl_str_mv Rodrigues, Francisco Caio Maia
dc.subject.por.fl_str_mv Machine learning
Dimensionality reduction
Visual analytics
Data visualization
publishDate 2020
dc.date.none.fl_str_mv 2020-11-09
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/45/45134/tde-27112020-071803/
url https://www.teses.usp.br/teses/disponiveis/45/45134/tde-27112020-071803/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815257421874987008