Unsupervised Learning Approaches for Non-Stationary Data Streams

Bibliographic details
Main author: Garcia, Kemilly Dearo
Publication date: 2021
Document type: Thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-24062021-161645/
Abstract: Modern society is surrounded by applications that generate large volumes of data every day. Today, anyone can monitor their physical activity in real time using a smartphone or wearable device, and businesses and governments can learn more about their clients and citizens by analysing information from social media, for example. Such data is called a data stream when it is a sequence generated continuously, usually at high speed. A data stream is also potentially unbounded in size and may not be strictly stationary. Extracting useful knowledge from data streams is challenging because of several constraints. A learning algorithm for data streams must operate in a dynamic environment, which means it should support real-time processing. Moreover, it should be able to adapt to changes over time, given the non-stationary nature of the stream. In the last few decades, many machine learning approaches have been proposed for data streams. Most of them are based on supervised learning and rely on labelled data to adapt their models to changes in the stream. However, labelling data is usually costly and can require domain expertise. Besides, if the data is collected at high speed, there may not be enough time to label it. In this thesis, we propose unsupervised and incremental machine learning algorithms for data streams. We focus on algorithms that can update their classification model with little or no external feedback. We start by addressing the problem of concept drift in data streams with few labelled data. For that problem, we propose a semi-supervised approach called Sliding Window Clusters. This method learns the current patterns in the data stream by selecting and summarising the most relevant data. We also study how to learn from data streams when novelties appear over time. To that end, we propose an unsupervised learning method called Higia, which is able to classify data as normal, novelty or concept drift. We also propose an approach that combines different unsupervised approaches into a classification model. We test this approach in two scenarios. The first, called Homogeneous Ensemble Clustering for Data Streams, is based on the combination of different runs of the same clustering algorithm. The second, called Heterogeneous Ensemble Clustering for Data Streams, is based on the combination of different clustering algorithms. These methods allow clustering approaches with different biases to be used to obtain a more robust classification model. Furthermore, we evaluate the state-of-the-art approaches commonly referred to in the literature on novelty detection in data streams. Most of this thesis focuses on clustering approaches. However, given the popularity of neural networks, we also propose the Ensemble of Auto-Encoders. This approach is based on the combination of auto-encoders into an ensemble model. Each auto-encoder is specialised in recognising one particular class. The Ensemble of Auto-Encoders has a modular structure, which has the advantage of making the model easy to adapt to changes in the data. Besides, it allows for personalised models because the model can adapt to the most frequently requested classes. This contribution is applied to the problem of Human Activity Recognition. Experimental results show the potential of the approaches mentioned.
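The abstract describes the Ensemble of Auto-Encoders only at a high level, so the following is a minimal sketch under stated assumptions rather than the thesis implementation. It assumes that each class gets its own auto-encoder and that a sample is assigned to the class whose auto-encoder reconstructs it with the lowest error. To keep the example dependency-light, a linear auto-encoder fitted in closed form with an SVD (equivalent to PCA) stands in for a trained neural auto-encoder. The class names `LinearAutoEncoder` and `AutoEncoderEnsemble`, their methods, the activity labels, and the synthetic data are all hypothetical.

```python
import numpy as np


class LinearAutoEncoder:
    """Linear auto-encoder fitted in closed form (equivalent to PCA).

    Assumption: this stands in for a trained neural auto-encoder.
    """

    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        # The leading right singular vectors span the subspace that gives
        # the best linear reconstruction of the centred data.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.components_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, X):
        centred = np.asarray(X, dtype=float) - self.mean_
        codes = centred @ self.components_.T   # encode
        decoded = codes @ self.components_     # decode
        return np.sum((centred - decoded) ** 2, axis=1)


class AutoEncoderEnsemble:
    """One auto-encoder per class; members can be added or removed alone."""

    def __init__(self, n_components=1):
        self.n_components = n_components
        self.members = {}

    def add_class(self, label, X):
        # Training a new member does not touch the existing ones.
        self.members[label] = LinearAutoEncoder(self.n_components).fit(X)

    def remove_class(self, label):
        self.members.pop(label, None)

    def predict(self, X):
        if not self.members:
            raise ValueError("the ensemble has no members")
        labels = list(self.members)
        errors = np.column_stack(
            [self.members[label].reconstruction_error(X) for label in labels]
        )
        # Assumed decision rule: lowest reconstruction error wins.
        return [labels[i] for i in errors.argmin(axis=1)]


# Hypothetical two-dimensional data for two activity classes.
rng = np.random.default_rng(0)
walking = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 0.05, 200)])
running = np.column_stack([rng.normal(5, 0.05, 200), rng.normal(0, 1, 200)])

ensemble = AutoEncoderEnsemble(n_components=1)
ensemble.add_class("walking", walking)
ensemble.add_class("running", running)
print(ensemble.predict(np.array([[2.0, 0.0], [5.0, 2.0]])))

# A new class is added later without retraining the other members.
sitting = np.column_stack([rng.normal(-5, 1, 200), rng.normal(5, 0.05, 200)])
ensemble.add_class("sitting", sitting)
print(ensemble.predict(np.array([[-5.0, 5.0]])))
```

Keeping each member independent is what makes the structure modular in this sketch: adapting to a change in one class means refitting or replacing a single member, and a personalised model can simply hold only the classes a given user needs.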
id USP_f146ea2ddfacd1b94566d8ff277f783e
oai_identifier_str oai:teses.usp.br:tde-24062021-161645
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Unsupervised Learning Approaches for Non-Stationary Data Streams
Abordagens de aprendizagem não supervisionada para fluxos de dados não estacionários
title Unsupervised Learning Approaches for Non-Stationary Data Streams
author Garcia, Kemilly Dearo
author_facet Garcia, Kemilly Dearo
author_role author
dc.contributor.none.fl_str_mv Carvalho, André Carlos Ponce de Leon Ferreira de
dc.contributor.author.fl_str_mv Garcia, Kemilly Dearo
dc.subject.por.fl_str_mv Unsupervised machine learning
Incremental learning
Data streams
Continuous data streams
Unsupervised learning
publishDate 2021
dc.date.none.fl_str_mv 2021-04-16
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-24062021-161645/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-24062021-161645/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1815257307610611712