Representation learning of spatio-temporal features from video

Bibliographic details
Main author: Costa, Gabriel de Barros Paranhos da
Publication date: 2019
Document type: Doctoral thesis
Language: eng
Source title: Biblioteca Digital de Teses e Dissertações da USP
Full text: https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03022020-093918/
Abstract: One of the main challenges in computer vision is to encode the information present in images and videos into a feature vector that can later be used, for example, to train a machine learning model. Videos pose an additional challenge, since both spatial and temporal information must be considered. To reduce the need for designing new feature extraction methods by hand, representation learning focuses on building data-driven representations directly from raw data; such methods have achieved state-of-the-art performance on many image-focused computer vision tasks. For these reasons, spatio-temporal representation learning from videos is considered a natural next step. Even though multiple architectures have been proposed for video processing, the results these methods obtain on videos are still comparable to those of hand-crafted feature extraction methods, and their gains fall considerably below those achieved by representation learning on images. We believe that to advance the area of spatio-temporal representation learning, a better understanding of how these methods encode information is required, allowing better-informed decisions about when each architecture should be used. For this purpose, we propose a novel evaluation protocol that examines a synthetic problem in three different settings, where the information relevant to the task appears only in the spatial dimensions, only in the temporal dimension, or in both. We also investigate the advantages of using a representation learning method over hand-crafted feature extraction, especially regarding their use on different (previously unknown) tasks. Lastly, we propose a data-driven regularisation method based on generative networks and knowledge transfer to improve the feature space learnt by representation learning methods. Our results show that when learning spatio-temporal representations it is important to include temporal information at every stage. We also notice that while architectures that use convolutions along the temporal dimension achieved the best results among the tested architectures, they had difficulty adapting to changes in the temporal information. When comparing the performance of hand-crafted and learnt representations on multiple tasks, hand-crafted features obtained better results on the task they were designed for, but considerably worse performance on a second, unrelated task. Finally, we show that generative networks have a promising application in knowledge transfer, even though further investigation is required in a spatio-temporal setting.
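To make the three-setting evaluation protocol described in the abstract concrete, here is a minimal sketch of how such a synthetic problem could be constructed. This is an editor's illustration under assumed parameters, not the protocol actually used in the thesis; the function `make_clip`, the clip dimensions, and the position/blink-rate encoding are all hypothetical.

```python
# Hypothetical sketch of a three-setting synthetic video task; an
# illustration of the idea, not the thesis's actual protocol.
import numpy as np

def make_clip(label, setting, frames=16, size=32, rng=None):
    """Generate one clip of shape (frames, size, size) for a binary task.

    setting: 'spatial'  -> label encoded by the square's position (static cue)
             'temporal' -> label encoded by the square's blink rate (motion cue)
             'both'     -> label encoded by both position and blink rate
    """
    rng = rng if rng is not None else np.random.default_rng()
    clip = rng.normal(0.0, 0.1, (frames, size, size))  # noisy background

    # Spatial cue: left half for label 0, right half for label 1;
    # otherwise a random position that carries no class information.
    if setting in ('spatial', 'both'):
        x = size // 4 if label == 0 else 3 * size // 4
    else:
        x = int(rng.integers(4, size - 4))
    y = size // 2

    # Temporal cue: slow blink for label 0, fast blink for label 1;
    # otherwise a fixed rate that carries no class information.
    if setting in ('temporal', 'both'):
        period = 8 if label == 0 else 2
    else:
        period = 4

    for t in range(frames):
        if (t // period) % 2 == 0:  # square is visible on this frame
            clip[t, y - 2:y + 2, x - 2:x + 2] += 1.0
    return clip

# Example: a small batch for the 'temporal' setting, where only the
# blink rate separates the two classes.
batch = np.stack([make_clip(i % 2, 'temporal') for i in range(16)])
```

Training the same architecture separately on each setting would then reveal whether it can exploit spatial cues, temporal cues, or their combination, which is the kind of comparison such a protocol is meant to support.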
id USP_6a421a8206f3d02ab19c9463e233eb08
oai_identifier_str oai:teses.usp.br:tde-03022020-093918
network_acronym_str USP
network_name_str Biblioteca Digital de Teses e Dissertações da USP
repository_id_str 2721
dc.title.none.fl_str_mv Representation learning of spatio-temporal features from video
Aprendizado de características espaço-temporais em vídeos
author Costa, Gabriel de Barros Paranhos da
author_role author
dc.contributor.none.fl_str_mv Mello, Rodrigo Fernandes de
Ponti, Moacir Antonelli
dc.contributor.author.fl_str_mv Costa, Gabriel de Barros Paranhos da
dc.subject.por.fl_str_mv Aprendizado de características
Aprendizado de máquina
Aprendizado profundo
Computer vision
Deep learning
Extração de características
Feature extraction
Machine learning
Processamento de vídeos
Representation learning
Video processing
Visão computacional
publishDate 2019
dc.date.none.fl_str_mv 2019-09-26
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/doctoralThesis
format doctoralThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03022020-093918/
url https://www.teses.usp.br/teses/disponiveis/55/55134/tde-03022020-093918/
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv
dc.rights.driver.fl_str_mv Release the content for public access.
info:eu-repo/semantics/openAccess
rights_invalid_str_mv Release the content for public access.
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.coverage.none.fl_str_mv
dc.publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
publisher.none.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP
dc.source.none.fl_str_mv
reponame:Biblioteca Digital de Teses e Dissertações da USP
instname:Universidade de São Paulo (USP)
instacron:USP
instname_str Universidade de São Paulo (USP)
instacron_str USP
institution USP
reponame_str Biblioteca Digital de Teses e Dissertações da USP
collection Biblioteca Digital de Teses e Dissertações da USP
repository.name.fl_str_mv Biblioteca Digital de Teses e Dissertações da USP - Universidade de São Paulo (USP)
repository.mail.fl_str_mv virginia@if.usp.br|| atendimento@aguia.usp.br||virginia@if.usp.br
_version_ 1809090328294064128