Real-time Human Action Localization in the Wild

Bibliographic details
Main author: Pereira, João Alexandre Cardeira
Publication date: 2023
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10362/163514
Abstract: Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
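To make the first contribution concrete, the following is a minimal sketch of how an input variant could be chosen from a cheap background-complexity estimate, in the spirit of what the abstract describes. It is not the thesis implementation: the frame-differencing proxy for background subtraction, the 0.05 threshold, and the two variant shapes (every second frame at 112×112 versus the full clip at 224×224) are illustrative assumptions.

```python
# Minimal sketch (not the thesis implementation): choose between two input
# variants for a spatiotemporal action localization model based on a cheap
# background-complexity estimate of the clip. Threshold, variant shapes and
# the frame-differencing proxy are illustrative assumptions.
import numpy as np
import cv2


def background_complexity(clip: np.ndarray) -> float:
    """Fraction of pixels that change between consecutive frames.

    clip: uint8 array of shape (T, H, W, 3). Simple frame differencing
    stands in here for the lightweight background subtraction method.
    """
    gray = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in clip])
    diffs = np.abs(gray[1:].astype(np.int16) - gray[:-1].astype(np.int16))
    return float((diffs > 25).mean())  # share of "foreground" pixels


def select_input_variant(clip: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Return a reduced variant for simple backgrounds, the full one otherwise."""
    if background_complexity(clip) < threshold:
        # Simple background: a shorter, smaller clip is assumed to be enough.
        reduced = clip[::2]  # keep every second frame
        return np.stack([cv2.resize(f, (112, 112)) for f in reduced])
    # Complex background: keep the full-size input for the model.
    return np.stack([cv2.resize(f, (224, 224)) for f in clip])
```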
id RCAP_2fd886e2162d857bd3b039ab406c6cac
oai_identifier_str oai:run.unl.pt:10362/163514
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
spelling Real-time Human Action Localization in the Wild
Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
Neves, João
Semedo, David
RUN
Pereira, João Alexandre Cardeira
2024-02-14T11:24:56Z
2023-12
2023-12-01T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/163514
eng
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
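The frame-selection method summarised in the abstract (scoring every frame of the clip with a lightweight convolutional network and keeping the top-scoring ones as input) could look roughly like the sketch below. The tiny architecture, the scalar score head and k = 8 are assumptions made for illustration, not the network used in the thesis.

```python
# Minimal sketch (hypothetical, not the thesis code): score each frame of a
# clip with a small CNN and keep the top-k frames as model input.
import torch
import torch.nn as nn


class FrameScorer(nn.Module):
    """Tiny per-frame scorer: two conv layers followed by a scalar head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) -> one score per frame, shape (T,)
        feats = self.features(frames).flatten(1)
        return self.head(feats).squeeze(-1)


def select_top_frames(clip: torch.Tensor, scorer: FrameScorer, k: int = 8) -> torch.Tensor:
    """Keep the k highest-scoring frames, preserving temporal order."""
    with torch.no_grad():
        scores = scorer(clip)                   # (T,)
    idx = scores.topk(k).indices.sort().values  # chronological order
    return clip[idx]                            # (k, 3, H, W)
```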
dc.title.none.fl_str_mv Real-time Human Action Localization in the Wild
title Real-time Human Action Localization in the Wild
spellingShingle Real-time Human Action Localization in the Wild
Pereira, João Alexandre Cardeira
Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
title_short Real-time Human Action Localization in the Wild
title_full Real-time Human Action Localization in the Wild
title_fullStr Real-time Human Action Localization in the Wild
title_full_unstemmed Real-time Human Action Localization in the Wild
title_sort Real-time Human Action Localization in the Wild
author Pereira, João Alexandre Cardeira
author_facet Pereira, João Alexandre Cardeira
author_role author
dc.contributor.none.fl_str_mv Neves, João
Semedo, David
RUN
dc.contributor.author.fl_str_mv Pereira, João Alexandre Cardeira
dc.subject.por.fl_str_mv Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
topic Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
description Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
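The Autoencoder-based contribution, which produces a representation of the clip with fewer frames while still drawing on every input frame, might be sketched as follows. The single temporally strided Conv3d encoder, the 16-to-8 frame ratio and the MSE reconstruction objective are assumptions for illustration, not the architecture proposed in the dissertation.

```python
# Minimal sketch (an assumption about the general idea, not the thesis
# architecture): a 3D-convolutional encoder compresses a 16-frame clip into
# an 8-frame representation that still depends on every input frame; the
# decoder is only used during training to reconstruct the original clip.
import torch
import torch.nn as nn


class ClipCompressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Temporal stride 2 halves the number of frames (16 -> 8) while the
        # spatial resolution is preserved.
        self.encoder = nn.Conv3d(3, 3, kernel_size=(3, 3, 3),
                                 stride=(2, 1, 1), padding=(1, 1, 1))
        self.decoder = nn.ConvTranspose3d(3, 3, kernel_size=(4, 3, 3),
                                          stride=(2, 1, 1), padding=(1, 1, 1))

    def forward(self, clip: torch.Tensor):
        # clip: (B, 3, 16, H, W)
        compressed = self.encoder(clip)            # (B, 3, 8, H, W)
        reconstruction = self.decoder(compressed)  # (B, 3, 16, H, W)
        return compressed, reconstruction


# Training minimizes a reconstruction loss; at inference only the 8-frame
# `compressed` tensor would be passed to the localization model.
model = ClipCompressor()
clip = torch.randn(1, 3, 16, 112, 112)
compressed, recon = model(clip)
loss = nn.functional.mse_loss(recon, clip)
```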
publishDate 2023
dc.date.none.fl_str_mv 2023-12
2023-12-01T00:00:00Z
2024-02-14T11:24:56Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/163514
url http://hdl.handle.net/10362/163514
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138173733830656