Real-time Human Action Localization in the Wild
Main author: | Pereira, João Alexandre Cardeira |
---|---|
Publication date: | 2023 |
Document type: | Dissertation |
Language: | eng |
Source title: | Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
Full text: | http://hdl.handle.net/10362/163514 |
Abstract: | Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Typical applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computational resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the defined baselines and approaching the performance of the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip. |
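
The two selection ideas described in the abstract can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the thresholds, the use of simple frame differencing as a stand-in for the background subtraction method, and the rule that busier clips receive the larger input are all assumptions made for illustration. The per-frame scores are taken as given, since the scoring network itself is not described in this record.

```python
from typing import List, Sequence

Frame = List[List[float]]  # grayscale frame as rows of pixel intensities


def select_top_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and the number of frames")
    # Rank frame indices by score, highest first; ties keep the earlier frame.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    # Re-sort the winners so the reduced clip keeps its original frame order
    # (an assumption; the record does not say how selected frames are ordered).
    return sorted(ranked[:k])


def motion_ratio(prev: Frame, curr: Frame, pixel_threshold: float = 25.0) -> float:
    """Fraction of pixels whose intensity changed by more than pixel_threshold."""
    changed = 0
    total = 0
    for row_prev, row_curr in zip(prev, curr):
        for a, b in zip(row_prev, row_curr):
            total += 1
            if abs(a - b) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def choose_input_variant(clip: Sequence[Frame], complexity_threshold: float = 0.2) -> str:
    """Pick 'full' for clips with busy backgrounds and 'reduced' otherwise.

    Complexity is approximated here by the mean fraction of changed pixels
    between consecutive frames (hypothetical measure and threshold).
    """
    ratios = [motion_ratio(a, b) for a, b in zip(clip, clip[1:])]
    complexity = sum(ratios) / len(ratios) if ratios else 0.0
    return "full" if complexity > complexity_threshold else "reduced"


if __name__ == "__main__":
    print(select_top_frames([0.1, 0.9, 0.3, 0.8, 0.2], k=2))
    still = [[0.0, 0.0], [0.0, 0.0]]
    bright = [[100.0, 100.0], [100.0, 100.0]]
    print(choose_input_variant([still, still]))
    print(choose_input_variant([still, bright]))
```

In a pipeline of this kind, the selector would run before the localization model, so its cost has to stay well below the time saved by the smaller input; that is why the sketch uses only per-pixel differences rather than a learned model.
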
id |
RCAP_2fd886e2162d857bd3b039ab406c6cac |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/163514 |
network_acronym_str |
RCAP |
network_name_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository_id_str |
7160 |
dc.title.none.fl_str_mv |
Real-time Human Action Localization in the Wild |
title |
Real-time Human Action Localization in the Wild |
spellingShingle |
Real-time Human Action Localization in the Wild; Pereira, João Alexandre Cardeira; Spatiotemporal Action Localization; Online Action Localization; Frame Selection; Autoencoders; Background Subtraction; Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering |
title_short |
Real-time Human Action Localization in the Wild |
title_full |
Real-time Human Action Localization in the Wild |
title_fullStr |
Real-time Human Action Localization in the Wild |
title_full_unstemmed |
Real-time Human Action Localization in the Wild |
title_sort |
Real-time Human Action Localization in the Wild |
author |
Pereira, João Alexandre Cardeira |
author_facet |
Pereira, João Alexandre Cardeira |
author_role |
author |
dc.contributor.none.fl_str_mv |
Neves, João; Semedo, David; RUN |
dc.contributor.author.fl_str_mv |
Pereira, João Alexandre Cardeira |
dc.subject.por.fl_str_mv |
Spatiotemporal Action Localization; Online Action Localization; Frame Selection; Autoencoders; Background Subtraction; Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering |
topic |
Spatiotemporal Action Localization; Online Action Localization; Frame Selection; Autoencoders; Background Subtraction; Scientific Domain/Area::Engineering and Technology::Electrical, Electronic and Computer Engineering |
description |
Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Typical applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computational resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the defined baselines and approaching the performance of the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip. |
publishDate |
2023 |
dc.date.none.fl_str_mv |
2023-12; 2023-12-01T00:00:00Z; 2024-02-14T11:24:56Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/masterThesis |
format |
masterThesis |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/163514 |
url |
http://hdl.handle.net/10362/163514 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos); instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação; instacron:RCAAP |
instname_str |
Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
collection |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) |
repository.name.fl_str_mv |
Repositório Científico de Acesso Aberto de Portugal (Repositórios Cientìficos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação |
repository.mail.fl_str_mv |
|
_version_ |
1799138173733830656 |