Real-time Human Action Localization in the Wild

Bibliographic details
Main author: Pereira, João Alexandre Cardeira
Publication date: 2023
Document type: Dissertation
Language: eng
Source title: Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
Full text: http://hdl.handle.net/10362/163514
Abstract: Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
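To make the first contribution concrete, the following is a minimal sketch of how an input variant could be chosen from a cheap background-complexity estimate, in the spirit of what the abstract describes. It is not the thesis implementation: the frame-differencing proxy for background subtraction, the 0.05 threshold, and the two variant shapes (every second frame at 112×112 versus the full clip at 224×224) are illustrative assumptions.

```python
# Minimal sketch (not the thesis implementation): choose between two input
# variants for a spatiotemporal action localization model based on a cheap
# background-complexity estimate of the clip. Threshold, variant shapes and
# the frame-differencing proxy are illustrative assumptions.
import numpy as np
import cv2


def background_complexity(clip: np.ndarray) -> float:
    """Fraction of pixels that change between consecutive frames.

    clip: uint8 array of shape (T, H, W, 3). Simple frame differencing
    stands in here for the lightweight background subtraction method.
    """
    gray = np.stack([cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in clip])
    diffs = np.abs(gray[1:].astype(np.int16) - gray[:-1].astype(np.int16))
    return float((diffs > 25).mean())  # share of "foreground" pixels


def select_input_variant(clip: np.ndarray, threshold: float = 0.05) -> np.ndarray:
    """Return a reduced variant for simple backgrounds, the full one otherwise."""
    if background_complexity(clip) < threshold:
        # Simple background: a shorter, smaller clip is assumed to be enough.
        reduced = clip[::2]  # keep every second frame
        return np.stack([cv2.resize(f, (112, 112)) for f in reduced])
    # Complex background: keep the full-size input for the model.
    return np.stack([cv2.resize(f, (224, 224)) for f in clip])
```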
id RCAP_2fd886e2162d857bd3b039ab406c6cac
oai_identifier_str oai:run.unl.pt:10362/163514
network_acronym_str RCAP
network_name_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository_id_str 7160
spelling Real-time Human Action Localization in the Wild
Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
Neves, João
Semedo, David
RUN
Pereira, João Alexandre Cardeira
2024-02-14T11:24:56Z
2023-12
2023-12-01T00:00:00Z
info:eu-repo/semantics/publishedVersion
info:eu-repo/semantics/masterThesis
application/pdf
http://hdl.handle.net/10362/163514
eng
info:eu-repo/semantics/openAccess
reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
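The frame-selection method summarised in the abstract (scoring every frame of the clip with a lightweight convolutional network and keeping the top-scoring ones as input) could look roughly like the sketch below. The tiny architecture, the scalar score head and k = 8 are assumptions made for illustration, not the network used in the thesis.

```python
# Minimal sketch (hypothetical, not the thesis code): score each frame of a
# clip with a small CNN and keep the top-k frames as model input.
import torch
import torch.nn as nn


class FrameScorer(nn.Module):
    """Tiny per-frame scorer: two conv layers followed by a scalar head."""

    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (T, 3, H, W) -> one score per frame, shape (T,)
        feats = self.features(frames).flatten(1)
        return self.head(feats).squeeze(-1)


def select_top_frames(clip: torch.Tensor, scorer: FrameScorer, k: int = 8) -> torch.Tensor:
    """Keep the k highest-scoring frames, preserving temporal order."""
    with torch.no_grad():
        scores = scorer(clip)                   # (T,)
    idx = scores.topk(k).indices.sort().values  # chronological order
    return clip[idx]                            # (k, 3, H, W)
```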
dc.title.none.fl_str_mv Real-time Human Action Localization in the Wild
title Real-time Human Action Localization in the Wild
spellingShingle Real-time Human Action Localization in the Wild
Pereira, João Alexandre Cardeira
Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
title_short Real-time Human Action Localization in the Wild
title_full Real-time Human Action Localization in the Wild
title_fullStr Real-time Human Action Localization in the Wild
title_full_unstemmed Real-time Human Action Localization in the Wild
title_sort Real-time Human Action Localization in the Wild
author Pereira, João Alexandre Cardeira
author_facet Pereira, João Alexandre Cardeira
author_role author
dc.contributor.none.fl_str_mv Neves, João
Semedo, David
RUN
dc.contributor.author.fl_str_mv Pereira, João Alexandre Cardeira
dc.subject.por.fl_str_mv Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
topic Spatiotemporal Action Localization
Online Action Localization
Frame Selection
Autoencoders
Background Subtraction
Domínio/Área Científica::Engenharia e Tecnologia::Engenharia Eletrotécnica, Eletrónica e Informática
description Action Localization has been emerging as an important field of research, aiming to temporally and spatially distinguish human actions from video data. Due to the large amounts of video content available, deep convolutional networks have become the most commonly used models in Action Localization research. Usual applications for Action Localization algorithms (e.g. camera surveillance) tend to offer sub-optimal conditions for deploying these complex models, often having limited computer resources and requiring fast inference speed for online use. Therefore, research has been evolving towards developing faster and more efficient solutions. Parallel to this, the field of Frame Selection has been showing promising results, improving the performance and efficiency of action recognition models. Therefore, this thesis hypothesises that these techniques could be applied to Action Localization models to achieve the goal of efficient Spatiotemporal Action Localization. We conduct a study on the impact of reducing the size of the input for the YOWO [38] model on the UCF101-24 dataset [62]. This study leads to the definition of a function that selects one of two input variants based on the background complexity of the input clip through an extremely lightweight and fast background subtraction method. We improve the model’s inference speed by 73% while maintaining good performance, surpassing the baselines defined and approximating the reference YOWO model. Influenced by the work in Wang et al. [73], we also introduce a method that calculates the individual scores of each clip frame using a lightweight convolutional network and selects the top-scoring frames for the input. This method has also reduced the inference time of the model, achieving an improvement of 58%. Finally, we propose an Autoencoder-based solution that outputs a representation of the original clip using fewer frames without disregarding any frame of the clip.
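The Autoencoder-based contribution, which produces a representation of the clip with fewer frames while still drawing on every input frame, might be sketched as follows. The single temporally strided Conv3d encoder, the 16-to-8 frame ratio and the MSE reconstruction objective are assumptions for illustration, not the architecture proposed in the dissertation.

```python
# Minimal sketch (an assumption about the general idea, not the thesis
# architecture): a 3D-convolutional encoder compresses a 16-frame clip into
# an 8-frame representation that still depends on every input frame; the
# decoder is only used during training to reconstruct the original clip.
import torch
import torch.nn as nn


class ClipCompressor(nn.Module):
    def __init__(self):
        super().__init__()
        # Temporal stride 2 halves the number of frames (16 -> 8) while the
        # spatial resolution is preserved.
        self.encoder = nn.Conv3d(3, 3, kernel_size=(3, 3, 3),
                                 stride=(2, 1, 1), padding=(1, 1, 1))
        self.decoder = nn.ConvTranspose3d(3, 3, kernel_size=(4, 3, 3),
                                          stride=(2, 1, 1), padding=(1, 1, 1))

    def forward(self, clip: torch.Tensor):
        # clip: (B, 3, 16, H, W)
        compressed = self.encoder(clip)            # (B, 3, 8, H, W)
        reconstruction = self.decoder(compressed)  # (B, 3, 16, H, W)
        return compressed, reconstruction


# Training minimizes a reconstruction loss; at inference only the 8-frame
# `compressed` tensor would be passed to the localization model.
model = ClipCompressor()
clip = torch.randn(1, 3, 16, 112, 112)
compressed, recon = model(clip)
loss = nn.functional.mse_loss(recon, clip)
```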
publishDate 2023
dc.date.none.fl_str_mv 2023-12
2023-12-01T00:00:00Z
2024-02-14T11:24:56Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/masterThesis
format masterThesis
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/163514
url http://hdl.handle.net/10362/163514
dc.language.iso.fl_str_mv eng
language eng
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
instname:Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron:RCAAP
instname_str Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
instacron_str RCAAP
institution RCAAP
reponame_str Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
collection Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos)
repository.name.fl_str_mv Repositório Científico de Acesso Aberto de Portugal (Repositórios Científicos) - Agência para a Sociedade do Conhecimento (UMIC) - FCT - Sociedade da Informação
repository.mail.fl_str_mv
_version_ 1799138173733830656