
Author:

Zhang, C. | Ren, K. | Bian, Q. | Shi, Y.

Indexed by:

Scopus

Abstract:

This paper focuses on improving the efficiency of action recognition frameworks by streamlining their complicated feature extraction pipelines and enhancing explainability, benefiting future adaptation to more complex visual understanding tasks (e.g., video captioning). To this end, we propose HSAR, a novel decoupled two-stream framework for action recognition that utilizes high-semantic features for increased efficiency and provides well-founded explanations in terms of spatial-temporal perceptions, which will benefit further extensions to visual understanding tasks. The inputs are decoupled into spatial and temporal streams with designated encoders that extract only the pinnacle of representations, yielding high-semantic features while greatly reducing computation costs. A lightweight Temporal Motion Transformer (TMT) module is proposed to model temporal features globally through self-attention, omitting redundant spatial features. The decoupled spatial-temporal embeddings are then merged dynamically by an attention fusion model to form a joint high-semantic representation. Visualizing the attention in each module offers intuitive interpretations of HSAR's explainability. Extensive experiments on three widely used benchmarks (Kinetics-400, Kinetics-600, and Something-Something V2) show that our framework achieves high prediction accuracy with significantly reduced computation (only 64.07 GFLOPs per clip), offering an excellent trade-off between accuracy and computational cost.  © 2023 ACM.
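The pipeline described in the abstract — decoupled spatial and temporal streams, global temporal self-attention, and dynamic attention fusion — can be illustrated with a minimal NumPy sketch. This is a conceptual illustration only, not the authors' implementation: the per-frame feature inputs, random weight initialization, mean-pooling, and scalar fusion gates are all simplifying assumptions made here; HSAR's actual encoders, TMT module, and fusion model are more elaborate.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over the time axis (TMT-style sketch).
    x: (T, d) sequence of per-frame features."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T, T) temporal attention map
    return attn @ v, attn

def hsar_sketch(frames, rng):
    """frames: (T, d) hypothetical high-semantic per-frame features."""
    T, d = frames.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    # Spatial stream: pool appearance features over time.
    spatial = frames.mean(axis=0)                       # (d,)
    # Temporal stream: global self-attention over frames, then pool.
    temporal_seq, attn = temporal_self_attention(frames, Wq, Wk, Wv)
    temporal = temporal_seq.mean(axis=0)                # (d,)
    # Attention fusion (toy version): softmax gates weight the two streams.
    gates = softmax(np.array([spatial.mean(), temporal.mean()]))
    joint = gates[0] * spatial + gates[1] * temporal    # joint representation, (d,)
    return joint, attn, gates

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 16))   # 8 frames, 16-dim features (toy sizes)
joint, attn, gates = hsar_sketch(frames, rng)
print(joint.shape, attn.shape, gates)
```

The (T, T) attention map returned here is the kind of artifact the paper visualizes to interpret which frames drive the temporal stream's decision.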

Keyword:

High-Semantics Action Recognition; Computer Vision; Explainable AI; Machine Learning; Decoupled Feature Extraction

Author Community:

  • [ 1 ] [Zhang C.]Faculty of Information Technology, Beijing University of Technology, Beijing, China
  • [ 2 ] [Ren K.]Faculty of Information Technology, Beijing University of Technology, Beijing, China
  • [ 3 ] [Bian Q.]Faculty of Information Technology, Beijing University of Technology, Beijing, China
  • [ 4 ] [Shi Y.]Faculty of Information Technology, Beijing University of Technology, Beijing, China

Year: 2023

Page: 262-271

Language: English

Cited Count:

WoS CC Cited Count: 0

ESI Highly Cited Papers on the List: 0