Abstract:
This paper focuses on improving the efficiency of action recognition frameworks by optimizing their complicated feature extraction pipelines and enhancing explainability, benefiting future adaptation to more complex visual understanding tasks (e.g., video captioning). To this end, we propose HSAR, a novel decoupled two-stream framework for action recognition that exploits high-semantic features for increased efficiency and provides well-founded explanations of its spatial-temporal perception, which will benefit further extensions to visual understanding tasks. The inputs are decoupled into spatial and temporal streams with dedicated encoders designed to extract only the most essential representations, yielding high-semantic features while greatly reducing computation costs. A lightweight Temporal Motion Transformer (TMT) module is proposed to model temporal features globally through self-attention, omitting redundant spatial features. The decoupled spatial and temporal embeddings are then merged dynamically by an attention fusion module to form a joint high-semantic representation. Visualizing the attention in each module offers intuitive interpretations that demonstrate HSAR's explainability. Extensive experiments on three widely used benchmarks (Kinetics-400, Kinetics-600, and Something-Something V2) show that our framework achieves high prediction accuracy with significantly reduced computation (only 64.07 GFLOPs per clip), offering a good trade-off between accuracy and computational cost. © 2023 ACM.
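Note: the decoupled design described in the abstract (a spatial stream, a temporal stream modeled globally by a self-attention Temporal Motion Transformer, and an attention-based fusion of the two embeddings) can be illustrated with the minimal PyTorch sketch below. This is only a rough illustration under assumed encoders, dimensions, and fusion details; class names other than TMT (HSARSketch, AttentionFusion, frame_encoder) are hypothetical and do not reflect the authors' actual HSAR implementation.

import torch
import torch.nn as nn

class TemporalMotionTransformer(nn.Module):
    # Lightweight transformer that models global temporal relations
    # across per-frame embeddings via self-attention (sketch of the TMT idea).
    def __init__(self, dim=512, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=2 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_tokens):                     # (B, T, dim)
        return self.encoder(frame_tokens).mean(dim=1)    # (B, dim)

class AttentionFusion(nn.Module):
    # Dynamically weights the spatial and temporal embeddings before merging
    # them into a joint representation (assumed fusion scheme).
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, spatial_emb, temporal_emb):                  # (B, dim) each
        stacked = torch.stack([spatial_emb, temporal_emb], dim=1)  # (B, 2, dim)
        weights = torch.softmax(self.score(stacked), dim=1)        # (B, 2, 1)
        return (weights * stacked).sum(dim=1)                      # (B, dim)

class HSARSketch(nn.Module):
    def __init__(self, dim=512, num_classes=400):
        super().__init__()
        # Placeholder per-frame encoder; the paper presumably uses a stronger
        # pretrained backbone for the spatial stream.
        self.frame_encoder = nn.Linear(3 * 224 * 224, dim)
        self.tmt = TemporalMotionTransformer(dim)
        self.fusion = AttentionFusion(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                            # (B, T, 3, 224, 224)
        tokens = self.frame_encoder(video.flatten(2))    # (B, T, dim)
        spatial_emb = tokens.mean(dim=1)                 # spatial stream embedding
        temporal_emb = self.tmt(tokens)                  # temporal stream embedding
        return self.head(self.fusion(spatial_emb, temporal_emb))

logits = HSARSketch()(torch.randn(2, 8, 3, 224, 224))    # (2, 400) class logits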
Year: 2023
Page: 262-271
Language: English