Abstract:
Recently, Convolutional Networks (ConvNets) have become the dominant approach to the human activity classification problem. We investigate current standard ConvNet architectures and pinpoint one of their main limitations: spatial-temporal dependency is captured only by a global pooling operation, which may not adequately capture the complex long-term spatial-temporal relationships in videos. In this work, we propose a Spatial Temporal Attentional Glimpse (STAG) module to overcome this shortcoming. Specifically, the input to the STAG module is a 3D tensor, which is first processed by a spatial-temporal attention block. A Spatial Temporal Glimpse block then decomposes the resulting tensor into two low-dimensional tensors and fuses the results of operating on them. The proposed STAG module is pluggable, easy to learn, and computationally efficient. We conduct extensive ablation studies to show that our model, when equipped with the STAG block, substantially improves performance over the state of the art. All experimental results, trained models, and the complete source code will be released to facilitate further studies on this problem.
Source:
2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP)
ISSN: 1522-4880
Year: 2019
Page: 4040-4044
Language: English
WoS CC Cited Count: 0