Abstract:
Motion estimation in videos primarily concerns global and local motion, which originate from different subjects but are mixed together in video frames. In most scenarios, such as action recognition, global and local motion must be estimated separately to obtain an accurate motion representation. Estimating both simultaneously is challenging because ground-truth labels are lacking. In this work, we address these issues with an end-to-end two-stream network for global and local motion estimation. The network uses the mixed motion as supervision, employs a Transformer-based attention mechanism, and adopts a two-stage training strategy so that the two motions enhance each other during training. Additionally, we introduce a motion-based feature decoder for the global stream and a SIR mask that removes scene-irrelevant regions for the local stream. We verify the effectiveness of our method on the deep homography estimation dataset DHE, the action recognition dataset UCF-101, and the group activity recognition dataset NCAA. Results demonstrate improved performance over previous methods in regular scenes and recognition tasks. © 2023 IEEE.
Year: 2023
Page: 328-334
Language: English
SCOPUS Cited Count: 1
ESI Highly Cited Papers on the List: 0