Abstract:
Traffic video question answering (TrafficVQA) is a specialized VideoQA task designed to probe both basic comprehension and intricate reasoning over videos, specifically focusing on traffic events. Recent VideoQA models employ pretrained visual and textual encoders to bridge the feature-space gap between visual and textual data. However, the TrafficVQA task poses three pivotal challenges: (i) Dimension Gap: the pretrained image (appearance feature) and video (motion feature) models produce static and dynamic visual features of conspicuously different dimensions; (ii) Scene Gap: common real-world datasets and traffic event datasets differ in visual scene content; (iii) Modality Gap: a pronounced feature-distribution discrepancy exists between traffic video and text data. To alleviate these challenges, we introduce the coarse-fine multimodal contrastive alignment network (CFMMC-Align). The model leverages sequence-level and token-level multimodal features, grounded in an unsupervised visual multimodal contrastive loss that mitigates the dimension and scene gaps and a supervised visual-textual contrastive loss that alleviates the modality discrepancy. Finally, the model is validated on the challenging public TrafficVQA dataset SUTD-TrafficQA, where it outperforms the state-of-the-art method by a substantial margin (50.2% compared with 46.0%). The code is available at https://github.com/guokan987/CFMMC-Align.
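The authors' implementation is at the repository above; purely as a rough, hypothetical sketch of the two contrastive objectives the abstract describes (not the paper's actual code), the following PyTorch snippet pairs an unsupervised appearance-motion contrastive loss with a supervised visual-textual one. All feature dimensions, the projection heads, and the fusion-by-averaging step are assumptions for illustration:

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs (a_i, b_i) are positives,
    all other pairs in the batch serve as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical projection heads mapping appearance (e.g., 2048-d image-CNN
# features), motion (e.g., 1024-d video-CNN features), and text features
# into one shared 256-d space -- one standard way to bridge the
# "dimension gap" between pretrained encoders.
proj_app = torch.nn.Linear(2048, 256)
proj_mot = torch.nn.Linear(1024, 256)
proj_txt = torch.nn.Linear(768, 256)

app = proj_app(torch.randn(8, 2048))   # appearance features (batch of 8 clips)
mot = proj_mot(torch.randn(8, 1024))   # motion features of the same clips
txt = proj_txt(torch.randn(8, 768))    # paired question/answer text features

# Unsupervised visual contrastive loss: align the appearance and motion
# views of the same clip (dimension and scene gaps).
loss_visual = info_nce(app, mot)

# Supervised visual-textual contrastive loss: align fused visual features
# with their paired text (modality gap). Averaging as fusion is an
# assumption of this sketch.
fused = 0.5 * (app + mot)
loss_cross = info_nce(fused, txt)

loss = loss_visual + loss_cross

Projecting all three streams into one shared space before computing similarities is the conventional way such InfoNCE-style losses handle mismatched feature dimensions.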
Source: IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Year: 2024
Issue: 11
Volume: 34
Page: 1-1
Impact Factor: 8.400 (JCR@2022)
SCOPUS Cited Count: 5
ESI Highly Cited Papers on the List: 0