Abstract:
Traffic video question answering (TrafficVQA) is a specialized VideoQA task designed to probe both basic comprehension and intricate reasoning over videos, specifically focusing on traffic events. Recent VideoQA models employ pretrained visual and textual encoders to bridge the feature-space gap between visual and textual data. However, the TrafficVQA task poses three pivotal challenges: (i) Dimension Gap: the pretrained image (appearance feature) and video (motion feature) models produce static and dynamic visual features of conspicuously different dimensions; (ii) Scene Gap: common real-world datasets and traffic event datasets differ in visual scene content; (iii) Modality Gap: a pronounced feature-distribution discrepancy exists between traffic video and text data. To alleviate these challenges, we introduce the coarse-fine multimodal contrastive alignment network (CFMMC-Align). The model leverages sequence-level and token-level multimodal features, grounded in an unsupervised visual multimodal contrastive loss that mitigates the dimension and scene gaps and a supervised visual-textual contrastive loss that alleviates the modality discrepancy. Finally, the model is validated on the challenging public TrafficVQA dataset SUTD-TrafficQA, where it outperforms the state-of-the-art method by a substantial margin (50.2% compared with 46.0%). The code is available at https://github.com/guokan987/CFMMC-Align.
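The authors' implementation is at the repository above; purely as a rough, hypothetical sketch of the two contrastive objectives the abstract describes (not the paper's actual code), the following PyTorch snippet pairs an unsupervised appearance-motion contrastive loss with a supervised visual-textual one. All feature dimensions, the projection heads, and the fusion-by-averaging step are assumptions for illustration:

import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss: matched pairs (a_i, b_i) are positives,
    all other pairs in the batch serve as negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical projection heads mapping appearance (e.g., 2048-d image-CNN
# features), motion (e.g., 1024-d video-CNN features), and text features
# into one shared 256-d space -- one standard way to bridge the
# "dimension gap" between pretrained encoders.
proj_app = torch.nn.Linear(2048, 256)
proj_mot = torch.nn.Linear(1024, 256)
proj_txt = torch.nn.Linear(768, 256)

app = proj_app(torch.randn(8, 2048))   # appearance features (batch of 8 clips)
mot = proj_mot(torch.randn(8, 1024))   # motion features of the same clips
txt = proj_txt(torch.randn(8, 768))    # paired question/answer text features

# Unsupervised visual contrastive loss: align the appearance and motion
# views of the same clip (dimension and scene gaps).
loss_visual = info_nce(app, mot)

# Supervised visual-textual contrastive loss: align fused visual features
# with their paired text (modality gap). Averaging as fusion is an
# assumption of this sketch.
fused = 0.5 * (app + mot)
loss_cross = info_nce(fused, txt)

loss = loss_visual + loss_cross

Projecting all three streams into one shared space before computing similarities is the conventional way such InfoNCE-style losses handle mismatched feature dimensions.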
Source: IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Year: 2024
Issue: 11
Volume: 34
Page: 1-1
Impact Factor: 8.400 (JCR@2022)
SCOPUS Cited Count: 5
ESI Highly Cited Papers on the List: 0