See and Learn More: Dense Caption-Aware Representation for Visual Question Answering

Author：

Bi, Yandong | Jiang, Huajie | Hu, Yongli | Sun, Yanfeng | Yin, Baocai

Indexed by：

EI Scopus SCIE

Abstract：

With the rapid development of deep learning models, great improvements have been achieved in the Visual Question Answering (VQA) field. However, modern VQA models, even the best pre-trained ones, are easily affected by language priors: they ignore image information and learn superficial relationships between questions and answers. The main reason is that visual information is not fully extracted and utilized, which results, to a certain extent, in a domain gap between the vision and language modalities. To mitigate this problem, we propose to extract dense captions (auxiliary semantic information) from images to enhance the visual information for reasoning, and to use them to narrow the gap between vision and language, since the dense captions and the questions come from the same language modality (i.e., phrases or sentences). In this paper, we propose a novel dense caption-aware visual question answering model called DenseCapBert to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism to fuse dense captions, images, and questions in a unified framework, which makes VQA models more robust. Experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions help improve model generalization and that our model effectively mitigates the language bias problem.

Keyword：

dense caption | Cognition | Visual question answering | Question answering (information retrieval) | Semantics | Feature extraction | Detectors | language prior | cross-modal fusion | Data mining | Visualization

Author Community：

[ 1 ] [Bi, Yandong]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[ 2 ] [Jiang, Huajie]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[ 3 ] [Hu, Yongli]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[ 4 ] [Sun, Yanfeng]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
[ 5 ] [Yin, Baocai]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China

Reprint Author's Address：

[Jiang, Huajie]Beijing Univ Technol, Beijing Inst Artificial Intelligence, Fac Informat Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China

Email：

biyandong@emails.bjut.edu.cn |
jianghj@bjut.edu.cn |
huyongli@bjut.edu.cn |
yfsun@bjut.edu.cn |
ybc@bjut.edu.cn

Related Articles：

DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks
2024, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Counterfactual Dual-Bias VQA: A Multimodality Debias Learning for Robust Visual Question Answering
2025, IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
DSGEM: Dual scene graph enhancement module-based visual question answering
2023, IET COMPUTER VISION
Bridging the Cross-Modality Semantic Gap in Visual Question Answering
2024, IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Source ：

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

ISSN： 1051-8215

Year： 2024

Issue： 2

Volume： 34

Page： 1135-1146

Impact Factor： 8.400 (JCR@2022)

ESI Highly Cited Papers on the List： 0
