See and Learn More: Dense Caption-Aware Representation for Visual Question Answering

Author：

Bi, Y. | Jiang, H. | Hu, Y. | Sun, Y. | Yin, B.

Indexed by：

Scopus

Abstract：

With the rapid development of deep learning models, great improvements have been achieved in the Visual Question Answering (VQA) field. However, modern VQA models, even the best pre-trained ones, are easily affected by language priors: they ignore image information and learn superficial relationships between questions and answers. The main reason is that visual information is not fully extracted and utilized, which results, to a certain extent, in a domain gap between the vision and language modalities. To mitigate this problem, we propose to extract dense captions (auxiliary semantic information) from images to enhance the visual information available for reasoning, and to use them to reduce the gap between vision and language, since the dense captions and the questions come from the same language modality (i.e., phrases or sentences). In this paper, we propose a novel dense caption-aware visual question answering model called DenseCapBert to enhance visual reasoning. Specifically, we generate dense captions for the images and propose a multimodal interaction mechanism to fuse dense captions, images, and questions in a unified framework, which makes VQA models more robust. Experimental results on the GQA, GQA-OOD, VQA v2, and VQA-CP v2 datasets show that dense captions help improve model generalization and that our model effectively mitigates the language bias problem.

Keyword：

cross-modal fusion; Visual question answering; Semantics; Feature extraction; language prior; Data mining; Question answering (information retrieval); dense caption; Detectors; Cognition; Visualization

Author Affiliations：

[1] [Bi Y.] Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China
[2] [Jiang H.] Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China
[3] [Hu Y.] Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China
[4] [Sun Y.] Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China
[5] [Yin B.] Beijing Institute of Artificial Intelligence, Faculty of Information Technology, Beijing Key Laboratory of Multimedia and Intelligent Software Technology, Beijing University of Technology, Beijing, China

Related Articles：

See and Learn More: Dense Caption-Aware Representation for Visual Question Answering
2024，IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
DHHG-TAC: Fusion of Dynamic Heterogeneous Hypergraphs and Transformer Attention Mechanism for Visual Question Answering Tasks
2024，IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS
Counterfactual Dual-Bias VQA: A Multimodality Debias Learning for Robust Visual Question Answering
2025，IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS
DSGEM: Dual scene graph enhancement module-based visual question answering
2023，IET COMPUTER VISION
Bridging the Cross-Modality Semantic Gap in Visual Question Answering
2024，IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

Source ：

IEEE Transactions on Circuits and Systems for Video Technology

ISSN： 1051-8215

Year： 2023

Issue： 2

Volume： 34

Page： 1-1

Impact Factor： 8.400 (JCR@2022)

ESI HC Threshold：19

Cited Count：

SCOPUS Cited Count： 13

ESI Highly Cited Papers on the List： 0
