Abstract:
As a prevailing cross-modal reasoning task, Visual Question Answering (VQA) has achieved impressive progress in recent years, and language bias has been widely studied to learn more robust VQA models. However, visual bias, which also affects the robustness of VQA models, is seldom considered, resulting in weak inference ability. Therefore, balancing the effects of language bias and visual bias has become essential in the current VQA task. In this paper, we devise a new reweighting strategy that takes both language bias and visual bias into account, and propose a Fair Attention Network for Robust Visual Question Answering (FAN-VQA). It first constructs a question bias branch and a visual bias branch to estimate the bias information from the two modalities, which is used to judge the importance of samples. Then, adaptive importance weights are learned from the bias information and assigned to the candidate answers to adjust the training losses, enabling the model to shift more attention to difficult samples that need less-salient visual clues to infer the correct answer. To improve the robustness of the VQA model, we design a progressive strategy to balance the influence of the original training loss and the adjusted training loss. Extensive experiments on the VQA-CP v2, VQA v2, and VQA-CE datasets demonstrate the effectiveness of the proposed FAN-VQA method.
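The abstract describes a loss-reweighting scheme driven by two bias branches plus a progressive blend of the original and adjusted losses. The sketch below is not the authors' code; it is a minimal PyTorch illustration of that idea, assuming hypothetical bias-branch outputs (`q_bias_logits`, `v_bias_logits`), an illustrative confidence-based weighting function, and a simple linear schedule for the progressive balance.

```python
# Hedged sketch of the bias-aware loss reweighting described in the abstract.
# All names, shapes, and the weighting/schedule choices are assumptions for
# illustration, not the FAN-VQA implementation.
import torch
import torch.nn.functional as F


def adaptive_weights(q_bias_logits, v_bias_logits, labels):
    """Estimate per-sample importance from the two bias branches.

    Samples that the question-only and vision-only branches already answer
    confidently are treated as 'easy' and down-weighted; samples the biases
    miss keep a higher weight (illustrative choice).
    """
    q_conf = (torch.sigmoid(q_bias_logits) * labels).sum(dim=1)  # question-bias confidence on GT answers
    v_conf = (torch.sigmoid(v_bias_logits) * labels).sum(dim=1)  # visual-bias confidence on GT answers
    return torch.clamp(1.0 - 0.5 * (q_conf + v_conf), min=0.0)


def reweighted_loss(main_logits, q_bias_logits, v_bias_logits, labels,
                    epoch, total_epochs):
    """Progressively blend the original loss with the bias-reweighted loss."""
    per_sample = F.binary_cross_entropy_with_logits(
        main_logits, labels, reduction="none").sum(dim=1)
    w = adaptive_weights(q_bias_logits, v_bias_logits, labels)
    original = per_sample.mean()
    adjusted = (w * per_sample).mean()
    alpha = min(1.0, epoch / total_epochs)  # assumed linear progressive schedule
    return (1.0 - alpha) * original + alpha * adjusted


if __name__ == "__main__":
    # Toy usage: batch of 4 samples, answer vocabulary of 10.
    B, A = 4, 10
    labels = torch.zeros(B, A)
    labels[torch.arange(B), torch.randint(0, A, (B,))] = 1.0
    loss = reweighted_loss(torch.randn(B, A), torch.randn(B, A),
                           torch.randn(B, A), labels, epoch=3, total_epochs=10)
    print(loss.item())
```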
Source: IEEE Transactions on Circuits and Systems for Video Technology
ISSN: 1051-8215
Year: 2024
Issue: 9
Volume: 34
Page: 1-1
Impact Factor: 8.400 (JCR@2022)
Cited Count:
WoS CC Cited Count: 0
SCOPUS Cited Count: 5
ESI Highly Cited Papers on the List: 0