
Author:

Wang, Boyue | Ju, Xiaoqian | Gao, Junbin | Li, Xiaoyan | Hu, Yongli | Yin, Baocai

Indexed by:

Scopus SCIE

Abstract:

Visual question answering (VQA) models often face two language bias challenges. First, they tend to rely solely on the question to predict the answer, overlooking relevant information in the accompanying images. Second, even when considering the question, they may focus only on the wh-words, neglecting other crucial keywords that could improve interpretability and question sensitivity. Existing debiasing methods attempt to address this by training a bias model on question-only inputs to enhance the robustness of the target VQA model. However, this approach may not fully capture the language bias present. In this article, we propose a multimodality counterfactual dual-bias model to mitigate the linguistic bias issue in target VQA models. Our approach designs a shared-parameterized dual-bias model that takes both visual and question counterfactual samples as inputs. In this way, we aim to fully model language biases, with the visual and question counterfactual samples emphasizing, respectively, the important objects and keywords relevant to the answers. To ensure that our dual-bias model behaves similarly to an ordinary model, we freeze the parameters of the target VQA model and train the dual-bias model using cross-entropy and Kullback-Leibler (KL) divergence as the loss function. Subsequently, to mitigate language bias in the target VQA model, we freeze the parameters of the dual-bias model to generate pseudo-labels and incorporate a margin loss to retrain the target VQA model. Experimental results on the VQA-CP datasets demonstrate the superiority of the proposed counterfactual dual-bias model. Additionally, we analyze the unsatisfactory performance on the VQA v2 dataset. The source code of the proposed model is available at https://github.com/Arrow2022jv/MCD.
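The abstract describes a two-stage training procedure: first train the dual-bias model against the frozen target VQA model with cross-entropy plus KL divergence, then freeze the dual-bias model, use its predictions as pseudo-labels, and retrain the target VQA model with a margin loss. A minimal PyTorch sketch of those two stages follows; the model interfaces, the way the two counterfactual views are combined, the margin formulation, and the margin value are assumptions for illustration only, and the authors' actual implementation is the repository linked above.

import torch
import torch.nn.functional as F

def train_dual_bias_step(target_vqa, dual_bias, v, q, v_cf, q_cf, labels, optimizer):
    # Stage 1 (sketch): freeze the target VQA model and train the shared-parameter
    # dual-bias model on visual/question counterfactual samples with cross-entropy
    # plus KL divergence to the target model's predictions.
    # `optimizer` is assumed to be built over dual_bias.parameters().
    for p in target_vqa.parameters():
        p.requires_grad = False
    with torch.no_grad():
        target_logits = target_vqa(v, q)                  # reference distribution
    # one shared model scores both counterfactual views (hypothetical combination)
    bias_logits = dual_bias(v_cf, q) + dual_bias(v, q_cf)
    ce = F.cross_entropy(bias_logits, labels)
    kl = F.kl_div(F.log_softmax(bias_logits, dim=-1),
                  F.softmax(target_logits, dim=-1),
                  reduction="batchmean")
    loss = ce + kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def retrain_target_step(target_vqa, dual_bias, v, q, labels, optimizer, margin=0.2):
    # Stage 2 (sketch): freeze the dual-bias model, take its softmax output as
    # biased pseudo-labels, and retrain the target VQA model with cross-entropy
    # plus a hinge-style margin loss that discourages agreeing with the bias model.
    # `optimizer` is assumed to be built over target_vqa.parameters().
    for p in dual_bias.parameters():
        p.requires_grad = False
    for p in target_vqa.parameters():
        p.requires_grad = True
    with torch.no_grad():
        pseudo = F.softmax(dual_bias(v, q), dim=-1)       # biased pseudo-labels
    logits = target_vqa(v, q)
    probs = F.softmax(logits, dim=-1)
    ce = F.cross_entropy(logits, labels)
    margin_loss = F.relu((probs * pseudo).sum(dim=-1) - margin).mean()
    loss = ce + margin_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Under this reading, the retrained target model is what gets evaluated on VQA-CP, while the dual-bias model serves only to expose language bias during training.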

Keyword:

Visualization; Analytical models; multimodality analysis; Predictive models; Training; Counterfactual samples; Image color analysis; Linguistics; Sensitivity; Reviews; debiasing model; visual question answering (VQA); Question answering (information retrieval); Cognition

Author Community:

  • [ 1 ] [Wang, Boyue]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 2 ] [Ju, Xiaoqian]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 3 ] [Li, Xiaoyan]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 4 ] [Hu, Yongli]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 5 ] [Yin, Baocai]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China
  • [ 6 ] [Gao, Junbin]Univ Sydney, Business Sch, Discipline Business Analyt, Sydney, NSW 2006, Australia

Reprint Author's Address:

  • [Li, Xiaoyan]Beijing Univ Technol, Beijing Artificial Intelligence Inst, Sch Informat Sci & Technol, Beijing Key Lab Multimedia & Intelligent Software, Beijing 100124, Peoples R China


Source:

IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS

ISSN: 2162-237X

Year: 2025

10.400

JCR@2022

ESI Highly Cited Papers on the List: 0
