Abstract:
Accurately locating the question-related regions in a given image is crucial for visual question answering (VQA). Current approaches suffer from two limitations: (1) dividing an image into multiple regions may lose part of the semantic information and the original relationships between regions; (2) choosing only one image region, or all of them, to predict the answer results in either insufficient or redundant information. Therefore, effectively mining the relationships between image regions and selecting the relevant regions are both vital. In this paper, we propose a novel Multi-granularity feature interaction and Multi-region selection-based Triplet VQA model (M2TVQA). To tackle the first limitation, we propose a multi-granularity feature interaction strategy that adaptively supplements the global coarse-granularity features with the regional fine-granularity features. To overcome the second limitation, we design a Top-K learning strategy that adaptively selects the K image regions most relevant to the question, even if the selected regions are spatially far apart. This strategy selects as many relevant image regions as possible while reducing the introduction of noise. Finally, we construct a multi-modality triplet to predict the answer. Extensive experiments on two public outside-knowledge datasets, OK-VQA and KRVQA, verify the effectiveness of the proposed model.
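The Top-K selection idea described in the abstract — ranking image regions by their relevance to the question and keeping only the K best, regardless of spatial position — can be sketched as follows. This is a minimal illustration, not the paper's actual method: the embedding dimensions, the cosine-similarity scoring, and the function name `top_k_regions` are all assumptions for the example.

```python
import numpy as np

def top_k_regions(question_emb, region_embs, k=3):
    """Select the k region features most relevant to the question.

    question_emb: shape (d,) question embedding (assumed precomputed)
    region_embs:  shape (n, d) embeddings of n image regions
    Returns (indices, features) of the k highest-scoring regions.
    """
    # Cosine similarity between the question and every region
    q = question_emb / np.linalg.norm(question_emb)
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    sims = r @ q
    # Keep the k highest-scoring regions; spatial adjacency plays no role,
    # so the selected regions may be far apart in the image.
    top_idx = np.argsort(sims)[::-1][:k]
    return top_idx, region_embs[top_idx]

# Toy example: 5 regions with 4-dim features; the question embedding is a
# slightly perturbed copy of region 2, so region 2 should rank first.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 4))
question = regions[2] + 0.01 * rng.normal(size=4)
idx, feats = top_k_regions(question, regions, k=2)
```

In the actual model the relevance scores would be learned jointly with the rest of the network rather than computed from fixed embeddings, but the selection step has this same top-k shape.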
Source:
IEEE TRANSACTIONS ON BIG DATA
ISSN: 2332-7790
Year: 2025
Issue: 3
Volume: 11
Page: 1346-1356
Impact Factor: 7.200 (JCR@2022)