
Author:

Wang, Boyue | Ma, Yujian | Li, Xiaoyan | Liu, Heng | Hu, Yongli | Yin, Baocai

Indexed by:

EI; Scopus; SCIE

Abstract:

Visual Question Answering (VQA) aims to answer a text question correctly by understanding the image content. Attention-based VQA models mine the implicit relationships between objects according to feature similarity, but neglect the explicit relationships between objects, such as relative position. Most visual scene graph-based VQA models exploit the relative positions or visual relationships between objects to construct the visual scene graph, yet they suffer from the semantic insufficiency of visual edge relations. Moreover, the scene graph of the text modality is often ignored in these works. In this article, a novel Dual Scene Graph Enhancement Module (DSGEM) is proposed that exploits relevant external knowledge to simultaneously construct two interpretable scene graph structures for the image and text modalities, which makes the reasoning process more logical and precise. Specifically, the authors build the visual and textual scene graphs with the help of commonsense knowledge and syntactic structure, respectively, which explicitly endows each edge relation with specific semantics. Then, two scene graph enhancement modules are proposed to propagate the external and structural knowledge involved and explicitly guide the feature interaction between objects (nodes). Finally, the authors embed these two scene graph enhancement modules into existing VQA models to introduce explicit relation reasoning ability. Experimental results on both the VQA V2 and OK-VQA datasets show that the proposed DSGEM is effective and compatible with various VQA architectures.
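For readers unfamiliar with relation-conditioned graph message passing, the sketch below illustrates the general idea behind a scene graph enhancement layer: node features (image regions or question tokens) are updated by messages that are explicitly conditioned on labelled edge relations. This is a minimal, hypothetical PyTorch sketch, not the authors' DSGEM implementation; names such as SceneGraphEnhancer and num_relations are assumptions, and the paper's modules additionally draw edge semantics from commonsense knowledge (visual graph) and syntactic parsing (textual graph).

```python
# Minimal illustrative sketch of a relation-aware scene-graph enhancement layer.
# NOT the authors' code; class and parameter names are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphEnhancer(nn.Module):
    """Propagates information between nodes along explicitly labelled edges,
    so every message is conditioned on the edge's relation type."""

    def __init__(self, dim: int, num_relations: int):
        super().__init__()
        self.rel_emb = nn.Embedding(num_relations, dim)  # one vector per edge label
        self.msg = nn.Linear(2 * dim, dim)               # message from (neighbour, relation)
        self.upd = nn.GRUCell(dim, dim)                  # node update

    def forward(self, nodes, edge_index, edge_type):
        # nodes:      (N, dim) node features (visual regions or question tokens)
        # edge_index: (2, E) source/target node indices
        # edge_type:  (E,) integer edge labels (e.g. "left of", "nsubj", ...)
        src, dst = edge_index
        rel = self.rel_emb(edge_type)                    # (E, dim)
        messages = self.msg(torch.cat([nodes[src], rel], dim=-1))
        # attention over the incoming edges of each target node
        scores = (messages * nodes[dst]).sum(-1) / nodes.size(-1) ** 0.5
        alpha = torch.zeros_like(scores)
        for t in dst.unique():
            mask = dst == t
            alpha[mask] = F.softmax(scores[mask], dim=0)
        agg = torch.zeros_like(nodes).index_add_(0, dst, alpha.unsqueeze(-1) * messages)
        return self.upd(agg, nodes)                      # enhanced node features


if __name__ == "__main__":
    layer = SceneGraphEnhancer(dim=16, num_relations=8)
    x = torch.randn(5, 16)                               # 5 objects / tokens
    edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 4]])   # toy edge list
    types = torch.tensor([0, 3, 3, 7])                   # toy relation labels
    print(layer(x, edges, types).shape)                  # torch.Size([5, 16])
```

In this kind of layer, the enhanced node features can simply replace the original object or token features fed to an existing VQA model, which is consistent with the abstract's claim that the modules can be embedded into various architectures.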

Keyword:

question answering (information retrieval); image representation

Author Community:

  • [ 1 ] [Wang, Boyue]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China
  • [ 2 ] [Ma, Yujian]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China
  • [ 3 ] [Li, Xiaoyan]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China
  • [ 4 ] [Liu, Heng]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China
  • [ 5 ] [Hu, Yongli]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China
  • [ 6 ] [Yin, Baocai]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China

Reprint Author's Address:

  • [Li, Xiaoyan]Beijing Univ Technol, 100 Pingleyuan, Beijing 100124, Peoples R China



Source :

IET COMPUTER VISION

ISSN: 1751-9632

Year: 2023

Issue: 6

Volume: 17

Page: 638-651

1.700 (JCR@2022)

ESI Discipline: COMPUTER SCIENCE

ESI HC Threshold: 19

Cited Count:

WoS CC Cited Count: 1

SCOPUS Cited Count: 1

ESI Highly Cited Papers on the List: 0

