Abstract:
Current mainstream image captioning models are based on the encoder-decoder framework with multi-head attention; they commonly employ grid image features as input and have shown superior performance. However, self-attention in the encoder models only the visual relations among fixed-scale grid features, and the multi-head attention mechanism is not fully exploited to capture diverse information for more effective feature representation, which degrades the quality of the generated captions. To address this problem, we propose a novel Scale-aware Multi-head Information Aggregation (SMIA) model for image captioning. SMIA introduces multi-scale visual features to improve feature representation from the perspective of attention heads. Specifically, a scale expansion algorithm is proposed to extract multi-scale visual features. Then, for the different heads of multi-head attention, different high-scale features are integrated into the fixed low-scale grid features to capture diverse and richer information. In addition, different high-scale features are introduced for the shallow and deep layers of the encoder to further improve feature representation. Moreover, SMIA can be flexibly combined with existing Transformer models to further improve performance. Experimental results on the MS COCO dataset demonstrate the effectiveness of the proposed SMIA. © 2024 Technical Committee on Control Theory, Chinese Association of Automation.
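The following is a minimal, illustrative sketch (not the authors' implementation) of the core idea described in the abstract: in multi-head attention, each head attends over the fixed-scale grid features augmented with a different, coarser "high-scale" feature map. The class name, the per-head scale assignment, and the pooling-based scale expansion are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleAwareMultiHeadAttention(nn.Module):
    """Sketch: different heads attend to grid features fused with different scales."""
    def __init__(self, dim=512, num_heads=8, scales=(1, 2, 4)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scales = scales  # assumed pooling factors, cycled over the heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, grid):  # grid: (B, C, H, W) fixed-scale grid features
        B, C, H, W = grid.shape
        base = grid.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q = self.q_proj(base).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)
        outs = []
        for h in range(self.num_heads):
            # assign this head a coarser scale via average pooling (scale expansion stand-in)
            s = self.scales[h % len(self.scales)]
            pooled = F.avg_pool2d(grid, kernel_size=s) if s > 1 else grid
            extra = pooled.flatten(2).transpose(1, 2)      # (B, N_s, C)
            kv_in = torch.cat([base, extra], dim=1)        # fuse fixed low-scale + high-scale tokens
            k = self.k_proj(kv_in).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)[:, h]
            v = self.v_proj(kv_in).view(B, -1, self.num_heads, self.head_dim).transpose(1, 2)[:, h]
            attn = torch.softmax(q[:, h] @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
            outs.append(attn @ v)                          # (B, H*W, head_dim)
        return self.out_proj(torch.cat(outs, dim=-1))      # (B, H*W, C)

# toy usage with 7x7 grid features
x = torch.randn(2, 512, 7, 7)
print(ScaleAwareMultiHeadAttention()(x).shape)  # torch.Size([2, 49, 512])
```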
ISSN: 1934-1768
Year: 2024
Page: 8771-8777
Language: English