Abstract:
Multi-head attention (MA), which allows the model to jointly attend to crucial information from diverse representation subspaces through its heads, has yielded remarkable achievements in image captioning. However, there is no explicit mechanism to ensure that MA attends to appropriate positions in the diverse subspaces, resulting in overfocused attention within each head and redundancy between heads. In this paper, we propose a novel Intra- and Inter-Head Orthogonal Attention (I²OA) to efficiently improve MA in image captioning by introducing a concise orthogonal regularization to the heads. Specifically, Intra-Head Orthogonal Attention enhances the attention learning of MA by imposing an orthogonal constraint within each head, which decentralizes object-centric attention into more comprehensive content-aware attention. Inter-Head Orthogonal Attention reduces redundancy between heads by applying an orthogonal constraint across heads, which enlarges the diversity of the representation subspaces and improves the representation ability of MA. Moreover, the proposed I²OA can be flexibly combined with various multi-head attention based image captioning methods and improves their performance without increasing model complexity or parameter count. Experiments on the MS COCO dataset demonstrate the effectiveness of the proposed model.
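The abstract describes the two constraints only at a high level. As a rough illustration, the following is a minimal, hypothetical PyTorch sketch of intra-head and inter-head orthogonality penalties computed on attention maps; the function name, tensor shapes, and exact penalty forms are assumptions for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def orthogonal_attention_penalties(attn):
    # attn: softmax attention weights of shape (batch, heads, queries, keys).
    # Returns (intra_loss, inter_loss) as scalar tensors.
    b, h, q, k = attn.shape

    # Intra-head term: penalize overlap between the attention distributions of
    # different query positions within the same head (off-diagonal entries of
    # the per-head row Gram matrix), spreading attention over more content.
    row_gram = torch.matmul(attn, attn.transpose(-1, -2))        # (b, h, q, q)
    eye_q = torch.eye(q, device=attn.device).expand(b, h, q, q)
    intra_loss = ((row_gram * (1.0 - eye_q)) ** 2).mean()

    # Inter-head term: penalize similarity between the whole attention maps of
    # different heads (off-diagonal entries of the head Gram matrix),
    # encouraging heads to cover distinct representation subspaces.
    flat = F.normalize(attn.reshape(b, h, q * k), dim=-1)        # (b, h, q*k)
    head_gram = torch.matmul(flat, flat.transpose(-1, -2))       # (b, h, h)
    eye_h = torch.eye(h, device=attn.device).expand(b, h, h)
    inter_loss = ((head_gram * (1.0 - eye_h)) ** 2).mean()

    return intra_loss, inter_loss

In a setup of this kind, the two penalties would simply be added with small weights to the captioning loss of any multi-head attention model; since they act only on the attention weights, no extra parameters are introduced, consistent with the abstract's claim of unchanged model complexity.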
Source:
IEEE TRANSACTIONS ON IMAGE PROCESSING
ISSN: 1057-7149
Year: 2025
Volume: 34
Page: 594-607
Impact Factor: 10.600 (JCR@2022)