Swin-Caption: Swin Transformer-Based Image Captioning with Feature Enhancement and Multi-Stage Fusion

Author：

Indexed by：

Scopus

Abstract：

The objective of image captioning is to enable computers to autonomously generate human-like sentences that describe a given image. To address insufficiently accurate image feature extraction and the underutilization of visual information, we present Swin-Caption, a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion. First, the Swin Transformer serves as the encoder for extracting image features, and feature enhancement is applied to gather additional image feature information. Next, a multi-stage image and semantic fusion module is constructed to exploit the semantic information from past time steps. Finally, a two-layer LSTM decodes the semantic and image information to generate captions. In experiments and instance analysis on the public datasets Flickr8K, Flickr30K, and MS-COCO, the proposed model outperforms the baseline model.
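As a rough illustration of what a Swin Transformer encoder hands to a caption decoder, the sketch below computes the token count and channel width at each encoder stage. It is not taken from the paper: the helper name `swin_stage_shapes` is hypothetical, and the defaults (224-pixel input, 4-pixel patches, embedding width 96, four stages) are the commonly published Swin-T settings, assumed here because this record does not state which variant or input resolution Swin-Caption uses.

```python
from typing import List, Tuple


def swin_stage_shapes(image_size: int = 224, patch_size: int = 4,
                      embed_dim: int = 96,
                      num_stages: int = 4) -> List[Tuple[int, int]]:
    """Return (num_tokens, channels) for each Swin Transformer stage.

    Assumed defaults are the Swin-T settings; the paper's own
    configuration is not given in this record.
    """
    side = image_size // patch_size  # tokens per side after patch partition
    dim = embed_dim                  # channel width of the first stage
    shapes = []
    for stage in range(num_stages):
        if stage > 0:
            # Patch merging: each 2x2 group of neighbouring tokens becomes
            # one token, and the channel width doubles.
            side //= 2
            dim *= 2
        shapes.append((side * side, dim))
    return shapes


for i, (tokens, channels) in enumerate(swin_stage_shapes(), start=1):
    print(f"stage {i}: {tokens} tokens x {channels} channels")
```

Under these assumed settings the last stage yields a 7 x 7 grid of 49 feature vectors, which is the kind of region-level image representation an LSTM decoder would consume at each time step.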

Keyword：

Deep learning; attention mechanism; LSTM; Swin Transformer; image captioning

Author Community：

[ 1 ] [Liu, Lei]Beijing Univ Technol, Fac Sci, Beijing, Peoples R China
[ 2 ] [Jiao, Yidi]Beijing Univ Technol, Fac Sci, Beijing, Peoples R China
[ 3 ] [Li, Xiaoran]Beijing Univ Technol, Fac Sci, Beijing, Peoples R China
[ 4 ] [Li, Jing]Beijing Univ Technol, Fac Sci, Beijing, Peoples R China
[ 5 ] [Liu, Lei]Beijing Univ Technol, Beijing Inst Sci & Engn Comp, Beijing, Peoples R China
[ 6 ] [Wang, Haitao]China Natl Inst Standardizat, Fundamental Standardizat, Beijing, Peoples R China
[ 7 ] [Cao, Xinyu]China Natl Inst Standardizat, Fundamental Standardizat, Beijing, Peoples R China

Reprint Author's Address：

[Cao, Xinyu]China Natl Inst Standardizat, Fundamental Standardizat, Beijing, Peoples R China

Email：

liului_leilei@bjut.edu.cn

Source ：

INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS

ISSN： 1469-0268

Year： 2024

ESI Highly Cited Papers on the List： 0

30 Days PV： 6
