Abstract:
Image captioning aims to enable computers to automatically generate human-like sentences that describe a given image. To address insufficient accuracy in image feature extraction and underuse of visual information, we propose a Swin Transformer-based image captioning model with feature enhancement and multi-stage fusion. First, the Swin Transformer serves as the encoder to extract image features, and feature enhancement is applied to capture richer feature information. Then, a multi-stage image-semantic fusion module is constructed to exploit semantic information from past time steps. Finally, an LSTM decodes the semantic and image information to generate captions. The proposed model outperforms baselines on the public Flickr8K and Flickr30K datasets. © 2023 IEEE.
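As a rough illustration (not the authors' code), the fusion of image features with semantic information from past time steps could be sketched as a simple learned gate that blends the two feature vectors element-wise before they reach the LSTM decoder. The feature dimension, the gating form, and all weights below are assumptions for demonstration only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(image_feat, semantic_feat, W_g, b_g):
    """Blend visual and semantic features with a learned gate.

    The gate g (values in (0, 1)) decides, per dimension, how much
    visual vs. past-step semantic information is passed to the decoder.
    """
    g = sigmoid(np.concatenate([image_feat, semantic_feat]) @ W_g + b_g)
    return g * image_feat + (1.0 - g) * semantic_feat

rng = np.random.default_rng(0)
d = 512                                    # feature dimension (assumed)
image_feat = rng.standard_normal(d)        # pooled encoder feature (stand-in)
semantic_feat = rng.standard_normal(d)     # semantic state from past time steps
W_g = rng.standard_normal((2 * d, d)) * 0.01
b_g = np.zeros(d)

fused = gated_fusion(image_feat, semantic_feat, W_g, b_g)
print(fused.shape)  # (512,)
```

Because the gate produces a per-dimension convex combination, each fused value always lies between the corresponding visual and semantic values, so the decoder never receives information outside the range of its two inputs.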
Year: 2023
Language: English
WoS CC Cited Count: 0
SCOPUS Cited Count: 1
ESI Highly Cited Papers on the List: 0