Abstract:
Image-text retrieval is a challenging task because images and text are heterogeneous cross-modal data separated by a semantic gap. The key issue in image-text retrieval is how to learn a common feature space while preserving the semantic correspondence between image and text. Some existing works extract region features from images and word features from text to perform cross-modal alignment between local elements; others integrate relation-aware information into local elements to compute cross-modal similarity. However, these methods do not exploit semantic information at different semantic levels. To address this issue, we propose a Bottom-up Progressive Semantic Alignment (BPSA) network, in which precise fine-grained alignment is carried out progressively across diverse semantic levels. Specifically, features of the cross-modal data are extracted from bottom-level elements to local groups and then to global representations by graph convolution and attention mechanisms. We conduct extensive experiments on the Flickr30K and MS-COCO datasets and compare our network with related state-of-the-art methods. The results show that our network achieves competitive performance. © 2021, Springer Nature Switzerland AG.
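The bottom-up feature path the abstract describes (bottom elements such as image regions or words, aggregated into relation-aware local groups by graph convolution, then pooled into a global representation by attention) can be sketched roughly as below. This is a minimal illustrative sketch in PyTorch; the class name, dimensions, and aggregation choices are assumptions for exposition, not the authors' BPSA implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BottomUpEncoder(nn.Module):
        """Hypothetical sketch: element -> local-group -> global features."""
        def __init__(self, dim=1024):
            super().__init__()
            self.gcn = nn.Linear(dim, dim)   # graph-convolution weight
            self.attn = nn.Linear(dim, 1)    # attention scoring for pooling

        def forward(self, x, adj):
            # x:   (n, dim) bottom-element features (image regions or words)
            # adj: (n, n) row-normalized relation graph over the elements
            # Local-group features: one GCN step aggregates each element's
            # neighbors, with a residual connection back to the element.
            local = F.relu(self.gcn(adj @ x)) + x
            # Global representation: attention-weighted sum of local groups.
            w = torch.softmax(self.attn(local), dim=0)   # (n, 1)
            global_feat = (w * local).sum(dim=0)         # (dim,)
            return local, global_feat

    # Usage: encode 36 region features with a uniform relation graph.
    regions = torch.randn(36, 1024)
    adj = torch.full((36, 36), 1.0 / 36)
    local, global_feat = BottomUpEncoder()(regions, adj)

In a retrieval setting, the image-side and text-side features produced at each level would then be aligned in the common space, e.g. via a cosine similarity between corresponding levels.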
ISSN: 1865-0929
Year: 2021
Volume: 1517 CCIS
Page: 417-424
Language: English
WoS CC Cited Count: 0
ESI Highly Cited Papers on the List: 0