Abstract:
As an indispensable part of cross-media analysis, comprehending heterogeneous data poses challenges in visual question answering (VQA), visual captioning, and cross-modality retrieval, and bridging the semantic gap between the two modalities remains difficult. In this article, to address this problem in cross-modality retrieval, we propose a cross-modal learning model with joint correlative calculation learning. First, an auto-encoder is used to embed the visual features by minimizing the feature-reconstruction error, and a multi-layer perceptron (MLP) is used to embed the textual features. We then design a joint loss function that optimizes both the intra- and inter-correlations among image-sentence pairs, i.e., the reconstruction loss of the visual features, the relevant-similarity loss of paired samples, and the triplet relation loss between positive and negative examples. In the proposed method, we optimize the joint loss over a batch score matrix and utilize all mutually mismatched pairs to enhance performance. Our experiments on the retrieval tasks demonstrate the effectiveness of the proposed method, which achieves performance comparable to the state of the art on three benchmarks: Flickr8k, Flickr30k, and MS-COCO.
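The joint objective described above (visual reconstruction loss, paired-similarity loss, and a triplet ranking loss computed over a batch score matrix with all mismatched pairs as negatives) could be sketched as below. This is a minimal illustrative sketch, not the authors' code: the cosine-similarity scoring, the margin value, and the weights alpha and beta are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(img_emb, txt_emb, img_recon, img_feat,
               margin=0.2, alpha=1.0, beta=1.0):
    """Hypothetical sketch of the joint loss: reconstruction +
    paired similarity + triplet ranking over a batch score matrix."""
    # Auto-encoder reconstruction loss on the visual features.
    recon_loss = F.mse_loss(img_recon, img_feat)

    # Batch score matrix: similarity between every image and every sentence.
    img_n = F.normalize(img_emb, dim=1)
    txt_n = F.normalize(txt_emb, dim=1)
    scores = img_n @ txt_n.t()                                  # (B, B)

    # Relevant-similarity loss: push matched pairs (the diagonal) toward 1.
    pos = scores.diag()
    sim_loss = (1.0 - pos).mean()

    # Triplet relation loss using all mutually mismatched pairs as negatives.
    cost_img = (margin + scores - pos.view(-1, 1)).clamp(min=0)  # image anchors
    cost_txt = (margin + scores - pos.view(1, -1)).clamp(min=0)  # sentence anchors
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    triplet_loss = cost_img.masked_fill(mask, 0).mean() + \
                   cost_txt.masked_fill(mask, 0).mean()

    return recon_loss + alpha * sim_loss + beta * triplet_loss

# Example usage with random tensors (batch size and feature dims are assumptions):
B, dv, d = 8, 4096, 512
loss = joint_loss(torch.randn(B, d), torch.randn(B, d),
                  torch.randn(B, dv), torch.randn(B, dv))
```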
Source: ACM Transactions on Multimedia Computing, Communications, and Applications
ISSN: 1551-6857
Year: 2019
Issue: 2
Volume: 15
Impact Factor (JCR@2022): 5.100
ESI Discipline: Computer Science
ESI HC Threshold: 147
JCR Journal Grade:1
Cited Count:
WoS CC Cited Count: 16
SCOPUS Cited Count: 16
ESI Highly Cited Papers on the List: 0