Abstract:
Image-text multimodal sentiment analysis aims to predict sentiment polarity by integrating visual and textual modalities. The key to this task is obtaining high-quality multimodal representations of both modalities and fusing them efficiently. Therefore, a cross-modal multi-level fusion sentiment analysis method based on a visual language model (MFVL) is proposed. First, building on a pre-trained visual language model, high-quality multimodal representations and modality-bridging representations are generated by freezing the pre-trained parameters and fine-tuning the large language model with low-rank adaptation. Second, a cross-modal multi-head co-attention fusion module is designed to perform interactive weighted fusion of the visual and textual representations. Finally, a mixture-of-experts module is designed to deeply fuse the visual, textual, and modality-bridging representations for multimodal sentiment analysis. Experimental results show that MFVL achieves state-of-the-art performance on the public evaluation datasets MVSA-Single and HFM. © 2024 Science Press. All rights reserved.
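The low-rank adaptation (LoRA) step mentioned above — freezing the pre-trained weights and training only a small low-rank update — can be sketched roughly as follows. This is a minimal NumPy illustration of the general LoRA idea, not the paper's implementation; all dimensions, names, and the scaling convention here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, r = 8, 8, 2  # hypothetical layer dims; rank r << d
alpha = 4.0               # LoRA scaling factor (assumed convention)

W = rng.standard_normal((d_out, d_in))      # frozen pre-trained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, r))                    # trainable; zero init so the
                                            # update starts at exactly 0

def lora_forward(x):
    # frozen path plus low-rank update scaled by alpha / r
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# with B = 0 the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)
```

Only `A` and `B` (of size `r * (d_in + d_out)`) would be updated during fine-tuning, which is what makes adapting a large frozen model cheap.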
Source:
Pattern Recognition and Artificial Intelligence
ISSN: 1003-6059
Year: 2024
Issue: 5
Volume: 37
Page: 459-468
Cited Count:
SCOPUS Cited Count: 2
ESI Highly Cited Papers on the List: 0
30 Days PV: 6