Author:

Chen, H. | Shen, F. | Ding, D. | Deng, Y. | Li, C.

Indexed by:

EI; Scopus; SCIE

Abstract:

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the modalities without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. In contrast, we disentangle cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modal-specific highlights. Accordingly, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly strengthen these two complementary modes. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks; experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability. © 1992-2012 IEEE.
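The fusion scheme described in the abstract can be illustrated concretely. Below is a minimal PyTorch sketch, not the authors' DFP implementation: the class and layer names (DisentangledFusionSketch, align_gate, fuse_private) and all dimensions are hypothetical. It models context disentanglement as per-modality global self-attention plus a spatially aligned (per-position) cross-modal gate, and representation disentanglement as a channel split into consistent and private halves fused by element-wise enhancement and concatenation, respectively.

```python
import torch
import torch.nn as nn


class DisentangledFusionSketch(nn.Module):
    """One fusion layer: context + representation disentanglement (illustrative sketch)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Context disentanglement (a): global self-attention within each modality.
        self.rgb_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context disentanglement (b): spatially aligned inter-modal exchange,
        # simplified here to a per-position gate instead of all-pairs attention.
        self.align_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Representation disentanglement: fuse the "private" channel halves.
        self.fuse_private = nn.Linear(dim, dim // 2)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (batch, tokens, dim); tokens are assumed spatially aligned.
        rgb, _ = self.rgb_attn(rgb, rgb, rgb)            # intra-modal global context
        depth, _ = self.depth_attn(depth, depth, depth)
        gate = self.align_gate(torch.cat([rgb, depth], dim=-1))
        depth = gate * depth                             # local cross-modal correlation
        # Channel split: first half "consistent", second half "private".
        c = rgb.size(-1) // 2
        consistent = rgb[..., :c] * depth[..., :c]       # enhance common regions
        private = self.fuse_private(                     # supplement modality-specific cues
            torch.cat([rgb[..., c:], depth[..., c:]], dim=-1))
        return torch.cat([consistent, private], dim=-1)


if __name__ == "__main__":
    layer = DisentangledFusionSketch()
    out = layer(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
    print(out.shape)  # torch.Size([2, 196, 64])
```

In the paper this strategy is applied progressively across layers to form the DFP; the sketch shows only a single fusion step under the stated assumptions.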

Keyword:

cross-modal attention disentanglement; RGB-D salient object detection; transformer

Author Community:

  • [ 1 ] [Chen H.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 2 ] [Chen H.]Key Lab. of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Nanjing, 211189, China
  • [ 3 ] [Shen F.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 4 ] [Shen F.]Key Lab. of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Nanjing, 211189, China
  • [ 5 ] [Ding D.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 6 ] [Deng Y.]Beijing University of Technology, College of Computer Science, Beijing, 100124, China
  • [ 7 ] [Li C.]Alibaba Group, Hangzhou, 311121, China

Source:

IEEE Transactions on Image Processing

ISSN: 1057-7149

Year: 2024

Volume: 33

Page: 1699-1709

Impact Factor: 10.600 (JCR@2022)

Cited Count:

WoS CC Cited Count: 0

SCOPUS Cited Count: 10

ESI Highly Cited Papers on the List: 0
