Author:

Chen, H. | Shen, F. | Ding, D. | Deng, Y. | Li, C.

Indexed by:

EI; Scopus; SCIE

Abstract:

Previous multi-modal transformers for RGB-D salient object detection (SOD) generally connect all patches from the two modalities directly to model cross-modal correlation, and combine the modalities without differentiation, which can lead to confusing and inefficient fusion. Instead, we disentangle the cross-modal complementarity from two views to reduce fusion ambiguity: 1) Context disentanglement. We argue that modeling long-range dependencies across modalities, as done before, is uninformative due to the severe modality gap. In contrast, we disentangle cross-modal complementary contexts into intra-modal self-attention, which explores global complementary understanding, and spatially aligned inter-modal attention, which captures local cross-modal correlations. 2) Representation disentanglement. Unlike the previous undifferentiated combination of cross-modal representations, we find that cross-modal cues complement each other by enhancing common discriminative regions and by mutually supplementing modal-specific highlights. Accordingly, we divide the tokens into consistent and private ones along the channel dimension to disentangle the multi-modal integration path and explicitly strengthen these two complementary modes. By progressively propagating this strategy across layers, the proposed Disentangled Feature Pyramid module (DFP) enables informative cross-modal, cross-level integration and better fusion adaptivity. Comprehensive experiments on a large variety of public datasets verify the efficacy of our context and representation disentanglement and show consistent improvements over state-of-the-art models. Additionally, our cross-modal attention hierarchy is plug-and-play for different backbone architectures (both transformer and CNN) and downstream tasks; experiments on a CNN-based model and on RGB-D semantic segmentation verify this generalization ability. © 1992-2012 IEEE.
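The fusion scheme described in the abstract can be illustrated concretely. Below is a minimal PyTorch sketch, not the authors' DFP implementation: the class and layer names (DisentangledFusionSketch, align_gate, fuse_private) and all dimensions are hypothetical. It models context disentanglement as per-modality global self-attention plus a spatially aligned (per-position) cross-modal gate, and representation disentanglement as a channel split into consistent and private halves fused by element-wise enhancement and concatenation, respectively.

```python
import torch
import torch.nn as nn


class DisentangledFusionSketch(nn.Module):
    """One fusion layer: context + representation disentanglement (illustrative sketch)."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        # Context disentanglement (a): global self-attention within each modality.
        self.rgb_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.depth_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Context disentanglement (b): spatially aligned inter-modal exchange,
        # simplified here to a per-position gate instead of all-pairs attention.
        self.align_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Representation disentanglement: fuse the "private" channel halves.
        self.fuse_private = nn.Linear(dim, dim // 2)

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # rgb, depth: (batch, tokens, dim); tokens are assumed spatially aligned.
        rgb, _ = self.rgb_attn(rgb, rgb, rgb)            # intra-modal global context
        depth, _ = self.depth_attn(depth, depth, depth)
        gate = self.align_gate(torch.cat([rgb, depth], dim=-1))
        depth = gate * depth                             # local cross-modal correlation
        # Channel split: first half "consistent", second half "private".
        c = rgb.size(-1) // 2
        consistent = rgb[..., :c] * depth[..., :c]       # enhance common regions
        private = self.fuse_private(                     # supplement modality-specific cues
            torch.cat([rgb[..., c:], depth[..., c:]], dim=-1))
        return torch.cat([consistent, private], dim=-1)


if __name__ == "__main__":
    layer = DisentangledFusionSketch()
    out = layer(torch.randn(2, 196, 64), torch.randn(2, 196, 64))
    print(out.shape)  # torch.Size([2, 196, 64])
```

In the paper this strategy is applied progressively across layers to form the DFP; the sketch shows only a single fusion step under the stated assumptions.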

Keyword:

cross-modal attention disentanglement; RGB-D salient object detection; transformer

Author Community:

  • [ 1 ] [Chen H.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 2 ] [Chen H.]Key Lab. of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Nanjing, 211189, China
  • [ 3 ] [Shen F.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 4 ] [Shen F.]Key Lab. of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education, Nanjing, 211189, China
  • [ 5 ] [Ding D.]Southeast University, School of Computer Science and Engineering, Nanjing, 211189, China
  • [ 6 ] [Deng Y.]Beijing University of Technology, College of Computer Science, Beijing, 100124, China
  • [ 7 ] [Li C.]Alibaba Group, Hangzhou, 311121, China

Source:

IEEE Transactions on Image Processing

ISSN: 1057-7149

Year: 2024

Volume: 33

Page: 1699-1709

Impact Factor: 10.600 (JCR@2022)

Cited Count:

WoS CC Cited Count: 0

SCOPUS Cited Count: 10

ESI Highly Cited Papers on the List: 0
