Author:

Duan, Li-Juan | Sun, Qi-Chao | Qiao, Yuan-Hua | Chen, Jun-Cheng | Cui, Guo-Qin

Indexed by:

EI, Scopus, CSCD

Abstract:

Semantic segmentation is a research hotspot in the field of computer vision. It refers to assigning every pixel in an image to a semantic class. As a fundamental problem in scene understanding, semantic segmentation is widely used in various intelligent tasks. In recent years, with the success of convolutional neural networks (CNNs) in many computer vision applications, fully convolutional networks (FCNs) have shown great potential on the RGB semantic segmentation task. However, semantic segmentation remains challenging due to the complexity of scene types, severe object occlusion and varying illumination. With the availability of consumer RGB-D sensors such as the RealSense 3D Camera and Microsoft Kinect, RGB images and depth information can now be captured at the same time. Depth information describes 3D geometric structure that may be missing in RGB-only images; it can significantly reduce classification errors and improve the accuracy of semantic segmentation. To make effective use of RGB and depth information, it is crucial to find an efficient multi-modal information fusion method. According to the fusion stage, current RGB-D feature fusion methods can be divided into three types: early fusion, late fusion and middle fusion. However, most previous studies fail to make effective use of the complementary information between the RGB and depth modalities: they simply fuse RGB features and depth features by equal-weight concatenation or summation, which fails to extract the complementary information between the two modalities and suppresses modality-specific information. In addition, the semantic information shared by high-level features across the two modalities is not taken into account, which is very important for fine-grained semantic segmentation. To solve the above problems, we present a novel Attention-aware and Semantic-aware Multi-modal Fusion Network (ASNet) for RGB-D semantic segmentation. Our network effectively fuses multi-level RGB-D features through Attention-aware Multi-modal Fusion (AMF) blocks and Semantic-aware Multi-modal Fusion (SMF) blocks. Specifically, in the AMF blocks, a cross-modal attention mechanism is designed so that RGB features and depth features guide and optimize each other through their complementary characteristics, yielding feature representations with rich spatial location information. The SMF blocks model the semantic interdependencies between multi-modal features by integrating semantically associated feature channels of the RGB and depth features, extracting more precise semantic feature representations. The two blocks are integrated into a two-branch encoder-decoder architecture, which restores image resolution gradually with consecutive up-sampling operations and combines low-level and high-level features through skip connections to achieve high-resolution prediction. To optimize the training process, we apply deeply supervised learning over the multi-level decoding features. Our network thus learns the complementary characteristics of the two modalities and models the semantic context interdependencies between RGB features and depth features.
Experimental results on two challenging public RGB-D indoor semantic segmentation datasets, SUN RGB-D and NYU Depth v2, show that our network outperforms existing RGB-D semantic segmentation methods, improving mean accuracy and mean IoU by 1.9% and 1.2%, respectively. © 2021, Science Press. All rights reserved.
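The abstract gives no implementation details, but the two fusion blocks can be illustrated with a short PyTorch sketch. The code below is a hypothetical reading of the AMF and SMF blocks as described above, not the authors' implementation: AMF is rendered as mutual spatial attention (each modality gates the other), SMF as a squeeze-and-excitation style channel gate over the concatenated features; all module names, channel sizes and layer choices are assumptions.

import torch
import torch.nn as nn

class AMF(nn.Module):
    # Attention-aware Multi-modal Fusion (hypothetical sketch): each
    # modality produces a spatial attention map that re-weights the
    # other modality's features, so RGB and depth guide each other.
    def __init__(self, channels):
        super().__init__()
        self.rgb_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())
        self.depth_gate = nn.Sequential(nn.Conv2d(channels, 1, kernel_size=1), nn.Sigmoid())

    def forward(self, rgb, depth):
        rgb_refined = rgb + rgb * self.depth_gate(depth)    # depth guides RGB
        depth_refined = depth + depth * self.rgb_gate(rgb)  # RGB guides depth
        return rgb_refined + depth_refined                  # fused feature map

class SMF(nn.Module):
    # Semantic-aware Multi-modal Fusion (hypothetical sketch): a
    # squeeze-and-excitation style gate over the concatenated RGB and
    # depth channels models cross-modal channel interdependencies.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb, depth):
        x = torch.cat([rgb, depth], dim=1)  # concatenate along channels
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return self.project(x * w)          # channel-reweighted fusion

For example, SMF(256)(rgb_feat, depth_feat) applied to two (B, 256, H, W) feature maps yields a (B, 256, H, W) fused map that a decoder stage can consume.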
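Likewise, the deeply supervised training mentioned above can be sketched as a sum of per-stage losses. The list of decoder outputs, the ignore label of 255 and the bilinear upsampling are assumptions for illustration, not details from the paper:

import torch.nn as nn
import torch.nn.functional as F

def deeply_supervised_loss(decoder_outputs, target):
    # Sum a cross-entropy term over every decoder stage so that each
    # intermediate prediction receives a direct supervision signal.
    criterion = nn.CrossEntropyLoss(ignore_index=255)  # assumed ignore label
    loss = 0.0
    for logits in decoder_outputs:  # list of (B, num_classes, h, w) maps
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode='bilinear', align_corners=False)
        loss = loss + criterion(logits, target)  # target: (B, H, W) labels
    return loss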

Keyword:

Semantics; Convolutional neural networks; Image resolution; Convolution; Computer vision; Decoding; Cameras; Semantic Web

Author Community:

  • [ 1 ] [Duan, Li-Juan] Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • [ 2 ] [Duan, Li-Juan] Beijing Key Laboratory of Trusted Computing, Beijing 100124, China
  • [ 3 ] [Duan, Li-Juan] National Engineering Laboratory for Key Technologies of Information Security Level Protection, Beijing 100124, China
  • [ 4 ] [Sun, Qi-Chao] Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • [ 5 ] [Sun, Qi-Chao] Beijing Key Laboratory of Trusted Computing, Beijing 100124, China
  • [ 6 ] [Sun, Qi-Chao] Advanced Institute of Information Technology, Peking University, Hangzhou 311200, China
  • [ 7 ] [Qiao, Yuan-Hua] College of Applied Sciences, Beijing University of Technology, Beijing 100124, China
  • [ 8 ] [Chen, Jun-Cheng] Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China
  • [ 9 ] [Cui, Guo-Qin] State Key Laboratory of Digital Multimedia Chip Technology, Vimicro Corporation, Beijing 100191, China

Reprint Author's Address:

  • [Chen, Jun-Cheng] Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China


Source:

Chinese Journal of Computers

ISSN: 0254-4164

Year: 2021

Issue: 2

Volume: 44

Page: 275-291

SCOPUS Cited Count: 5

ESI Highly Cited Papers on the List: 0
