Author:

Qing, C. | Yu, J. | Xiao, C. | Duan, J.

Indexed by:

Scopus

Abstract:

Semantic segmentation is a fundamental task in computer vision applications such as scene analysis and behavior recognition. Recent years have witnessed significant progress in semantic image segmentation based on deep convolutional neural networks (DCNNs). Semantic segmentation is a type of pixel-level image understanding whose objective is to assign a semantic label to each pixel of a given image. Whereas object detection only locates the bounding box of an object, semantic segmentation partitions an image into several meaningful objects and assigns a specific semantic label to each. The difficulty of image semantic segmentation originates mostly from three aspects: object, category, and background. From the object perspective, images of the same object differ significantly under different lighting, viewing angles, and distances, or depending on whether the object is still or moving; occlusion may also occur between adjacent objects. In terms of categories, objects from the same category exhibit dissimilarities, while objects from different categories exhibit similarities. From the background perspective, a simple background helps produce accurate segmentation results, but the backgrounds of real scenes are complex.

In this study, we provide a systematic review of recent advances in DCNN methods for semantic segmentation. We first discuss the difficulties and challenges in semantic segmentation and present datasets and quantitative metrics for evaluating the performance of these methods. We then detail how recent CNN-based semantic segmentation methods work and analyze their strengths and limitations. According to whether pixel-level labeled images are used to train the network, these methods are grouped into two categories: supervised and weakly supervised learning-based semantic segmentation. Supervised semantic segmentation requires pixel-level annotations; by contrast, weakly supervised semantic segmentation aims to segment images using class labels, bounding boxes, or scribbles. We divide supervised semantic segmentation models into four groups: encoder-decoder methods, feature map-based methods, probability map-based methods, and methods that combine various strategies.

In an encoder-decoder network, an encoder module gradually reduces the feature maps and captures high-level semantic information, while a decoder module gradually recovers spatial information. At present, most state-of-the-art deep CNNs for semantic segmentation originate from a common forerunner, the fully convolutional network (FCN), which is an encoder-decoder network. FCN transforms existing, well-known classification models, such as AlexNet, the visual geometry group 16-layer net (VGG16), GoogLeNet, and ResNet, into fully convolutional models by replacing fully connected layers with convolutional ones so that they output spatial maps instead of classification scores. Such maps are upsampled using deconvolutions to produce dense per-pixel labeled outputs. Feature map-based methods aim to take full advantage of the context information of a feature map, including its spatial context (position) and scale context (size), facilitating the segmentation and parsing of an image. These methods obtain the spatial and scale contexts by enlarging the receptive field and fusing multiscale information, effectively improving the performance of the network.
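To make the FCN transformation described above concrete, the following is a minimal PyTorch sketch, not the configuration of any surveyed model: a VGG16 backbone serves as the encoder, 1x1 convolutions stand in for the former fully connected layers, and a single learned 32x deconvolution recovers input resolution. The layer widths and kernel sizes are illustrative assumptions.

```python
# Minimal sketch of the FCN idea: replace the fully connected head of a
# classifier with 1x1 convolutions, then upsample the coarse score map
# back to input resolution with a transposed convolution (deconvolution).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCN32s(nn.Module):
    def __init__(self, num_classes=21):  # 21 = PASCAL VOC classes + background
        super().__init__()
        # Reuse the VGG16 convolutional layers as the encoder (downsamples by 32).
        self.backbone = vgg16(weights=None).features
        # 1x1 convolutions replace the fully connected layers, producing a
        # spatial score map instead of a single classification vector.
        self.classifier = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )
        # Decoder: one learned 32x upsampling back to input resolution.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32,
                                           padding=16)

    def forward(self, x):
        h = self.backbone(x)      # (N, 512, H/32, W/32)
        h = self.classifier(h)    # (N, C, H/32, W/32) coarse per-pixel scores
        return self.upsample(h)   # (N, C, H, W) dense predictions

scores = FCN32s()(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 21, 224, 224])
```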
Some models, such as the pyramid scene parsing network and DeepLab v3, perform spatial pyramid pooling at several different scales (including image-level pooling) or apply several parallel atrous convolutions with different rates. These models have achieved promising results by incorporating the spatial and scale contexts.

Probability map-based methods combine the semantic context (probability) and the spatial context (location) by postprocessing probability score maps and semantic label predictions, primarily through a probabilistic graph model. A probabilistic graph model uses a graph to represent conditional dependence between random variables; it is the combination of probability theory and graph theory. Probabilistic graph models come in several types, such as conditional random fields (CRFs), Markov random fields, and Bayesian networks. By establishing semantic relationships between pixels, these methods refine object boundaries and improve network performance. This family of approaches typically includes CRF-recurrent neural networks, deep parsing networks, and EncNet. Some methods combine two or more of the aforementioned strategies to significantly improve the segmentation performance of a network, such as the global convolutional network, DeepLab v1, DeepLab v2, DeepLab v3+, and the discriminative feature network.

According to the type of weak supervision used to train the network, weakly supervised semantic segmentation methods are divided into four groups: class label-based, bounding box-based, scribble-based, and methods using various forms of annotations. Class-label annotations only indicate the presence of an object; thus, the substantial problem in class label-based methods is accurately assigning image-level labels to their corresponding pixels. In general, this problem is solved by training models with a multiple-instance-learning strategy or by adopting an alternating training procedure based on the expectation-maximization algorithm to dynamically predict semantic foreground and background pixels. A recent work improved the quality of object localization maps by integrating a seed region growing technique into the segmentation network, significantly increasing pixel accuracy. Bounding box-based methods use bounding boxes and class labels as supervision information. Candidate segmentation masks are generated with region proposal methods and traditional image segmentation theory, and a convolutional network is trained under the supervision of these approximate masks. BoxSup proposes a recursive training procedure in which a convolutional network is trained under the supervision of segment object proposals; in turn, the updated network improves the segmentation masks used for training. Scribble-supervised training methods apply a graphical model to propagate information from scribbles to unmarked pixels on the basis of spatial constraints, appearance, and semantic content, accounting for two tasks: the first propagates class labels from scribbles to other pixels to fully annotate an image; the second learns a convolutional network for semantic segmentation. We compare several supervised and weakly supervised semantic segmentation methods on the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) 2012 dataset.
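The parallel atrous convolutions mentioned above can be sketched compactly. The module below is a hedged illustration in the spirit of DeepLab v3's atrous spatial pyramid pooling: parallel dilated convolutions plus image-level pooling, fused by a 1x1 projection. The channel widths and the rate set (6, 12, 18) follow common practice and are assumptions, not the exact published configuration.

```python
# Sketch of atrous spatial pyramid pooling (ASPP): parallel atrous
# convolutions with different rates capture multiscale context, and an
# image-level pooling branch adds global context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)] +        # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, kernel_size=3,
                       padding=r, dilation=r) for r in rates]  # atrous branches
        )
        self.image_pool = nn.Sequential(                        # global context
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        # Fuse all branches (len(rates) atrous + 1x1 + pooled) with a projection.
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch,
                                 kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

out = ASPP()(torch.randn(1, 2048, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```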
We also identify the best-performing supervised and weakly supervised learning methods, along with their corresponding mean intersection-over-union (mIoU) scores. Lastly, we present related research areas, including video semantic segmentation, 3D dataset semantic segmentation, real-time semantic segmentation, and instance segmentation. Image semantic segmentation is a popular topic in the fields of computer vision and artificial intelligence. Many applications require accurate and efficient segmentation models, e.g., autonomous driving, indoor navigation, and smart medicine. Thus, further work should be conducted on semantic segmentation to improve the accuracy of object boundaries and the overall performance of segmentation models. © 2020, Editorial and Publishing Board of Journal of Image and Graphics. All rights reserved.
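Since the comparisons above are reported in mIoU, a small sketch of how the metric is computed may help. The helpers below accumulate a confusion matrix over a dataset and derive per-class IoU = TP / (TP + FP + FN); the label layout (classes 0..C-1, ignore index 255 as in PASCAL VOC) is an assumption.

```python
# Sketch of the mean intersection-over-union (mIoU) metric: accumulate a
# C x C confusion matrix from flattened label arrays, then average the
# per-class IoU values.
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a C x C confusion matrix from flat label arrays."""
    mask = gt != ignore_index                      # drop ignored pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def mean_iou(conf):
    """mIoU from a confusion matrix: diag / (row sum + col sum - diag)."""
    tp = np.diag(conf)
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1)                # avoid division by zero
    return iou.mean()

gt = np.array([0, 0, 1, 1, 2, 255])
pred = np.array([0, 1, 1, 1, 2, 2])
print(mean_iou(confusion_matrix(pred, gt, num_classes=3)))  # ~0.722
```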

Keyword:

Supervised learning; Semantic segmentation; Weakly supervised learning; Convolutional neural network (CNN)

Author Community:

  • [ 1 ] [Qing C.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 2 ] [Yu J.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 3 ] [Xiao C.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 4 ] [Duan J.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China

Source:

Journal of Image and Graphics

ISSN: 1006-8961

Year: 2020

Issue: 6

Volume: 25

Page: 1069-1090

SCOPUS Cited Count: 18

ESI Highly Cited Papers on the List: 0
