Author:

Qing, C. | Yu, J. | Xiao, C. | Duan, J.

Indexed by:

Scopus

Abstract:

Semantic segmentation is a fundamental task in computer vision applications such as scene analysis and behavior recognition. Recent years have witnessed significant progress in semantic image segmentation based on deep convolutional neural networks (DCNNs). Semantic segmentation is a type of pixel-level image understanding whose objective is to assign a semantic label to each pixel of a given image. Whereas object detection only locates the bounding box of an object, semantic segmentation partitions an image into several meaningful objects and assigns a specific semantic label to each. The difficulty of image semantic segmentation originates mostly from three aspects: object, category, and background. From the object perspective, images of the same object differ significantly under different lighting, viewing angles, and distances, or depending on whether the object is still or moving; occlusion may also occur between adjacent objects. In terms of categories, objects from the same category exhibit dissimilarities, while objects from different categories exhibit similarities. From the background perspective, a simple background helps produce accurate segmentation results, but the backgrounds of real scenes are complex.

In this study, we provide a systematic review of recent advances in DCNN methods for semantic segmentation. We first discuss the difficulties and challenges in semantic segmentation and present datasets and quantitative metrics for evaluating the performance of these methods. We then detail how recent CNN-based semantic segmentation methods work and analyze their strengths and limitations. According to whether pixel-level labeled images are used to train the network, these methods are grouped into two categories: supervised and weakly supervised learning-based semantic segmentation. Supervised semantic segmentation requires pixel-level annotations; by contrast, weakly supervised semantic segmentation aims to segment images using class labels, bounding boxes, or scribbles. We divide supervised semantic segmentation models into four groups: encoder-decoder methods, feature map-based methods, probability map-based methods, and methods that combine various strategies.

In an encoder-decoder network, an encoder module gradually reduces the feature maps and captures high-level semantic information, while a decoder module gradually recovers spatial information. At present, most state-of-the-art deep CNNs for semantic segmentation originate from a common forerunner, the fully convolutional network (FCN), which is an encoder-decoder network. FCN transforms existing, well-known classification models, such as AlexNet, the visual geometry group 16-layer net (VGG16), GoogLeNet, and ResNet, into fully convolutional models by replacing fully connected layers with convolutional ones so that they output spatial maps instead of classification scores. Such maps are upsampled using deconvolutions to produce dense per-pixel labeled outputs. Feature map-based methods aim to take full advantage of the context information of a feature map, including its spatial context (position) and scale context (size), facilitating the segmentation and parsing of an image. These methods obtain the spatial and scale contexts by enlarging the receptive field and fusing multiscale information, effectively improving the performance of the network.
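To make the FCN transformation described above concrete, the following is a minimal PyTorch sketch, not the configuration of any surveyed model: a VGG16 backbone serves as the encoder, 1x1 convolutions stand in for the former fully connected layers, and a single learned 32x deconvolution recovers input resolution. The layer widths and kernel sizes are illustrative assumptions.

```python
# Minimal sketch of the FCN idea: replace the fully connected head of a
# classifier with 1x1 convolutions, then upsample the coarse score map
# back to input resolution with a transposed convolution (deconvolution).
import torch
import torch.nn as nn
from torchvision.models import vgg16

class FCN32s(nn.Module):
    def __init__(self, num_classes=21):  # 21 = PASCAL VOC classes + background
        super().__init__()
        # Reuse the VGG16 convolutional layers as the encoder (downsamples by 32).
        self.backbone = vgg16(weights=None).features
        # 1x1 convolutions replace the fully connected layers, producing a
        # spatial score map instead of a single classification vector.
        self.classifier = nn.Sequential(
            nn.Conv2d(512, 4096, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(4096, num_classes, kernel_size=1),
        )
        # Decoder: one learned 32x upsampling back to input resolution.
        self.upsample = nn.ConvTranspose2d(num_classes, num_classes,
                                           kernel_size=64, stride=32,
                                           padding=16)

    def forward(self, x):
        h = self.backbone(x)      # (N, 512, H/32, W/32)
        h = self.classifier(h)    # (N, C, H/32, W/32) coarse per-pixel scores
        return self.upsample(h)   # (N, C, H, W) dense predictions

scores = FCN32s()(torch.randn(1, 3, 224, 224))
print(scores.shape)  # torch.Size([1, 21, 224, 224])
```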
Some models, such as the pyramid scene parsing network and DeepLab v3, perform spatial pyramid pooling at several different scales (including image-level pooling) or apply several parallel atrous convolutions with different rates. These models have achieved promising results by incorporating the spatial and scale contexts.

Probability map-based methods combine the semantic context (probability) and the spatial context (location) by postprocessing probability score maps and semantic label predictions, primarily through a probabilistic graph model. A probabilistic graph model uses a graph to represent conditional dependence between random variables; it is the combination of probability theory and graph theory. Probabilistic graph models come in several types, such as conditional random fields (CRFs), Markov random fields, and Bayesian networks. By establishing semantic relationships between pixels, these methods refine object boundaries and improve network performance. This family of approaches typically includes CRF-recurrent neural networks, deep parsing networks, and EncNet. Some methods combine two or more of the aforementioned strategies to significantly improve the segmentation performance of a network, such as the global convolutional network, DeepLab v1, DeepLab v2, DeepLab v3+, and the discriminative feature network.

According to the type of weak supervision used to train the network, weakly supervised semantic segmentation methods are divided into four groups: class label-based, bounding box-based, scribble-based, and methods using various forms of annotations. Class-label annotations only indicate the presence of an object; thus, the substantial problem in class label-based methods is accurately assigning image-level labels to their corresponding pixels. In general, this problem is solved by training models with a multiple-instance-learning strategy or by adopting an alternating training procedure based on the expectation-maximization algorithm to dynamically predict semantic foreground and background pixels. A recent work improved the quality of object localization maps by integrating a seed region growing technique into the segmentation network, significantly increasing pixel accuracy. Bounding box-based methods use bounding boxes and class labels as supervision information. Candidate segmentation masks are generated with region proposal methods and traditional image segmentation theory, and a convolutional network is trained under the supervision of these approximate masks. BoxSup proposes a recursive training procedure in which a convolutional network is trained under the supervision of segment object proposals; in turn, the updated network improves the segmentation masks used for training. Scribble-supervised training methods apply a graphical model to propagate information from scribbles to unmarked pixels on the basis of spatial constraints, appearance, and semantic content, accounting for two tasks: the first propagates class labels from scribbles to other pixels to fully annotate an image; the second learns a convolutional network for semantic segmentation. We compare several supervised and weakly supervised semantic segmentation methods on the PASCAL VOC (pattern analysis, statistical modelling and computational learning visual object classes) 2012 dataset.
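The parallel atrous convolutions mentioned above can be sketched compactly. The module below is a hedged illustration in the spirit of DeepLab v3's atrous spatial pyramid pooling: parallel dilated convolutions plus image-level pooling, fused by a 1x1 projection. The channel widths and the rate set (6, 12, 18) follow common practice and are assumptions, not the exact published configuration.

```python
# Sketch of atrous spatial pyramid pooling (ASPP): parallel atrous
# convolutions with different rates capture multiscale context, and an
# image-level pooling branch adds global context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch=2048, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1)] +        # 1x1 branch
            [nn.Conv2d(in_ch, out_ch, kernel_size=3,
                       padding=r, dilation=r) for r in rates]  # atrous branches
        )
        self.image_pool = nn.Sequential(                        # global context
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, kernel_size=1),
        )
        # Fuse all branches (len(rates) atrous + 1x1 + pooled) with a projection.
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch,
                                 kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

out = ASPP()(torch.randn(1, 2048, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```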
We also identify the best-performing supervised and weakly supervised learning methods, along with their corresponding mean intersection-over-union (mIoU) scores. Lastly, we present related research areas, including video semantic segmentation, 3D dataset semantic segmentation, real-time semantic segmentation, and instance segmentation. Image semantic segmentation is a popular topic in the fields of computer vision and artificial intelligence. Many applications require accurate and efficient segmentation models, e.g., autonomous driving, indoor navigation, and smart medicine. Thus, further work should be conducted on semantic segmentation to improve the accuracy of object boundaries and the overall performance of segmentation models. © 2020, Editorial and Publishing Board of Journal of Image and Graphics. All rights reserved.
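Since the comparisons above are reported in mIoU, a small sketch of how the metric is computed may help. The helpers below accumulate a confusion matrix over a dataset and derive per-class IoU = TP / (TP + FP + FN); the label layout (classes 0..C-1, ignore index 255 as in PASCAL VOC) is an assumption.

```python
# Sketch of the mean intersection-over-union (mIoU) metric: accumulate a
# C x C confusion matrix from flattened label arrays, then average the
# per-class IoU values.
import numpy as np

def confusion_matrix(pred, gt, num_classes, ignore_index=255):
    """Accumulate a C x C confusion matrix from flat label arrays."""
    mask = gt != ignore_index                      # drop ignored pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(
        num_classes, num_classes)

def mean_iou(conf):
    """mIoU from a confusion matrix: diag / (row sum + col sum - diag)."""
    tp = np.diag(conf)
    union = conf.sum(axis=1) + conf.sum(axis=0) - tp
    iou = tp / np.maximum(union, 1)                # avoid division by zero
    return iou.mean()

gt = np.array([0, 0, 1, 1, 2, 255])
pred = np.array([0, 1, 1, 1, 2, 2])
print(mean_iou(confusion_matrix(pred, gt, num_classes=3)))  # ~0.722
```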

Keyword:

Supervised learning; Semantic segmentation; Weakly supervised learning; Convolutional neural network (CNN)

Author Community:

  • [ 1 ] [Qing C.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 2 ] [Yu J.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 3 ] [Xiao C.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 4 ] [Duan J.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China

Source:

Journal of Image and Graphics

ISSN: 1006-8961

Year: 2020

Issue: 6

Volume: 25

Page: 1069-1090

SCOPUS Cited Count: 18

ESI Highly Cited Papers on the List: 0
