Author:

Yu, W. | Yu, J. | Bai, M. | Xiao, C.

Indexed by:

Scopus

Abstract:

Objective: Object detection is a fundamental task in computer vision that supports subsequent object tracking, semantic segmentation, and behavior recognition. Recent years have witnessed substantial progress in still-image object detection based on deep convolutional neural networks (DCNNs). Still-image object detection determines the category and position of each object in an image, whereas video object detection aims to locate moving objects in sequential frames and assign a category label to each one. The accuracy of video object detection suffers from degenerated object appearances in video, such as motion blur, multi-object occlusion, and rare poses. Still-image detectors achieve excellent results, but applying them directly to video is challenging. Exploiting the temporal and spatial information in video, most existing video object detection methods build on still-image detection and improve moving-object detection accuracy by enforcing spatiotemporal consistency.

Method: In this paper, we propose a video object detection method that fuses the single shot multibox detector (SSD) with spatiotemporal features. Under the SSD framework, temporal and spatial information of the video is brought into detection through an optical flow network and a feature pyramid network. On the one hand, a residual network (ResNet101) combined with four extra convolutional layers extracts a feature map from each frame of the video. An optical flow network estimates the optical flow fields between the current frame and multiple adjacent frames to enhance the features of the current frame: the feature maps of adjacent frames are warped (compensated) to the current frame according to the flow fields, and the compensated feature maps, together with the feature map of the current frame, are aggregated with adaptive weights. The adaptive weights indicate the importance of each compensated feature map to the current frame; cosine similarity measures how close a compensated feature map is to the feature map extracted from the current frame, so a compensated map similar to the current map receives a larger weight and a dissimilar one a smaller weight. An embedding network consisting of three convolutional layers is applied to the compensated feature maps and the current feature map to produce embedding feature maps, from which the adaptive weights are computed (a sketch of this aggregation appears after the abstract). On the other hand, the feature pyramid network extracts multiscale feature maps used to detect objects of different sizes: low-level and high-level feature maps detect smaller and larger objects, respectively. To address the weakness of the original SSD network on small objects, the low-level feature map is combined with the high-level feature map to enhance the semantic information of the low-level map, via an upsampling operation and a 1×1 convolutional layer. The upsampling operation brings the high-level feature map to the same resolution as the low-level feature map, and the 1×1 convolutional layer reduces the channel dimension of the low-level feature map to match that of the high-level feature map (see the fusion sketch after the abstract).
Then, the multiscale feature maps are input into the detection network to predict bounding boxes, and non-maximum suppression filters redundant boxes to produce the final detections (a minimal NMS sketch also follows the abstract).

Result: Experimental results show that the mean average precision (mAP) of the proposed method on the ImageNet VID (ImageNet video object detection) dataset reaches 72.0%, which is 24.5%, 3.6%, and 2.5% higher than the temporal convolutional network, the method combining a tubelet proposal network with a long short-term memory (LSTM) network, and the method combining SSD with a Siamese network, respectively. In addition, an ablation experiment is conducted with four network structures, namely, the 16-layer visual geometry group (VGG16) network, the ResNet101 network, the network combining ResNet101 with the feature pyramid network, and the network combining ResNet101 with spatiotemporal fusion. The network combining ResNet101 with spatiotemporal fusion improves the mAP score by 11.8%, 7.0%, and 1.2% over the first three structures, respectively. For further analysis, mAP scores on slow, medium, and fast objects are reported in addition to the overall mAP. Our method with optical flow improves the mAP on slow, medium, and fast objects by 0.6%, 1.9%, and 2.3%, respectively, compared with the network combining ResNet101 with the feature pyramid network. The results show that the proposed method improves the accuracy of video object detection, especially on fast-moving objects.

Conclusion: The proposed method exploits the temporal and spatial correlation of video through spatiotemporal fusion to improve detection accuracy. Using an optical flow network, the feature map of the current frame is compensated by the feature maps of multiple adjacent frames, and temporal feature fusion reduces both false negatives and false positives. In addition, the multiscale feature maps produced by the feature pyramid network detect objects of different sizes, and multiscale feature fusion enhances the semantic information of the low-level feature map, improving its ability to detect small objects. © 2021, Editorial Office of Journal of Image and Graphics. All rights reserved.
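The flow-guided compensation and adaptive weighting described in the Method section can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: warp, aggregate, and embed_net are hypothetical names, and the flow direction convention and the softmax normalization of the cosine weights are assumptions (the abstract states only that compensated maps more similar to the current frame receive larger weights).

import torch
import torch.nn.functional as F

def warp(feat, flow):
    """Warp an adjacent frame's feature map toward the current frame.

    feat: (N, C, H, W) feature map of an adjacent frame
    flow: (N, 2, H, W) estimated flow, channel 0 = x, channel 1 = y
          (direction convention is an assumption)
    """
    n, _, h, w = feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                            # (N, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    cx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    cy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return F.grid_sample(feat, torch.stack((cx, cy), dim=-1), align_corners=True)

def aggregate(cur_feat, warped_feats, embed_net):
    """Fuse the current feature map with flow-compensated neighbors using
    per-position cosine-similarity weights (softmax normalization assumed)."""
    cur_emb = embed_net(cur_feat)
    sims = []
    for f in [cur_feat] + warped_feats:
        # Cosine similarity between embedding maps, per spatial position.
        sims.append(F.cosine_similarity(embed_net(f), cur_emb, dim=1, eps=1e-6))
    w = torch.softmax(torch.stack(sims, dim=0), dim=0)           # (T, N, H, W)
    feats = torch.stack([cur_feat] + warped_feats, dim=0)        # (T, N, C, H, W)
    return (w.unsqueeze(2) * feats).sum(dim=0)                   # (N, C, H, W)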
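The multiscale fusion step, upsampling the high-level map and projecting the low-level map with a 1×1 convolution, can be sketched in the same spirit. The nearest-neighbor upsampling mode and the element-wise addition are assumptions; the abstract says only that the two maps are "combined".

import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    """Fuse a semantically strong high-level map into a low-level map:
    upsample the high-level map to the low-level resolution, project the
    low-level map to the high-level channel count with a 1x1 convolution,
    then combine (element-wise sum assumed)."""

    def __init__(self, low_ch, high_ch):
        super().__init__()
        self.lateral = nn.Conv2d(low_ch, high_ch, kernel_size=1)

    def forward(self, low_feat, high_feat):
        high_up = F.interpolate(high_feat, size=low_feat.shape[-2:], mode="nearest")
        return self.lateral(low_feat) + high_up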
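Finally, a plain-Python sketch of the greedy non-maximum suppression used to filter redundant bounding boxes; the 0.5 IoU threshold is a common default, not a value reported in the paper.

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it
    above iou_thresh, and repeat on the remainder.

    boxes: list of (x1, y1, x2, y2); scores: matching confidence list.
    Returns the indices of the kept boxes.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        remaining = []
        for j in order:
            # Intersection-over-union between box i and box j.
            xa, ya = max(boxes[i][0], boxes[j][0]), max(boxes[i][1], boxes[j][1])
            xb, yb = min(boxes[i][2], boxes[j][2]), min(boxes[i][3], boxes[j][3])
            inter = max(0.0, xb - xa) * max(0.0, yb - ya)
            area_i = (boxes[i][2] - boxes[i][0]) * (boxes[i][3] - boxes[i][1])
            area_j = (boxes[j][2] - boxes[j][0]) * (boxes[j][3] - boxes[j][1])
            iou = inter / (area_i + area_j - inter + 1e-9)
            if iou <= iou_thresh:
                remaining.append(j)
        order = remaining
    return keep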

Keyword:

Feature pyramid network; Single shot multibox detector (SSD); Object detection; Feature fusion; Optical flow

Author Community:

  • [ 1 ] [Yu W.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 2 ] [Yu J.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 3 ] [Bai M.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
  • [ 4 ] [Xiao C.]Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China

Source:

Journal of Image and Graphics

ISSN: 1006-8961

Year: 2021

Issue: 3

Volume: 26

Page: 542-555

Cited Count:

WoS CC Cited Count: 0

SCOPUS Cited Count: 6

ESI Highly Cited Papers on the List: 0
