Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

Baojun Zhao; Boya Zhao; Linbo Tang; Yuqi Han; Wenzheng Wang

doi:10.3390/s18030774

Sensors (Basel). 2018 Mar 4;18(3):774. doi: 10.3390/s18030774.

Authors

Baojun Zhao¹ ², Boya Zhao³ ⁴, Linbo Tang⁵ ⁶, Yuqi Han⁷ ⁸, Wenzheng Wang⁹ ¹⁰

Affiliations

¹ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. zbj@bit.edu.cn.
² Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China. zbj@bit.edu.cn.
³ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. zhaoboya@bit.edu.cn.
⁴ Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China. zhaoboya@bit.edu.cn.
⁵ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. tanglinbo@bit.edu.cn.
⁶ Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China. tanglinbo@bit.edu.cn.
⁷ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. yuqi_han@bit.edu.cn.
⁸ Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China. yuqi_han@bit.edu.cn.
⁹ School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China. wwz@bit.edu.cn.
¹⁰ Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China. wwz@bit.edu.cn.

Abstract

With the development of deep neural networks, many object detection frameworks have achieved great success in smart surveillance, self-driving cars, and facial recognition. However, the data sources in these applications are usually videos, whereas most object detection frameworks are built on still images and use only spatial information. Because the training procedure discards temporal information, feature consistency cannot be ensured. To address this problem, we propose a single, fully convolutional neural network-based object detection framework that incorporates temporal information by using Siamese networks. In the training procedure, the prediction network first combines multiscale feature maps to handle objects of various sizes. Second, we introduce a correlation loss computed with the Siamese network, which provides the features of neighboring frames. This correlation loss represents object co-occurrences across time and aids consistent feature generation. Because the correlation loss requires both track IDs and detection labels, we evaluate our video object detection network on the large-scale ImageNet VID dataset, where it achieves a 69.5% mean average precision (mAP).

Keywords: Siamese network; deep neural network; multiscale feature representation; temporal information; video object detection.
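
The abstract does not give the formula for the correlation loss, so the following is only a minimal sketch of the general idea, under stated assumptions: a Siamese (shared-weight) network has already produced one feature vector per object in two neighboring frames, and each object carries a track ID. The function names, the cosine-similarity measure, and the margin parameter are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero feature as uncorrelated
    return dot / (norm_a * norm_b)


def correlation_loss(frame_t, frame_next, margin=0.0):
    """Hypothetical correlation loss between two neighboring frames.

    frame_t, frame_next: dict mapping track ID -> feature vector, assumed
    to come from the two branches of a shared-weight (Siamese) network.
    Pairs with the same track ID are pulled together (penalty 1 - sim);
    pairs with different track IDs are pushed apart
    (penalty max(0, sim - margin)). This is an illustrative assumption,
    not the paper's definition.
    """
    total, pairs = 0.0, 0
    for id_a, feat_a in frame_t.items():
        for id_b, feat_b in frame_next.items():
            sim = cosine_similarity(feat_a, feat_b)
            if id_a == id_b:
                total += 1.0 - sim
            else:
                total += max(0.0, sim - margin)
            pairs += 1
    return total / pairs if pairs else 0.0


# Two tracked objects whose features stay consistent across frames.
consistent = correlation_loss(
    {1: [1.0, 0.0], 2: [0.0, 1.0]},
    {1: [1.0, 0.0], 2: [0.0, 1.0]},
)
# The same objects with their features swapped between track IDs.
swapped = correlation_loss(
    {1: [1.0, 0.0], 2: [0.0, 1.0]},
    {1: [0.0, 1.0], 2: [1.0, 0.0]},
)
print(consistent, swapped)
```

In this sketch, features that stay consistent for each track ID give a lower loss than features that are swapped between track IDs, which is the kind of temporal consistency the abstract says the correlation loss is meant to encourage. A training setup along these lines would add such a term to the usual detection loss.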

MeSH terms

Information Storage and Retrieval
Neural Networks, Computer*