Adaptive Local Spatiotemporal Features from RGB-D Data for One-Shot Learning Gesture Recognition

Jia Lin; Xiaogang Ruan; Naigong Yu; Yee-Hong Yang

doi:10.3390/s16122171

Sensors (Basel). 2016 Dec 17;16(12):2171. doi: 10.3390/s16122171.

Authors

Jia Lin¹,², Xiaogang Ruan³,⁴, Naigong Yu⁵,⁶, Yee-Hong Yang⁷

Affiliations

¹ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. linjia.bjut@gmail.com.
² Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing 100124, China. linjia.bjut@gmail.com.
³ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. adrxg@bjut.edu.cn.
⁴ Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing 100124, China. adrxg@bjut.edu.cn.
⁵ Faculty of Information Technology, Beijing University of Technology, Beijing 100124, China. yunaigong@bjut.edu.cn.
⁶ Beijing Key Laboratory of Computational Intelligence and Intelligent System, Beijing 100124, China. yunaigong@bjut.edu.cn.
⁷ Department of Computing Science, University of Alberta, Edmonton, AB T6G2E8, Canada. herberty@ualberta.ca.

Abstract

Noise and fixed, empirically chosen motion constraints hinder the extraction of distinctive spatiotemporal features when only one or a few samples are available per gesture class. To address these problems, an adaptive local spatiotemporal feature (ALSTF) using fused RGB-D data is proposed. First, motion regions of interest (MRoIs) are adaptively extracted using grayscale and depth velocity variance information, which greatly reduces the impact of noise. Then, within each MRoI, corners are selected as keypoints if their depth, grayscale velocity, and depth velocity satisfy several adaptive local constraints. With further noise filtering, a sufficient number of accurate keypoints is obtained within the desired moving body parts (MBPs). Finally, four kinds of descriptors are calculated and combined in extended gradient and motion spaces to represent the appearance and motion features of gestures. Experimental results on the ChaLearn gesture, CAD-60 and MSRDailyActivity3D datasets demonstrate that the proposed feature outperforms published state-of-the-art approaches under the one-shot learning setting and achieves comparable accuracy under leave-one-out cross-validation.

Keywords: adaptive; gesture recognition; motion region of interest; one-shot learning; optical flow; spatiotemporal feature.
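
The first step named in the abstract, selecting motion regions from the variance of grayscale and depth velocities, can be illustrated with a rough sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `motion_regions`, the use of simple frame differences in place of optical flow, the fixed block size, and the choice of the per-clip mean variance as the adaptive threshold.

```python
import numpy as np


def motion_regions(gray, depth, block=8):
    """Return a boolean grid marking blocks that move in both modalities.

    gray, depth: arrays of shape (T, H, W) holding a grayscale clip and the
    aligned depth clip. The result has shape (H // block, W // block).
    Hypothetical helper; not the authors' algorithm.
    """

    def block_variance(frames):
        # Assumption: temporal frame differences stand in for velocity.
        vel = np.diff(frames.astype(np.float64), axis=0)  # (T-1, H, W)
        t, h, w = vel.shape
        hb, wb = h // block, w // block
        # Crop so the frame divides evenly into blocks, then split into blocks.
        vel = vel[:, : hb * block, : wb * block]
        blocks = vel.reshape(t, hb, block, wb, block)
        # Variance over time and over the pixels inside each block.
        return blocks.var(axis=(0, 2, 4))

    var_gray = block_variance(gray)
    var_depth = block_variance(depth)
    # Assumption: the threshold adapts to each clip by using the mean
    # block variance of that clip, separately for each modality.
    return (var_gray > var_gray.mean()) & (var_depth > var_depth.mean())


if __name__ == "__main__":
    # Synthetic clip: only the top-left 8x8 block flickers over time.
    gray = np.zeros((6, 32, 32))
    depth = np.zeros((6, 32, 32))
    gray[1::2, :8, :8] = 10.0
    depth[1::2, :8, :8] = 5.0
    print(motion_regions(gray, depth).astype(int))
```

Requiring a block to exceed the threshold in both grayscale and depth is one simple way to suppress noise that appears in only one modality; a fully static clip yields an empty mask because no block variance exceeds the mean.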