On Space-Time Filtering Framework for Matching Human Actions Across Different Viewpoints

IEEE Trans Image Process. 2018 Mar;27(3):1230-1242. doi: 10.1109/TIP.2017.2765821. Epub 2017 Oct 23.

Abstract

Space-time template matching is considered a promising approach to human action recognition. However, a major drawback of template-based methods is the computational overhead of matching in the spatial domain. Recently, space-time correlation-based action filters have been proposed for recognizing human actions in the frequency domain. These action filters reduce time complexity because Fourier-transform-based matching is faster than spatial template matching. However, the utility of such action filters is limited by a number of factors: 1) they cannot handle view variations, as they offer no inherent support for view invariance; 2) they can be trained for only one action class at a time, so a separate filter is required for each class, increasing computational overhead; 3) they simply average similar action instances and thus behave no better than average filters; and 4) slightly misaligned action data sets create problems because these filters are not shift-invariant. In this paper, we address these shortcomings by proposing an advanced space-time filtering framework for recognizing human actions despite large viewpoint variations. Rather than using raw intensity values, we use a 3D tensor structure at each pixel that characterizes the most common local motion in action sequences. A discrete tensor Fourier transform is then applied to obtain frequency-domain representations. We then form view clusters from multiple-view action data and apply space-time correlation filtering to obtain discriminative view representations, which are used in a novel way to recognize actions despite viewpoint variations. Extensive experiments are performed on well-known multiple-view action data sets, including the IXMAS, WVU, and N-UCLA data sets. A detailed performance comparison with existing view-invariant action recognition techniques shows that our approach works equally well for RGB and RGB-D video data with improved accuracy and efficiency.
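To illustrate only the efficiency argument behind frequency-domain matching (not the paper's tensor-valued, view-clustered filter design), the sketch below correlates a space-time template with a video volume via the FFT instead of sliding it spatially. The function name, array layout, and test data are assumptions made for this example.

```python
import numpy as np

def spacetime_correlation(video, template):
    """Illustrative sketch: cross-correlate a space-time template with a
    video volume in the frequency domain via the convolution theorem.

    video:    3D array (T, H, W) of intensities
    template: 3D array (t, h, w) with t <= T, h <= H, w <= W
    Returns a correlation volume the size of `video`; its peak location
    indicates the best space-time alignment of the template.
    """
    # Zero-pad the template to the video size so both share one FFT grid.
    padded = np.zeros(video.shape, dtype=np.float64)
    t, h, w = template.shape
    padded[:t, :h, :w] = template

    # corr(f, g) = IFFT( FFT(f) * conj(FFT(g)) )  (circular correlation)
    F_video = np.fft.fftn(video)
    F_templ = np.fft.fftn(padded)
    return np.fft.ifftn(F_video * np.conj(F_templ)).real

# Usage: locate a short action snippet embedded in a longer sequence.
rng = np.random.default_rng(0)
video = rng.standard_normal((64, 48, 64))
template = video[10:26, 8:24, 16:32].copy()   # embedded snippet
corr = spacetime_correlation(video, template)
print(np.unravel_index(np.argmax(corr), corr.shape))  # expected (10, 8, 16)
```

A single FFT-multiply-IFFT pass evaluates the correlation at every space-time offset at once, which is the source of the speedup over exhaustive spatial template matching that the abstract refers to.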