Gaze Estimation by Attention-Induced Hierarchical Variational Auto-Encoder

IEEE Trans Cybern. 2024 Apr;54(4):2592-2605. doi: 10.1109/TCYB.2023.3312392. Epub 2024 Mar 18.

Abstract

Appearance-based gaze estimation has been widely studied in recent years with promising performance. Most appearance-based gaze estimation methods are developed under deterministic frameworks. However, deterministic methods suffer a large performance drop on challenging eye images, e.g., those with low resolution, darkness, or partial occlusion. To alleviate this problem, in this article we instead reformulate appearance-based gaze estimation under a generative framework. Specifically, we propose a variational inference model, the variational gaze estimation network (VGE-Net), which generates multiple gaze maps as complementary candidates, simultaneously supervised by the ground-truth gaze map. To achieve robust estimation, a regression network adaptively fuses the gaze directions predicted from these candidate gaze maps through a simple attention mechanism. Experiments on three benchmarks, MPIIGaze, EYEDIAP, and Columbia, demonstrate that VGE-Net outperforms state-of-the-art gaze estimation methods, especially on challenging cases. Comprehensive ablation studies also validate the effectiveness of our contributions. The code will be publicly released.
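The abstract's fusion step, attention-weighted combination of the gaze directions predicted from the candidate gaze maps, can be illustrated with a minimal sketch. This is not the paper's implementation: the candidate values, the attention scores (which the paper would produce with a learned network), and the function names here are all hypothetical, and the sketch only shows the softmax-weighted convex combination of K candidate (yaw, pitch) predictions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_gaze_candidates(candidates, scores):
    """Attention-weighted fusion of candidate gaze directions.

    candidates: (K, 2) array of (yaw, pitch) predictions, one per
        candidate gaze map (values here are illustrative only).
    scores: (K,) unnormalized attention scores; in the paper these
        would come from a learned attention module, not be hand-set.
    Returns the fused (yaw, pitch) estimate as a convex combination.
    """
    weights = softmax(scores)      # normalize scores to sum to 1
    return weights @ candidates    # weighted sum of candidates

# Three hypothetical candidates; the third is an outlier whose low
# attention score causes it to be down-weighted in the fused result.
cands = np.array([[0.10, -0.05],
                  [0.12, -0.04],
                  [0.50,  0.30]])
scores = np.array([2.0, 2.0, -3.0])
fused = fuse_gaze_candidates(cands, scores)
print(fused)  # fused estimate stays close to the two agreeing candidates
```

The attention weights form a probability distribution over candidates, so an unreliable candidate (e.g., from a degraded gaze map) contributes little to the final estimate, which is the robustness argument the abstract makes.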