Improving ultrasound-based multimodal speech recognition with predictive features from representation learning

JASA Express Lett. 2021 Jan;1(1):015205. doi: 10.1121/10.0003062.

Abstract

Representation learning is believed to produce high-level representations of the underlying dynamics of temporal sequences. A three-dimensional convolutional neural network trained to predict future frames of ultrasound tongue and optical lip images provides features for a continuous hidden-Markov-model-based speech recognition system. Predictive tongue features are found to yield lower word error rates than features obtained from an auto-encoder trained without future frames, or from discrete cosine transforms. The improvement holds for both monophone/triphone Gaussian mixture model and deep neural network acoustic models. When the tongue and lip modalities are combined, the advantage of the predictive features is reduced.
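To make the predictive objective concrete, the sketch below shows the core idea in minimal form: a 3D convolution over a context window of frames yields a feature volume, and a prediction derived from those features is scored against the held-out future frame. This is a hypothetical NumPy illustration of the training signal only; the single-filter setup, frame sizes, and the mean-over-time prediction are assumptions for clarity, not the network described in the paper.

```python
import numpy as np

def conv3d_valid(clip, kernel):
    """Naive valid-mode 3D convolution over a (time, height, width) clip.

    Illustrative stand-in for one filter of a 3D CNN layer.
    """
    t, h, w = clip.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(clip[i:i+kt, j:j+kh, k:k+kw] * kernel)
    return out

rng = np.random.default_rng(0)
frames = rng.standard_normal((9, 16, 16))   # stand-in for 9 ultrasound frames
context, future = frames[:-1], frames[-1]   # predict the final (future) frame

kernel = rng.standard_normal((3, 3, 3)) * 0.1
features = conv3d_valid(context, kernel)    # (6, 14, 14) feature volume

# Collapse the time axis into one predicted frame, then measure how
# well it matches the future frame; minimising this error is what
# forces the features to encode the sequence dynamics.
prediction = features.mean(axis=0)
pad = (future.shape[0] - prediction.shape[0]) // 2
target = future[pad:pad + prediction.shape[0], pad:pad + prediction.shape[1]]
mse = float(np.mean((prediction - target) ** 2))
print(features.shape, mse)
```

In the predictive setting, the intermediate feature volume (rather than the reconstruction) is what gets passed on as input to the speech recogniser; the contrast with a plain auto-encoder is that the target here is a frame the network has not seen.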