Discrimination Between Native and Non-Native Speech Using Visual Features Only

IEEE Trans Cybern. 2016 Dec;46(12):2758-2771. doi: 10.1109/TCYB.2015.2488592. Epub 2015 Oct 26.

Abstract

Accent is a soft biometric trait that can be inferred from pronunciation and articulation patterns characterizing the speaking style of an individual. Past research has addressed the task of classifying an accent as belonging to a native or a foreign language speaker by means of the audio modality only. However, features extracted from the visual stream of speech have been successfully used to extend or substitute audio-only approaches that target speech or language recognition. Motivated by these findings, we investigate to what extent temporal visual speech dynamics attributed to accent can be modeled and identified when the audio stream is missing or noisy, and the speech content is unknown. We present here a fully automated approach to discriminating native from non-native English speech, based exclusively on visual cues. A systematic evaluation of various appearance and shape features for the target problem is conducted, with the former consistently yielding superior performance. Subject-independent cross-validation experiments are conducted on mobile phone recordings of continuous speech and isolated word utterances spoken by 56 subjects from the challenging MOBIO database. High performance is achieved on a text-dependent (TD) protocol, with the best score of 76.5% yielded by fusion of five hidden Markov models trained on appearance features. Our framework also remains effective when tested on examples of speech unseen in the training phase, although it performs less accurately than in the TD case.
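
The subject-independent cross-validation mentioned above can be illustrated with a minimal sketch, in which every utterance of a given speaker falls entirely in either the training or the test partition of a fold. The function name `subject_independent_folds`, the subject identifiers, the three utterances per subject, and the choice of seven folds are all hypothetical and chosen for illustration only; the abstract does not state the fold count or the partitioning procedure actually used.

```python
from collections import defaultdict


def subject_independent_folds(subject_ids, n_folds):
    """Yield (train_idx, test_idx) pairs in which no subject appears on both sides.

    subject_ids: one subject label per utterance (hypothetical labels).
    n_folds: number of folds (an assumption; not given in the abstract).
    """
    # Group utterance indices by the subject who spoke them.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    subjects = sorted(by_subject)
    if n_folds < 2 or n_folds > len(subjects):
        raise ValueError("n_folds must be between 2 and the number of subjects")

    for k in range(n_folds):
        # Every n_folds-th subject, starting at offset k, is held out for testing.
        held_out = subjects[k::n_folds]
        held_out_set = set(held_out)
        test_idx = [i for s in held_out for i in by_subject[s]]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out_set]
        yield train_idx, test_idx


# Illustrative data: 56 subjects (as in the abstract), 3 utterances each (assumed).
subject_ids = [f"s{n:02d}" for n in range(56) for _ in range(3)]
folds = list(subject_independent_folds(subject_ids, n_folds=7))
```

Splitting by speaker rather than by utterance keeps a classifier from scoring well simply by recognizing a face it saw during training, which is the point of a subject-independent protocol.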