Evaluation of Glottal Inverse Filtering Algorithms Using a Physiologically Based Articulatory Speech Synthesizer

Yu-Ren Chien; Daryush D Mehta; Jón Guðnason; Matías Zañartu; Thomas F Quatieri

doi:10.1109/taslp.2017.2714839

IEEE/ACM Trans Audio Speech Lang Process. 2017 Aug;25(8):1718-1730. doi: 10.1109/taslp.2017.2714839. Epub 2017 Jun 12.

Authors

Yu-Ren Chien¹, Daryush D Mehta², Jón Guðnason¹, Matías Zañartu³, Thomas F Quatieri⁴

Affiliations

¹ Center for Analysis and Design of Intelligent Agents, Reykjavik University, Menntavegur 1, Iceland.
² Center for Laryngeal Surgery and Voice Rehabilitation, and Institute of Health Professions, Massachusetts General Hospital, Boston, MA 02114 USA, with the Department of Surgery, Harvard Medical School, Boston, MA 02115 USA, and also with MIT Lincoln Laboratory, Lexington, MA.
³ Department of Electronic Engineering, Universidad Técnica Federico Santa María, Valparaíso, Chile, 2390123.
⁴ MIT Lincoln Laboratory, Lexington, MA.

Abstract

Glottal inverse filtering aims to estimate the glottal airflow signal from a speech signal for applications such as speaker recognition and clinical voice assessment. However, evaluating inverse filtering algorithms has been challenging because glottal airflow is difficult to measure directly. In addition, the performance of many methods is known to degrade in voice conditions that are of great interest, such as breathiness, high pitch, soft voice, and running speech. This paper presents a comprehensive, objective, and comparative evaluation of state-of-the-art inverse filtering algorithms that takes advantage of speech and glottal airflow signals generated by a physiological speech synthesizer. The synthesizer provides a physics-based simulation of the voice production process and thus an adequate test bed for revealing the temporal and spectral performance characteristics of each algorithm. The synthetic data include continuous speech utterances and sustained vowels, produced with multiple voice qualities (pressed, slightly pressed, modal, slightly breathy, and breathy), fundamental frequencies, and subglottal pressures to simulate the natural variations in real speech. The accuracy of a glottal flow estimate is evaluated with multiple error measures: an error in the estimated signal that measures overall waveform deviation, and an error in each of several clinically relevant features extracted from the glottal flow estimate. Waveform errors in the glottal flow estimation experiments had mean values of around 30% of the amplitude of the true glottal flow derivative for sustained vowels, and around 40% for continuous speech. Closed-phase approaches showed remarkable stability across different voice qualities and subglottal pressures. The algorithms of choice, as suggested by significance tests, are closed-phase covariance analysis for the analysis of sustained vowels, and sparse linear prediction for the analysis of continuous speech. Results of data subset analysis suggest that analysis of close rounded vowels is an additional challenge in glottal flow estimation.
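
To make the idea of a waveform error expressed as a percentage of the true signal's amplitude concrete, the following minimal sketch may help. It is an illustration only. The abstract does not give the paper's exact error definition, so the function name `waveform_error`, the use of mean absolute deviation, and the use of peak-to-peak amplitude as the normalizer are all assumptions, and the sample values are made up for the example.

```python
def waveform_error(true_dflow, est_dflow):
    """Mean absolute deviation between an estimated and a true glottal flow
    derivative, as a fraction of the true signal's peak-to-peak amplitude.

    Assumption: this normalization is illustrative and may differ from the
    error measure actually used in the paper.
    """
    if len(true_dflow) != len(est_dflow) or not true_dflow:
        raise ValueError("signals must be non-empty and equal in length")
    # Peak-to-peak amplitude of the reference (true) signal.
    amplitude = max(true_dflow) - min(true_dflow)
    if amplitude == 0:
        raise ValueError("true signal has zero amplitude")
    # Average sample-by-sample absolute deviation.
    mean_abs = sum(abs(e - t) for t, e in zip(true_dflow, est_dflow)) / len(true_dflow)
    return mean_abs / amplitude


# Hypothetical toy samples, not data from the paper.
true_dflow = [0.0, 1.0, 0.0, -3.0, 0.0]
est_dflow = [0.0, 0.5, 0.5, -2.0, 0.0]
error = waveform_error(true_dflow, est_dflow)
print(f"{error:.1%}")  # prints "10.0%"
```

Under this reading, an error of 30% would mean that the estimate deviates from the true glottal flow derivative by, on average, about 30% of that signal's amplitude.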

Keywords: Performance evaluation; glottal excitation; glottal flow estimation; inverse filtering; speech analysis; speech synthesis; voice production.
