Speech quality estimation with deep lattice networks

Michael Chinen; Jan Skoglund; Andrew Hines

doi:10.1121/10.0005130

J Acoust Soc Am. 2021 Jun;149(6):3851. doi: 10.1121/10.0005130.

Authors

Michael Chinen¹, Jan Skoglund¹, Andrew Hines²

Affiliations

¹ Chrome Media Audio, Google LLC, San Francisco, USA.
² School of Computer Science, University College Dublin, Dublin, Ireland.

PMID: 34241460
DOI: 10.1121/10.0005130

Abstract

Intrusive estimation of subjective speech quality, expressed as a mean opinion score (MOS), often involves mapping a raw similarity score, extracted from differences between the clean and degraded utterances, onto MOS with a fitted mapping function. More recent models such as support vector regression (SVR) or deep neural networks use multidimensional input, which allows more accurate prediction than one-dimensional (1-D) mappings but does not guarantee the monotonic relationship that is expected between similarity and quality. We investigate a multidimensional mapping function using deep lattice networks (DLNs) to provide monotonic constraints, with input features provided by ViSQOL. The DLN improved the speech mapping to a mean-square error of 0.24 on a mixture of datasets that include voice-over-IP and codec degradations, outperforming the 1-D fitted functions and SVR as well as PESQ and POLQA. Additionally, we show that the DLN can be used to learn a quantile function that is well calibrated and a useful measure of uncertainty. The quantile function provides an improved mapping of data-driven similarity representations to human-interpretable scales, such as quantile intervals for predictions instead of point estimates.
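
The two ideas in the abstract, a monotonic mapping from similarity to MOS and a learned quantile function, can be illustrated with a minimal 1-D sketch in Python. This is not the authors' DLN and not ViSQOL's mapping. The function names (`monotonic_calibrator`, `pinball_loss`), the keypoints, and all numeric values are hypothetical, chosen only for illustration. The sketch assumes that monotonicity is enforced by building a piecewise-linear function from non-negative increments, and that quantiles are scored with the standard pinball (quantile) loss; the paper may use different constructions.

```python
from bisect import bisect_right


def monotonic_calibrator(keypoints, increments, start):
    """Return a non-decreasing piecewise-linear function f(x).

    keypoints:  strictly increasing input locations (e.g. similarity values).
    increments: non-negative output steps between consecutive keypoints.
    start:      output value at the first keypoint (e.g. the lowest MOS).
    All arguments are hypothetical illustration parameters.
    """
    if len(increments) != len(keypoints) - 1:
        raise ValueError("need one increment per keypoint interval")
    if any(d < 0 for d in increments):
        raise ValueError("increments must be non-negative for monotonicity")

    # Cumulative sums of non-negative steps can never decrease.
    outputs = [start]
    for d in increments:
        outputs.append(outputs[-1] + d)

    def f(x):
        # Clamp outside the keypoint range.
        if x <= keypoints[0]:
            return outputs[0]
        if x >= keypoints[-1]:
            return outputs[-1]
        # Linear interpolation inside the interval containing x.
        i = bisect_right(keypoints, x) - 1
        x0, x1 = keypoints[i], keypoints[i + 1]
        t = (x - x0) / (x1 - x0)
        return outputs[i] + t * (outputs[i + 1] - outputs[i])

    return f


def pinball_loss(y_true, y_pred, tau):
    """Mean pinball (quantile) loss for quantile level tau in (0, 1).

    Under-prediction is weighted by tau and over-prediction by (1 - tau),
    so minimising it pushes y_pred toward the tau-quantile of y_true.
    """
    if not 0.0 < tau < 1.0:
        raise ValueError("tau must lie strictly between 0 and 1")
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


# Hypothetical mapping from a similarity score in [0, 1] to MOS in [1, 5].
to_mos = monotonic_calibrator([0.0, 0.5, 1.0], [1.0, 3.0], start=1.0)
```

In this sketch, `to_mos` can never assign a lower MOS to a higher similarity score, which is the property an unconstrained regressor does not guarantee. With `tau=0.5` the pinball loss is half the mean absolute error, so its minimiser is the median; predicting, for example, the 0.05 and 0.95 quantiles would give an interval rather than a point estimate.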

MeSH terms

Humans
Neural Networks, Computer
Speech Perception*
Speech*
Uncertainty