NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality

Xu Tan; Jiawei Chen; Haohe Liu; Jian Cong; Chen Zhang; Yanqing Liu; Xi Wang; Yichong Leng; Yuanhao Yi; Lei He; Sheng Zhao; Tao Qin; Frank Soong; Tie-Yan Liu

doi:10.1109/TPAMI.2024.3356232

IEEE Trans Pattern Anal Mach Intell. 2024 Jun;46(6):4234-4245. doi: 10.1109/TPAMI.2024.3356232. Epub 2024 May 7.

PMID: 38241115

Abstract

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of a subjective measure and introducing appropriate guidelines for judging it, and then developing a TTS system called NaturalSpeech that achieves human-level quality on benchmark datasets. Specifically, we leverage a variational auto-encoder (VAE) for end-to-end text-to-waveform generation, with several key modules that enhance the capacity of the prior from text and reduce the complexity of the posterior from speech: phoneme pre-training, differentiable duration modeling, bidirectional prior/posterior modeling, and a memory mechanism in the VAE. Experimental evaluations on the popular LJSpeech dataset show that NaturalSpeech achieves -0.01 CMOS (comparative mean opinion score) relative to human recordings at the sentence level, with a Wilcoxon signed-rank test giving p >> 0.05, which demonstrates, for the first time, no statistically significant difference from human recordings.
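
The evaluation criterion in the abstract (a CMOS close to zero plus a Wilcoxon signed-rank test with p well above 0.05) can be illustrated with a minimal sketch. Everything below is an assumption for illustration and is not the authors' evaluation code: the function names `cmos` and `wilcoxon_signed_rank` are hypothetical, the ratings are made-up integers on an assumed -3 to +3 comparative scale, and the p-value uses the large-sample normal approximation with a tie correction rather than an exact test.

```python
import math
from statistics import mean


def cmos(ratings):
    """Comparative mean opinion score: the mean of paired comparative ratings.

    Each rating compares a synthesized sentence with its human recording
    (negative = synthesized judged worse, positive = judged better).
    """
    return mean(ratings)


def wilcoxon_signed_rank(ratings):
    """Two-sided Wilcoxon signed-rank test against a median of zero.

    Assumptions of this sketch: zero ratings are dropped, tied magnitudes
    share their average rank, and the p-value comes from the normal
    approximation with the usual tie correction to the variance.
    Returns (W_plus, p_value).
    """
    nonzero = sorted((r for r in ratings if r != 0), key=abs)
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0  # no informative pairs: no evidence of a difference

    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        # Find the run of equal magnitudes starting at position i.
        j = i
        while j + 1 < n and abs(nonzero[j + 1]) == abs(nonzero[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        run_length = j - i + 1
        tie_term += run_length ** 3 - run_length
        i = j + 1

    # Sum of ranks belonging to positive ratings.
    w_plus = sum(rank for rank, r in zip(ranks, nonzero) if r > 0)
    expected = n * (n + 1) / 4
    variance = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - expected) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return w_plus, p_value


if __name__ == "__main__":
    # Hypothetical ratings, invented purely for illustration.
    ratings = [0, 1, -1, 0, -2, 2, 0, 1, -1, 0, 0, -1]
    score = cmos(ratings)
    w_plus, p_value = wilcoxon_signed_rank(ratings)
    print(f"CMOS = {score:.3f}, W+ = {w_plus}, p = {p_value:.3f}")
    print("no significant difference" if p_value > 0.05 else "significant difference")
```

Under this reading, a system would be judged to reach human-level quality when the test fails to reject the null hypothesis at the 0.05 level; a real evaluation would use the actual listener ratings and, for small samples, an exact test.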

Publication types

Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

MeSH terms

Algorithms*
Databases, Factual
Humans
Natural Language Processing
Signal Processing, Computer-Assisted
Sound Spectrography / methods
Speech / physiology