Unsupervised joint prosody labeling and modeling for Mandarin speech

Chen-Yu Chiang; Sin-Horng Chen; Hsiu-Min Yu; Yih-Ru Wang

doi:10.1121/1.3056559

J Acoust Soc Am. 2009 Feb;125(2):1164-83. doi: 10.1121/1.3056559.

Authors

Chen-Yu Chiang¹, Sin-Horng Chen, Hsiu-Min Yu, Yih-Ru Wang

Affiliation

¹ Department of Communication Engineering, National Chiao Tung University, Hsinchu 300, Taiwan, Republic of China. gene.cm91g@nctu.edu.tw

PMID: 19206890

Abstract

An unsupervised joint prosody labeling and modeling method for Mandarin speech is proposed. It is a new scheme intended to construct statistical prosodic models and to label prosodic tags consistently. Two types of prosodic tags are determined by four prosodic models designed to illustrate the hierarchy of Mandarin prosody: the break at a syllable juncture, which demarcates prosodic constituents, and the prosodic state, which represents the pitch-level variation of any prosodic domain that results from the influence of its upper-layer prosodic constituents. The method was evaluated on an unlabeled read-speech corpus uttered by an experienced female announcer. Experimental results showed that the estimated parameters of the four prosodic models were able to reveal and describe the structures and patterns of Mandarin prosody. In addition, correspondences were found between the labeled break indices and their associated words. These correspondences show connections between prosodic and linguistic parameters and further verify the capability of the method. Finally, a quantitative comparison of labeling results between the proposed method and human labelers indicated that the former was more consistent and discriminative than the latter in prosodic feature distributions, an advantage for applications of prosody modeling.

Publication types

Comparative Study
Evaluation Study

MeSH terms

Algorithms
Cues*
Female
Humans
Language*
Models, Statistical*
Pattern Recognition, Physiological
Pitch Perception
Speech Acoustics*
Speech Perception*