Improve word embedding using both writing and pronunciation

PLoS One. 2018 Dec 10;13(12):e0208785. doi: 10.1371/journal.pone.0208785. eCollection 2018.

Abstract

Text representation maps text into a vector space for subsequent numerical calculation and processing, and word embedding is an important component of text representation. Most existing word embedding models focus on writing and exploit context, weighting, dependency, morphology, etc., to optimize training. From a linguistic point of view, however, spoken language is a more direct expression of semantics; writing has meaning only as a recording of spoken language. This paper therefore proposes a pronunciation-enhanced word embedding model (PWE) that integrates speech information into training so that both speech and writing contribute to the representation of meaning. Using Chinese, English, and Spanish as examples, it presents several models that integrate word pronunciation characteristics into word embedding. Word similarity and text classification experiments show that the PWE outperforms baseline models that do not include speech information. Language is a storehouse of sound-images; therefore, the PWE can be applied to most languages.
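As a concrete illustration of the idea, the following is a minimal sketch, not the authors' released code, of one way to fold pronunciation into a standard skip-gram model: each written token is paired with a pronunciation pseudo-token so that both signals share one context window during training. The use of gensim's Word2Vec and pypinyin for Chinese pronunciation is an assumption, since the abstract does not specify the tooling, and the paper describes several integration variants of which this is only the simplest.

    # Sketch of pronunciation-augmented skip-gram training (assumed tooling:
    # gensim for Word2Vec, pypinyin for Chinese pronunciation).
    from gensim.models import Word2Vec
    from pypinyin import lazy_pinyin

    corpus = [
        ["我们", "使用", "自然", "语言"],
        ["语言", "是", "声音", "形象", "的", "仓库"],
    ]

    def with_pronunciation(sentence):
        # Emit each written word followed by a pronunciation pseudo-token,
        # e.g. "语言" -> ["语言", "pron:yu-yan"], so writing and speech
        # appear in the same context window during training.
        tokens = []
        for word in sentence:
            tokens.append(word)
            tokens.append("pron:" + "-".join(lazy_pinyin(word)))
        return tokens

    augmented = [with_pronunciation(s) for s in corpus]
    model = Word2Vec(augmented, vector_size=50, window=4, min_count=1, sg=1)
    print(model.wv["语言"][:5])  # vector trained on writing plus pronunciation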

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms
  • Humans
  • Models, Theoretical
  • Natural Language Processing
  • Phonetics*
  • Semantics*
  • Speech*
  • Vocabulary*
  • Writing*

Grants and funding

This work was supported by the National Natural Science Foundation of China (Nos. 61572434 and 91630206) and the Shanghai Science and Technology Committee (No. 16DZ2293600). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.