Objective speech intelligibility prediction using a deep learning model with continuous speech-evoked cortical auditory responses

Youngmin Na; Hyosung Joo; Le Thi Trang; Luong Do Anh Quan; Jihwan Woo

doi:10.3389/fnins.2022.906616

Front Neurosci. 2022 Aug 18;16:906616. doi: 10.3389/fnins.2022.906616. eCollection 2022.

Authors

Youngmin Na¹, Hyosung Joo², Le Thi Trang², Luong Do Anh Quan², Jihwan Woo¹ ²

Affiliations

¹ Department of Biomedical Engineering, University of Ulsan, Ulsan, South Korea.
² Department of Electrical, Electronic and Computer Engineering, University of Ulsan, Ulsan, South Korea.

Abstract

Auditory prostheses provide an opportunity for rehabilitation of hearing-impaired patients. Speech intelligibility can be used to estimate the extent to which an auditory prosthesis improves the user's speech comprehension. Although behavior-based speech intelligibility testing is the gold standard, precise evaluation is limited by its subjectivity. Here, we used a convolutional neural network to predict speech intelligibility from electroencephalography (EEG). Sixty-four-channel EEGs were recorded from 87 adult participants with normal hearing. Sentences spectrally degraded by a 2-, 3-, 4-, 5-, or 8-channel vocoder were used to create relatively low speech intelligibility conditions, and intelligibility was measured with a Korean sentence recognition test. The speech intelligibility scores were divided into 41 discrete levels ranging from 0 to 100% in steps of 2.5%. Three scores, namely 30.0, 37.5, and 40.0%, were not collected. Two speech features, the speech temporal envelope (ENV) and phoneme (PH) onset, were used to extract continuous-speech EEG responses for speech intelligibility prediction. The deep learning model was trained on a dataset of event-related potentials (ERPs), or of correlation coefficients between the ERPs and the ENVs, between the ERPs and PH onset, or between the ERPs and the product of PH and ENV (PHENV). The speech intelligibility prediction accuracies were 97.33% (ERP), 99.42% (ENV), 99.55% (PH), and 99.91% (PHENV). The models were interpreted using the occlusion sensitivity approach. According to the occlusion sensitivity maps, the informative electrodes of the ENV model were located in the occipital area, whereas those of the phoneme models (PH and PHENV) were located in the language processing area. Of the models tested, the PHENV model obtained the best speech intelligibility prediction accuracy. This model may promote clinical prediction of speech intelligibility with a comfortable speech intelligibility test.
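
The score grid and the PHENV feature described in the abstract can be illustrated with a short Python sketch. It is not the authors' pipeline. The function names `phenv_feature` and `channel_correlations`, the use of a zero-lag Pearson coefficient, the sample count, and the synthetic signals are all assumptions made for illustration, since the abstract does not say how the correlation coefficients were computed.

```python
import numpy as np

# Score grid from the abstract: 0 to 100% in 2.5% steps gives 41 levels.
levels = np.linspace(0.0, 100.0, 41)
# Three levels were not collected, which leaves 38 observed classes.
not_collected = np.array([30.0, 37.5, 40.0])
observed_levels = levels[~np.isin(levels, not_collected)]


def phenv_feature(ph_onset, env):
    """Element-wise product of a phoneme-onset train and the envelope.

    Hypothetical helper: the abstract only states that PHENV is the
    product of PH and ENV.
    """
    return np.asarray(ph_onset, dtype=float) * np.asarray(env, dtype=float)


def channel_correlations(eeg, feature):
    """Zero-lag Pearson r between each EEG channel (row) and one feature.

    Assumption: the abstract does not specify the correlation type or
    any lag handling; this uses a plain Pearson coefficient.
    """
    eeg = np.asarray(eeg, dtype=float)
    feature = np.asarray(feature, dtype=float)
    eeg_c = eeg - eeg.mean(axis=1, keepdims=True)
    feat_c = feature - feature.mean()
    denom = np.linalg.norm(eeg_c, axis=1) * np.linalg.norm(feat_c)
    return (eeg_c @ feat_c) / denom


# Synthetic demonstration data (not taken from the study).
rng = np.random.default_rng(0)
n_channels, n_samples = 64, 1000
env = np.abs(rng.standard_normal(n_samples))             # stand-in envelope
ph_onset = (rng.random(n_samples) < 0.05).astype(float)  # stand-in onset train
phenv = phenv_feature(ph_onset, env)

eeg = rng.standard_normal((n_channels, n_samples))
eeg[0] = 2.0 * phenv + 1.0  # channel 0 is an affine copy of the feature

r = channel_correlations(eeg, phenv)  # one coefficient per channel
```

In this toy setup, channel 0 is built as an affine copy of the feature, so its coefficient is close to 1, while the noise channels stay near 0. A vector like `r`, with one value per electrode, is the kind of input a classifier could be given, though the actual input layout used in the study is not stated in the abstract.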

Keywords: EEG; continuous speech; deep-learning; occlusion sensitivity; speech intelligibility.