Deep-learning-based segmentation of the vocal tract and articulators in real-time magnetic resonance images of speech

Matthieu Ruthven; Marc E Miquel; Andrew P King

doi:10.1016/j.cmpb.2020.105814

Comput Methods Programs Biomed. 2021 Jan:198:105814. doi: 10.1016/j.cmpb.2020.105814. Epub 2020 Oct 26.

Authors

Matthieu Ruthven¹, Marc E Miquel², Andrew P King³

Affiliations

¹ Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom; School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London SE1 7EH, United Kingdom. Electronic address: matthieuruthven@nhs.net.
² Clinical Physics, Barts Health NHS Trust, West Smithfield, London EC1A 7BE, United Kingdom; Centre for Advanced Cardiovascular Imaging, NIHR Barts Biomedical Research Centre, William Harvey Institute, Queen Mary University of London, London EC1M 6BQ, United Kingdom.
³ School of Biomedical Engineering & Imaging Sciences, King's College London, King's Health Partners, St Thomas' Hospital, London SE1 7EH, United Kingdom.

Abstract

Background and objective: Magnetic resonance (MR) imaging is increasingly used in studies of speech as it enables non-invasive visualisation of the vocal tract and articulators, thus providing information about their shape, size, motion and position. This information is extracted for quantitative analysis using segmentation. Methods have been developed to segment the vocal tract; however, none of these also fully segment any articulators. The objective of this work was to develop a method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech, thus overcoming the limitations of existing methods.

Methods: Five speech MR image sets (392 MR images in total), each of a different healthy adult volunteer, were used in this work. A fully convolutional network with an architecture similar to the original U-Net was developed to segment the following six regions in the image sets: the head, soft palate, jaw, tongue, vocal tract and tooth space. Five-fold cross-validation was performed to investigate the segmentation accuracy and generalisability of the network. The segmentation accuracy was assessed using standard segmentation metrics (the overlap-based Dice coefficient and the distance-based general Hausdorff distance) and a novel clinically relevant metric based on velopharyngeal closure.
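As a rough illustration of the evaluation described above, the following Python sketch computes a per-region Dice coefficient, 2|A∩B| / (|A| + |B|), between a predicted and a ground-truth label map, and builds cross-validation folds. It is not the authors' code. The label numbering in REGIONS, the function names and the volunteer identifiers are hypothetical, and the assumption that each of the five folds holds out one volunteer's image set is an inference from the abstract (five image sets, five folds), not something it states.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label indices for the six regions named in the abstract;
# the numbering is an assumption made only for this example (0 = background).
REGIONS: Dict[int, str] = {
    1: "head", 2: "soft palate", 3: "jaw",
    4: "tongue", 5: "vocal tract", 6: "tooth space",
}


def dice_per_region(
    pred: Sequence[Sequence[int]],
    truth: Sequence[Sequence[int]],
    labels: Sequence[int],
) -> Dict[int, float]:
    """Return the Dice coefficient for each label in two same-sized 2D label maps."""
    scores: Dict[int, float] = {}
    for label in labels:
        overlap = pred_count = truth_count = 0
        for pred_row, truth_row in zip(pred, truth):
            for p, t in zip(pred_row, truth_row):
                pred_count += int(p == label)
                truth_count += int(t == label)
                overlap += int(p == label and t == label)
        total = pred_count + truth_count
        # Dice is undefined when the region is absent from both maps.
        scores[label] = 2 * overlap / total if total else float("nan")
    return scores


def leave_one_subject_out(subjects: Sequence[str]) -> List[Tuple[List[str], str]]:
    """Build (training subjects, held-out subject) folds, one fold per subject."""
    return [
        ([s for s in subjects if s != held_out], held_out)
        for held_out in subjects
    ]


if __name__ == "__main__":
    # Toy 2x3 label maps, not real data.
    predicted = [[1, 1, 0], [4, 4, 0]]
    reference = [[1, 0, 0], [4, 4, 4]]
    for label, score in dice_per_region(predicted, reference, [1, 4]).items():
        print(REGIONS[label], round(score, 3))

    for train, test in leave_one_subject_out(["v1", "v2", "v3", "v4", "v5"]):
        print("train:", train, "test:", test)
```

Splitting by volunteer rather than by image keeps every image of the held-out speaker out of training, which is what makes a fold a test of generalisation to an unseen person.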

Results: The segmentations created by the method had a median Dice coefficient of 0.92 and a median general Hausdorff distance of 5 mm. The method segmented the head most accurately (median Dice coefficient of 0.99), and the soft palate and tooth space least accurately (median Dice coefficients of 0.92 and 0.93, respectively). The segmentations created by the method correctly showed 90% (27 out of 30) of the velopharyngeal closures in the MR image sets.

Conclusions: An automatic method to fully segment multiple groups of articulators as well as the vocal tract in two-dimensional MR images of speech was successfully developed. The method is intended for use in clinical and non-clinical speech studies which involve quantitative analysis of the shape, size, motion and position of the vocal tract and articulators. In addition, a novel clinically relevant metric for assessing the accuracy of vocal tract and articulator segmentation methods was developed.

Keywords: Articulators; Convolutional neural networks; Dynamic magnetic resonance imaging; Segmentation; Speech; Vocal tract.

MeSH terms

Adult
Deep Learning*
Dental Articulators*
Humans
Image Processing, Computer-Assisted
Magnetic Resonance Imaging
Speech
