Speech Recognition for the iCub Platform

Bertrand Higy; Alessio Mereta; Giorgio Metta; Leonardo Badino

doi:10.3389/frobt.2018.00010

Front Robot AI. 2018 Feb 12;5:10. doi: 10.3389/frobt.2018.00010. eCollection 2018.

Authors

Bertrand Higy¹ ², Alessio Mereta³, Giorgio Metta¹, Leonardo Badino⁴

Affiliations

¹ iCub Facility, Istituto Italiano di Tecnologia, Genoa, Italy.
² Università di Genova, Genoa, Italy.
³ Advanced Concepts Team, European Space Agency, Noordwijk, Netherlands.
⁴ Center for Translational Neurophysiology of Speech and Communication, Istituto Italiano di Tecnologia, Ferrara, Italy.

Abstract

This paper describes open-source software (available at https://github.com/robotology/natural-speech) to build automatic speech recognition (ASR) systems and run them within the YARP platform. The toolkit is designed (i) to allow non-ASR experts to easily create their own ASR system and run it on iCub and (ii) to build deep learning-based models that specifically address the main challenges an ASR system faces in the context of verbal human-iCub interactions. The toolkit mostly consists of Python and C++ code and shell scripts integrated in YARP. As an additional contribution, a second codebase (written in Matlab) is provided for more expert ASR users who want to experiment with bio-inspired and developmental learning-inspired ASR systems. Specifically, we provide code for two distinct kinds of speech recognition: "articulatory" and "unsupervised" speech recognition. The first is largely inspired by influential neurobiological theories of speech perception, which assume that speech perception is mediated by activity in the brain's motor cortex. Our articulatory systems have been shown to outperform strong deep learning-based baselines. The second kind, the "unsupervised" systems, does not use any supervised information (unlike most ASR systems, including our articulatory systems). To some extent, these systems mimic an infant who has to discover the basic speech units of a language by herself. In addition, we provide resources consisting of pre-trained deep learning models for ASR and a 2.5-h dataset of spoken commands, the VoCub dataset, which can be used to adapt an ASR system to the typical acoustic environments in which iCub operates.

Keywords: automatic speech recognition; code:C++; code:matlab; code:python; tensorflow; yarp.