Streaming cascade-based speech translation leveraged by a direct segmentation model

Javier Iranzo-Sánchez; Javier Jorge; Pau Baquero-Arnal; Joan Albert Silvestre-Cerdà; Adrià Giménez; Jorge Civera; Albert Sanchis; Alfons Juan

doi:10.1016/j.neunet.2021.05.013

Neural Netw. 2021 Oct:142:303-315. doi: 10.1016/j.neunet.2021.05.013. Epub 2021 May 17.

Authors

Javier Iranzo-Sánchez¹, Javier Jorge¹, Pau Baquero-Arnal¹, Joan Albert Silvestre-Cerdà¹, Adrià Giménez¹, Jorge Civera², Albert Sanchis¹, Alfons Juan¹

Affiliations

¹ Machine Learning and Language Processing Group, Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain.
² Machine Learning and Language Processing Group, Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, Camí de Vera s/n, 46022 València, Spain. Electronic address: jorcisai@vrain.upv.es.

PMID: 34082286
DOI: 10.1016/j.neunet.2021.05.013

Abstract

The cascade approach to Speech Translation (ST) is a pipeline in which an Automatic Speech Recognition (ASR) system is followed by a Machine Translation (MT) system. Current state-of-the-art ST systems are built on deep neural networks designed for an offline setup, in which the audio to be translated is fully available in advance. A streaming setup is very different: an unbounded audio input becomes available gradually, and the translation must be generated at the same time under real-time constraints. In this work, we present a state-of-the-art streaming ST system in which the training and decoding procedures of the neural models in the ASR and MT components are carefully adapted to run in a streaming setup. In addition, we introduce a direct segmentation model that adapts the continuous ASR output to the capacity of simultaneous MT systems trained at the sentence level, in order to guarantee low latency while preserving the translation quality of the complete ST system. The resulting ST system is thoroughly evaluated on the real-life streaming Europarl-ST benchmark to gauge the trade-off between quality and latency for each component individually as well as for the complete ST system.
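
The data flow of a streaming cascade (ASR output, then segmentation, then MT) can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the system described in the abstract: the names `toy_asr`, `toy_boundary`, `toy_mt`, `segment_stream`, `cascade` and the `max_len` parameter are all hypothetical, the "ASR" simply passes through pre-tokenized words, the boundary rule is a punctuation check standing in for a learned segmentation model, and the "MT" step merely upper-cases its input.

```python
from typing import Callable, Iterable, Iterator, List


def segment_stream(
    words: Iterable[str],
    is_boundary: Callable[[List[str]], bool],
    max_len: int = 20,
) -> Iterator[List[str]]:
    """Group an unbounded word stream into sentence-like chunks.

    `is_boundary` is a hypothetical stand-in for a segmentation model;
    `max_len` is an assumed cap that bounds the wait before a chunk is emitted.
    """
    chunk: List[str] = []
    for word in words:
        chunk.append(word)
        if is_boundary(chunk) or len(chunk) >= max_len:
            yield chunk
            chunk = []  # start a fresh list so emitted chunks are not mutated
    if chunk:  # flush whatever is left when the stream ends
        yield chunk


def cascade(
    frames: Iterable[str],
    asr: Callable[[Iterable[str]], Iterator[str]],
    is_boundary: Callable[[List[str]], bool],
    mt: Callable[[List[str]], str],
) -> Iterator[str]:
    """Lazily chain ASR, segmentation and MT over an incoming stream."""
    for chunk in segment_stream(asr(frames), is_boundary):
        yield mt(chunk)


# Toy placeholders (assumptions for illustration only).
def toy_asr(frames: Iterable[str]) -> Iterator[str]:
    # Pretend each incoming "frame" is already a recognized word.
    yield from frames


def toy_boundary(chunk: List[str]) -> bool:
    # Placeholder rule: close the chunk when the last word ends with a period.
    return chunk[-1].endswith(".")


def toy_mt(chunk: List[str]) -> str:
    # Placeholder "translation": upper-case the chunk.
    return " ".join(chunk).upper()


if __name__ == "__main__":
    stream = ["hello", "world.", "how", "are", "you"]
    for translated in cascade(stream, toy_asr, toy_boundary, toy_mt):
        print(translated)
```

Because every stage is a generator, each chunk is translated as soon as the segmenter closes it instead of waiting for the whole input, which is the property that distinguishes the streaming setup from the offline one.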

Keywords: Segmentation Model; Streaming Cascade Speech Translation.

MeSH terms

Language
Neural Networks, Computer*
Speech Recognition Software
Speech*