Robot gaining robust pouring skills through fusing vision and audio

ISA Trans. 2023 Apr:135:428-437. doi: 10.1016/j.isatra.2022.09.022. Epub 2022 Sep 17.

Abstract

In pouring tasks performed by service robots, robust and accurate estimation of the liquid height is a crucial step. However, neither vision nor audio alone provides a sufficiently robust and accurate estimate. We therefore propose a visual-audio information fusion network that gives robots reliable pouring skills, using visual and audio information as its two sources. First, visual features are extracted by an attention-based residual network. Second, the Fourier characteristic matrix of the audio signal is obtained by fast Fourier transform (FFT), and audio features are then extracted from it by a long short-term memory (LSTM) network. Third, the visual and audio features are fused by a fully connected network that outputs the liquid height and the state of the cup. Finally, a sinusoidal and transient fusion control method is proposed; it takes the liquid height and cup state as inputs and outputs the gripper angle, providing a way to carry out the pouring task. Experiments evaluate the performance of the multimodal information fusion method and verify the effectiveness of the algorithm for pouring tasks of service robots.
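
As a rough illustration of the audio step, the sketch below builds one plausible form of a Fourier feature matrix: each row holds the FFT magnitudes of one short, Hann-windowed frame, so the rows form the kind of sequence an LSTM could consume. The abstract does not specify how the matrix is constructed, so the framing scheme, the window, the function names `fft` and `fourier_feature_matrix`, the parameters `frame_len` and `hop`, and the synthetic 1 kHz test tone are all assumptions made for this example, not the authors' implementation.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("frame length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fourier_feature_matrix(signal, frame_len=256, hop=128):
    """Return one row of FFT magnitudes (bins 0..frame_len/2) per frame.

    Assumed design: overlapping frames with a Hann window. The source
    abstract does not state these details.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / frame_len)
              for i in range(frame_len)]
    rows = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        spectrum = fft(frame)
        # Keep only the non-negative frequencies of the real-valued signal.
        rows.append([abs(c) for c in spectrum[: frame_len // 2 + 1]])
    return rows


# Synthetic stand-in for a pouring recording: a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(1024)]

features = fourier_feature_matrix(tone)
peak_bin = max(range(len(features[0])), key=lambda k: features[0][k])
peak_hz = peak_bin * sample_rate / 256  # bin index times frequency resolution
```

Here `features` has one row per frame and one column per frequency bin, and `peak_hz` recovers the frequency of the test tone. In practice a library FFT would replace the hand-written `fft`; it is written out here only to keep the sketch free of dependencies.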

Keywords: Pouring skills; Pouring task; Service robots; Visual–audio information.