End-to-end neural speaker diarization with an iterative adaptive attractor estimation

Neural Netw. 2023 Sep:166:566-578. doi: 10.1016/j.neunet.2023.07.043. Epub 2023 Aug 1.

Abstract

End-to-end neural diarization (EEND), which can directly output speaker diarization results and handle overlapping speech, has attracted increasing attention due to its promising performance. Although existing EEND-based methods often outperform clustering-based methods, they do not generalize well to unseen test sets because fixed attractors are typically used to estimate the speech activity of each speaker. An iterative adaptive attractor estimation (IAAE) network was proposed to refine diarization results, in which self-attentive EEND (SA-EEND) was used to initialize the diarization results and frame-wise embeddings. The proposed IAAE network consists of two main parts: an attention-based pooling designed to obtain a rough estimate of the attractors from the diarization results of the previous iteration, and an adaptive attractor then computed using transformer decoder blocks. A unified training framework was proposed to further improve diarization performance by making the embeddings more discriminable with respect to the well-separated attractors. We evaluated the proposed method on both simulated mixtures and the real CALLHOME dataset using the diarization error rate (DER). The proposed method yields relative DER reductions of up to 44.8% on simulated 2-speaker mixtures and 23.6% on the CALLHOME dataset over the baseline SA-EEND at the 2nd iteration step. We also demonstrated that, as the number of refinement steps increases, the DER on the CALLHOME dataset can be further reduced to 7.36%, achieving state-of-the-art diarization results compared with other methods.

Keywords: Adaptive attractor estimation; End-to-end; Iterative refinement; Speaker diarization; Unified training.

MeSH terms

  • Cluster Analysis
  • Speech*