Research of online speaker diarization algorithms for the task of automatic speech recognition

Anton V. Polevoi; Natalia V. Loukachevitch

https://doi.org/10.26089/NumMet.v27r213


Keywords:

online speaker diarization

Russian speech

PSDA

streaming processing

speech embeddings

PCA

VAD

Abstract

Despite advances in automatic speech recognition, there remain many scenarios (spontaneous speech, overlapping speakers, interruptions, etc.) in which modern systems behave unstably and fail to convey the intended meaning of an utterance accurately. Speaker diarization, which splits an audio stream into segments corresponding to individual speakers, remains one of the most difficult and relevant problems in real-time speech processing. Long multi-speaker recordings, typical of forums, corporate events, and panel discussions, are particularly challenging. Existing streaming solutions often limit the number of speakers, require substantial computing resources, or introduce a significant delay when processing the audio stream. This paper presents a cascaded algorithm for streaming speech recognition with online diarization, consisting of a voice activity detector (VAD), an embedding extraction model, and online clustering based on Probabilistic Spherical Discriminant Analysis (PSDA). To increase reliability in streaming mode, stability heuristics are proposed that reduce the number of false speaker switches and ensure consistent model behavior under a limited temporal context. On the authors' own Russian-language multi-speaker dataset, the proposed cascade algorithm significantly outperforms existing systems for online speaker diarization. The effectiveness of cascade approaches to streaming diarization under limited computing resources is also demonstrated.
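
The clustering and stabilization stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: PSDA scoring is replaced here by a plain cosine-similarity threshold against running speaker centroids, and the stability heuristic is assumed to be a simple rule that reports a speaker switch only after several consecutive segments agree. The class name `OnlineSpeakerClusterer` and the parameters `new_speaker_threshold` and `min_switch_votes` are hypothetical, and the embeddings are toy vectors standing in for the output of a VAD plus an embedding model.

```python
import numpy as np


class OnlineSpeakerClusterer:
    """Toy online clusterer (assumption: cosine threshold stands in for PSDA)."""

    def __init__(self, new_speaker_threshold=0.5, min_switch_votes=2):
        self.threshold = new_speaker_threshold    # below this, open a new speaker
        self.min_switch_votes = min_switch_votes  # segments needed to confirm a switch
        self.sums = []        # per-speaker sum of unit-length embeddings
        self.current = None   # speaker label currently reported
        self.pending = None   # candidate speaker awaiting confirmation
        self.votes = 0        # consecutive segments supporting the candidate

    def _raw_assign(self, emb):
        """Assign one embedding to the closest centroid or create a new speaker."""
        emb = emb / np.linalg.norm(emb)
        if self.sums:
            sims = [float(emb @ (s / np.linalg.norm(s))) for s in self.sums]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.sums[best] = self.sums[best] + emb  # update running centroid
                return best
        self.sums.append(emb.copy())
        return len(self.sums) - 1

    def update(self, emb):
        """Process one speech segment embedding and return the stabilized label."""
        raw = self._raw_assign(np.asarray(emb, dtype=float))
        if self.current is None or raw == self.current:
            self.current, self.pending, self.votes = raw, None, 0
            return self.current
        # The raw label disagrees with the reported speaker: count votes.
        if raw == self.pending:
            self.votes += 1
        else:
            self.pending, self.votes = raw, 1
        if self.votes >= self.min_switch_votes:
            self.current, self.pending, self.votes = raw, None, 0
        return self.current


# Toy stream: speaker A, one isolated segment of B, then a sustained turn by B.
a, b = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
stream = [a, a, b, a, b, b, b]

clusterer = OnlineSpeakerClusterer()
labels = [clusterer.update(e) for e in stream]
print(labels)
```

In this sketch the isolated third segment does not change the reported speaker, while the sustained turn at the end does, at the cost of a one-segment delay. The same trade-off between switch stability and latency is what any such heuristic has to balance in streaming mode.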

Downloads

PDF (Russian)

Published

2026-04-24

Issue

Vol. 27 (2026): Issue 2.

Section

Methods and algorithms of computational mathematics and their applications

Authors

Anton V. Polevoi

Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics

• PhD Student

Natalia V. Loukachevitch

Lomonosov Moscow State University, Faculty of Computational Mathematics and Cybernetics

• Leading Researcher


License

This work is licensed under a Creative Commons Attribution 4.0 International License.
