https://doi.org/10.26089/NumMet.v27r213

Research of online speaker diarization algorithms for the task of automatic speech recognition

Authors

  • Anton V. Polevoi
  • Natalia V. Loukachevitch

Keywords:

online speaker diarization
Russian speech
PSDA
streaming processing
speech embeddings
PCA
VAD

Abstract

Despite advances in automatic speech recognition, there are many scenarios (spontaneous speech, overlapping speakers, interruptions, etc.) in which modern systems perform unreliably and fail to convey the intended meaning of an utterance accurately. Speaker diarization, which splits an audio stream into segments corresponding to individual speakers, remains one of the most difficult and relevant problems in real-time speech processing. Long multi-speaker recordings, typical of forums, corporate events, and panel discussions, are particularly challenging. Existing streaming solutions often limit the number of speakers, require substantial computing resources, or introduce a significant delay in processing the audio stream. This paper presents a cascaded algorithm for streaming speech recognition with online diarization, consisting of a voice activity detector (VAD), an embedding extraction model, and online clustering based on Probabilistic Spherical Discriminant Analysis (PSDA). To increase reliability in streaming mode, stability heuristics are proposed that reduce the number of false speaker switches and keep model behavior consistent under a limited temporal context. On the authors' own Russian-language multi-speaker dataset, the proposed cascade algorithm for online speaker diarization significantly outperforms existing systems. The effectiveness of cascade approaches for streaming diarization under limited computing resources is also demonstrated.
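
The last stage of such a cascade can be illustrated with a minimal sketch. It is not the authors' implementation: PSDA scoring is replaced here by a plain cosine-similarity threshold on length-normalized embeddings, the VAD and the embedding extractor are assumed to run upstream and deliver one embedding per speech segment, and a simple "patience" rule stands in for the stability heuristics, which the abstract does not specify. The class name `OnlineClusterer` and the parameters `new_speaker_threshold` and `switch_patience` are hypothetical, as are their default values.

```python
import math
from dataclasses import dataclass, field


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class OnlineClusterer:
    # Both values are illustrative assumptions, not taken from the paper.
    new_speaker_threshold: float = 0.5  # below this similarity, open a new cluster
    switch_patience: int = 2            # consecutive segments needed to switch speaker
    centroids: list = field(default_factory=list)  # unit-length running means
    counts: list = field(default_factory=list)     # segments seen per cluster
    current: int = -1      # speaker label currently reported
    pending: int = -1      # candidate label waiting for confirmation
    pending_run: int = 0   # how many segments in a row voted for `pending`

    def _assign(self, emb):
        """Return the nearest cluster index, creating a cluster if none is close."""
        if self.centroids:
            sims = [cosine(emb, c) for c in self.centroids]
            best = max(range(len(sims)), key=sims.__getitem__)
            if sims[best] >= self.new_speaker_threshold:
                n = self.counts[best]
                mean = [(n * c + e) / (n + 1)
                        for c, e in zip(self.centroids[best], emb)]
                self.centroids[best] = normalize(mean)
                self.counts[best] = n + 1
                return best
        self.centroids.append(emb)
        self.counts.append(1)
        return len(self.centroids) - 1

    def step(self, embedding):
        """Consume one segment embedding and return the smoothed speaker label."""
        raw = self._assign(normalize(embedding))
        if self.current == -1 or raw == self.current:
            self.current = raw
            self.pending, self.pending_run = -1, 0
            return self.current
        if raw == self.pending:
            self.pending_run += 1
        else:
            self.pending, self.pending_run = raw, 1
        # Switch only after enough consecutive segments agree on the new speaker.
        if self.pending_run >= self.switch_patience:
            self.current = raw
            self.pending, self.pending_run = -1, 0
        return self.current
```

Calling `step` once per VAD segment yields a label stream in which a single outlying segment does not trigger a speaker change. The cost is a delay of `switch_patience - 1` segments before a genuine switch is reported, which is the usual trade-off between stability and latency in streaming mode.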



Published

2026-04-24

Section

Methods and algorithms of computational mathematics and their applications

