Speaker Diarization Using Bi-LSTM and Spectral Clustering


Author(s)

Trisiladevi C Nagavi 1,*, Samanvitha Sateesha 1, Shreya Sudhanva 1, Sukirth Shivakumar 1, Vibha Hullur 1

1. S. J. College of Engineering, JSS Science and Technology University, Mysore, Karnataka, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijem.2024.03.03

Received: 11 Jan. 2023 / Revised: 20 Feb. 2023 / Accepted: 17 Mar. 2023 / Published: 8 Jun. 2024

Index Terms

Speaker Diarization, Bi-LSTM, MFCC, Spectral Clustering, Diarization Error Rate

Abstract

Speaker diarization is the task of comparing, recognizing, and segregating different speech signals on the basis of speaker identity. This work accomplishes the process by segmenting, embedding, and clustering the features extracted from the speech sample. Mel-Frequency Cepstral Coefficients (MFCC) are extracted and fed into a Bi-Directional Long Short-Term Memory (Bi-LSTM) model for segmentation. d-vectors are then extracted using pre-trained models from the pyannote library, and spectral clustering is used to group the segments and separate the audio of one speaker from another. Experiments are carried out on two-speaker audio files, and the results indicate that the diarization is successful. A diarization error rate (DER) of 9.4% for a 2-speaker audio file is the lowest DER achieved on the given dataset, which demonstrates the efficiency of the system and justifies the combination of methods chosen at each step. We believe the work presented in this paper is a valuable contribution to the community: it documents recent developments using Bi-LSTM and spectral clustering methods and enables future work on speaker diarization.
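The pipeline described above (MFCC extraction, Bi-LSTM segmentation, d-vector embedding, spectral clustering, DER evaluation) can be sketched in a few lines of Python. The sketch below is illustrative only: the libraries used (librosa, TensorFlow/Keras, scikit-learn, pyannote), the layer sizes, n_mfcc=20, and the placeholder embedding matrix are assumptions for demonstration, not the authors' exact configuration.

    import numpy as np
    import librosa
    import tensorflow as tf
    from sklearn.cluster import SpectralClustering
    from pyannote.core import Annotation, Segment
    from pyannote.metrics.diarization import DiarizationErrorRate

    # 1. Feature extraction: MFCCs from the waveform (n_mfcc is illustrative).
    y, sr = librosa.load("two_speaker.wav", sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # shape: (20, n_frames)

    # 2. Segmentation: a Bi-LSTM over MFCC frames scoring each frame for a
    #    speaker change (layer sizes are illustrative, not the paper's).
    segmenter = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(None, 20)),          # (frames, n_mfcc)
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(64, return_sequences=True)),
        tf.keras.layers.Dense(1, activation="sigmoid"),   # P(change) per frame
    ])
    change_scores = segmenter(mfcc.T[np.newaxis, ...])    # (1, n_frames, 1), untrained

    # 3. Embedding: the paper extracts d-vectors with pre-trained pyannote
    #    models; a random placeholder matrix keeps this sketch self-contained.
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(40, 256))               # (n_segments, embed_dim)

    # 4. Spectral clustering on a cosine-similarity affinity matrix.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(unit @ unit.T, 0.0, 1.0)
    labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                                random_state=0).fit_predict(affinity)

    # 5. Evaluation: DER via pyannote.metrics on toy reference/hypothesis
    #    annotations (the segment boundaries below are made up).
    reference, hypothesis = Annotation(), Annotation()
    reference[Segment(0, 5)] = "A"
    reference[Segment(5, 10)] = "B"
    hypothesis[Segment(0, 4.5)] = "spk0"
    hypothesis[Segment(4.5, 10)] = "spk1"
    print(f"DER = {DiarizationErrorRate()(reference, hypothesis):.3f}")

Cosine similarity is a natural affinity choice here because speaker embeddings such as d-vectors are typically compared by angle rather than magnitude; with n_clusters fixed at 2, the clustering matches the two-speaker evaluation setting reported in the abstract.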

Cite This Paper

Trisiladevi C Nagavi, Samanvitha Sateesha, Shreya Sudhanva, Sukirth Shivakumar, Vibha Hullur, "Speaker Diarization Using Bi-LSTM and Spectral Clustering", International Journal of Engineering and Manufacturing (IJEM), Vol.14, No.3, pp. 27-35, 2024. DOI:10.5815/ijem.2024.03.03
