Speech Enhancement Using Joint Time and DCT Processing for Real Time Applications


Author(s)

Ravi Kumar Kandagatla 1,*, V. Jayachandra Naidu 2, P. S. Sreenivasa Reddy 3, Sivaprasad Nandyala 4

1. Department of ECE, Lakireddy Bali Reddy College of Engineering, Mylavaram, Andhra Pradesh, India

2. Department of ECE, Sri Venkateswara College of Engineering & Technology, Chittoor, India

3. Department of ECE, Nalla Narasimha Reddy Education Society’s Group of Institutions, Telangana, India

4. Technology Innovation Institute (TII), Abu Dhabi, U.A.E.

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2024.05.02

Received: 20 Jan. 2024 / Revised: 14 Feb. 2024 / Accepted: 10 Mar. 2024 / Published: 8 Oct. 2024

Index Terms

Speech Enhancement, Perceptual Evaluation of Speech Quality, Noise Reduction, Discrete Transform, Recurrent Neural Network, Gated Unit, Signal to Noise Ratio

Abstract

Deep learning based speech enhancement approaches provide better perceptual quality and better intelligibility. However, most speech enhancement methods in the literature estimate the enhanced speech from a processed amplitude, energy, or MFCC spectrum combined with the noisy phase. Because estimating the clean speech phase from noisy speech is difficult, the noisy phase is still used to reconstruct the enhanced speech. Some methods have been developed to estimate the clean speech phase, but the estimation is observed to be complex. To avoid this difficulty and achieve better performance, Discrete Cosine Transform (DCT) and Discrete Sine Transform (DST) based convolutional neural networks are proposed in place of the Discrete Fourier Transform (DFT), giving better intelligibility and improved performance. However, such algorithms operate on either time-domain features or frequency-domain features alone. To gain the advantages of both domains, a fusion of DCT and time-domain processing is proposed here. In this work, a DCT Dense Convolutional Recurrent Network (DCTDCRN), a DST Convolutional Gated Recurrent Unit network (DSTCGRU), a DST Convolutional Long Short-Term Memory network (DSTCLSTM), and a DST Dense Convolutional Recurrent Network (DSTDCRN) are proposed for speech enhancement. These methods provide superior performance and lower processing complexity compared with state-of-the-art methods. The proposed DCT-based methods are then used to develop a joint time- and magnitude-based speech enhancement method. Simulation results show superior performance over baseline methods for joint time- and frequency-domain processing. Results are also analyzed using objective performance measures such as Signal to Noise Ratio (SNR), Perceptual Evaluation of Speech Quality (PESQ), and Short-Time Objective Intelligibility (STOI).
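To make the DCT-domain processing concrete, the sketch below frames a noisy waveform, takes the real-valued DCT-II of each frame, passes the coefficients through a small convolutional-recurrent mask estimator, and resynthesizes the signal by inverse DCT and overlap-add. Because the DCT is real-valued, no separate phase estimate is needed, which is the advantage the abstract describes. This is a minimal illustration under stated assumptions, not the authors' code: the TinyDCTCRN model, frame length, and windowing below are placeholders, since the paper's exact DCTDCRN configuration is not reproduced here.

    # Illustrative sketch only: the paper's DCTDCRN architecture is not
    # specified on this page, so this tiny conv + GRU mask estimator and
    # the frame/DCT pipeline are assumptions, not the authors' method.
    import numpy as np
    import torch
    import torch.nn as nn
    from scipy.fft import dct, idct

    FRAME, HOP = 512, 256

    def frames_dct(x):
        """Split a waveform into overlapping frames and take the DCT-II.
        The DCT is real-valued, so there is no phase to estimate."""
        n = 1 + (len(x) - FRAME) // HOP
        f = np.stack([x[i * HOP:i * HOP + FRAME] for i in range(n)])
        return dct(f * np.hanning(FRAME), type=2, norm='ortho', axis=1)

    def overlap_add(F):
        """Invert the DCT per frame and overlap-add back to a waveform."""
        f = idct(F, type=2, norm='ortho', axis=1) * np.hanning(FRAME)
        x = np.zeros((len(F) - 1) * HOP + FRAME)
        for i, fr in enumerate(f):
            x[i * HOP:i * HOP + FRAME] += fr
        return x

    class TinyDCTCRN(nn.Module):
        """Toy convolutional-recurrent mask estimator in the DCT domain."""
        def __init__(self, bins=FRAME, hidden=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv1d(bins, hidden, kernel_size=3, padding=1), nn.ReLU())
            self.gru = nn.GRU(hidden, hidden, batch_first=True)
            self.out = nn.Linear(hidden, bins)

        def forward(self, X):  # X: (batch, time, bins)
            h = self.conv(X.transpose(1, 2)).transpose(1, 2)
            h, _ = self.gru(h)
            return torch.sigmoid(self.out(h)) * X  # masked DCT coefficients

    # Untrained forward pass on a stand-in signal, just to show data flow.
    noisy = np.random.randn(16000).astype(np.float32)
    X = torch.from_numpy(frames_dct(noisy)).unsqueeze(0).float()
    with torch.no_grad():
        enhanced_dct = TinyDCTCRN()(X).squeeze(0).numpy()
    enhanced = overlap_add(enhanced_dct)

The joint time and DCT method described in the abstract additionally processes the waveform directly; the untrained forward pass here only shows the DCT-domain branch of such a fusion.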
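The objective measures named above can likewise be computed with a short script. The SNR computation below is the standard ratio of clean-signal power to residual-noise power; PESQ and STOI are available through the third-party pesq and pystoi Python packages, shown commented out because the authors' evaluation tooling is not stated here and the package choice is an assumption.

    # Hedged example of the objective measures named in the abstract.
    import numpy as np

    def snr_db(clean, enhanced):
        """SNR of an enhanced signal against the clean reference, in dB."""
        noise = clean - enhanced
        return 10 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

    fs = 16000
    clean = np.random.randn(fs)                   # stand-in reference
    enhanced = clean + 0.1 * np.random.randn(fs)  # stand-in enhanced output
    print(f"SNR: {snr_db(clean, enhanced):.1f} dB")

    # PESQ/STOI via third-party packages (assumed tooling, not the paper's):
    # from pesq import pesq      # pip install pesq
    # from pystoi import stoi    # pip install pystoi
    # print(pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs))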

Cite This Paper

Ravi Kumar Kandagatla, V. Jayachandra Naidu, P. S. Sreenivasa Reddy, Sivaprasad Nandyala, "Speech Enhancement Using Joint Time and DCT Processing for Real Time Applications", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol.16, No.5, pp. 14-24, 2024. DOI: 10.5815/ijigsp.2024.05.02
