Estimating the Effects of Voice Quality and Speech Intelligibility of Audio Compression in Automatic Emotion Recognition

Full Text (PDF, 572 KB), pp. 69-80


Author(s)

A. Pramod Reddy 1, Dileep Kumar Ravikanti 2,*, Rakesh Betala 3, K. Venkatesh Sharma 4, K. Shirisha Reddy 1

1. TKR College of Engineering and Technology, Hyderabad, 500097, India

2. BVRIT Hyderabad College of Engineering for Women, Hyderabad, India

3. Engineering Department, University of Technology and Applied Sciences-Al Musannah, Al Musannah, Sultanate of Oman

4. CVR College of Engineering, Telangana, India

* Corresponding author.

DOI: https://doi.org/10.5815/ijigsp.2023.03.06

Received: 9 May 2022 / Revised: 11 Jun. 2022 / Accepted: 13 Aug. 2022 / Published: 8 Jun. 2023

Index Terms

Speech compression, speech intelligibility, emotion recognition, compression error ratio (CER)

Abstract

This paper examines the impact of speech compression on the accuracy of automatic emotion recognition (AER) systems. The effects of several codecs, namely MP3, Speex, and Adaptive Multi-Rate (narrowband and wideband), are compared against the uncompressed speech signal. Loudness recruitment, a steeper-than-normal growth in perceived loudness with presentation level, is associated with sensorineural hearing loss, and amplitude compression, for example in a hearing aid, is frequently used to compensate for this abnormality. Alternatively, because speech intelligibility has been linked to the perception of rapid energy changes, enlarging those changes by temporal envelope expansion may make speech more understandable. Yet even if these signal-processing methods improve speech understanding, their design and implementation may be constrained by insufficient sound quality. Syllabic compression and temporal envelope expansion were therefore assessed for both speech intelligibility and sound quality. Speech intelligibility was measured with an adaptive procedure using short everyday sentences presented either in steady-state noise or against a single competing speaker, and sound quality was rated on a scale for four artistic excerpts and for speech in quiet. Alongside a state-of-the-art baseline, spectral error, the compression error ratio (CER), and human-labeling effects are evaluated; the experiments are carried out on a Telugu dataset and on the well-known EMO-DB corpus. The results show that every compression technique reduces emotion recognition accuracy, while human labeling retains the best recognition accuracy. For high compression, based on the overall mean of the unweighted average recall, the AMR-WB and Speex codecs at a 6.6 kbit/s bit rate are recommended to provide the optimum quality for data storage.
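The evaluation pipeline the abstract describes, degrading speech with each codec and rescoring the recognizer with unweighted average recall (UAR), can be illustrated with a minimal sketch. This is not the authors' code: it assumes an ffmpeg build with the libopencore_amrnb, libvo_amrwbenc, libspeex, and libmp3lame encoders, and the file paths, bitrate choices, and function names are hypothetical.

```python
# Illustrative sketch: transcode speech through the lossy codecs named in the
# abstract, then score a classifier with unweighted average recall (UAR).
# Assumes ffmpeg with AMR-NB/AMR-WB/Speex/MP3 encoders; bitrates are examples
# (6.6 kbit/s is the lowest AMR-WB mode mentioned in the abstract).
import subprocess
from pathlib import Path

from sklearn.metrics import recall_score

# codec name -> (ffmpeg encoder, bitrate, container format, sample rate)
CODECS = {
    "amr_nb": ("libopencore_amrnb", "12.2k", "amr", 8000),   # narrowband
    "amr_wb": ("libvo_amrwbenc",    "6.6k",  "amr", 16000),  # lowest WB mode
    "speex":  ("libspeex",          "6.6k",  "ogg", 16000),
    "mp3":    ("libmp3lame",        "32k",   "mp3", 16000),
}

def transcode(wav_in: Path, codec: str, out_dir: Path) -> Path:
    """Encode with a lossy codec, then decode back to PCM WAV so the degraded
    signal can run through the same feature extractor as the original."""
    enc, bitrate, fmt, rate = CODECS[codec]
    compressed = out_dir / f"{wav_in.stem}_{codec}.{fmt}"
    decoded = out_dir / f"{wav_in.stem}_{codec}.wav"
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav_in), "-ac", "1", "-ar", str(rate),
         "-c:a", enc, "-b:a", bitrate, "-f", fmt, str(compressed)],
        check=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(compressed), str(decoded)],
                   check=True)
    return decoded

def uar(y_true, y_pred):
    """Unweighted average recall: the mean of per-class recalls, so frequent
    classes (e.g. neutral) cannot mask losses on rare emotions."""
    return recall_score(y_true, y_pred, average="macro")
```

Running each corpus file through `transcode` and comparing `uar` on the original versus the decoded versions yields the kind of per-codec degradation comparison the abstract reports; macro-averaged recall is used because EMO-DB and similar corpora have unbalanced emotion classes.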

Cite This Paper

A. Pramod Reddy, Dileep Kumar Ravikanti, Rakesh Betala, K. Venkatesh Sharma, K. Shirisha Reddy, "Estimating the Effects of Voice Quality and Speech Intelligibility of Audio Compression in Automatic Emotion Recognition", International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 15, No. 3, pp. 69-80, 2023. DOI: 10.5815/ijigsp.2023.03.06

References

[1] A. Pramod Reddy and V. Vijayarajan, “Recognition of human emotion with spectral features using multi layer-perceptron,” Int. J. Knowledge-Based Intell. Eng. Syst., vol. 24, no. 3, 2020, doi: 10.3233/KES-200044.
[2] A. P. Reddy and V. Vijayarajan, “Audio compression with multi-algorithm fusion and its impact in speech emotion recognition,” Int. J. Speech Technol., pp. 1–9, 2020.
[3] E. Villchur, “Signal processing to improve speech intelligibility in perceptive deafness,” J. Acoust. Soc. Am., vol. 53, no. 6, pp. 1646–1657, 1973.
[4] K. Bengtsson, “Talandet som levd erfarenhet: En studie av fyra barn med Downs syndrom” [“Speaking as lived experience: A study of four children with Down syndrome”], Estetisk-filosofiska fakulteten, 2006.
[5] L. Laaksonen, H. Pulakka, V. Myllylä, and P. Alku, “Development, evaluation and implementation of an artificial bandwidth extension method of telephone speech in mobile terminal,” IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 780–787, 2009, doi: 10.1109/TCE.2009.5174454.
[6] R. P. Lippmann, L. D. Braida, and N. I. Durlach, “Study of multichannel amplitude compression and linear amplification for persons with sensorineural hearing loss,” J. Acoust. Soc. Am., vol. 69, no. 2, pp. 524–534, 1981.
[7] I. V. Nábělek, “Performance of hearing-impaired listeners under various types of amplitude compression,” J. Acoust. Soc. Am., vol. 74, no. 3, pp. 776–791, 1983.
[8] D. K. Bustamante and L. D. Braida, “Multiband compression limiting for hearing-impaired listeners,” J. Rehabil. Res. Dev., vol. 24, no. 4, pp. 149–160, 1987.
[9] H. Levitt, M. Bakke, J. Kates, A. Neuman, T. Schwander, and M. Weiss, “Signal processing for hearing impairment,” Scand. Audiol. Suppl., vol. 38, pp. 7–19, 1993.
[10] G. Walker, D. Byrne, and H. Dillon, “Learning effects with a closed response set nonsense syllable test,” Aust. New Zeal. J. Audiol., vol. 4, no. 1, pp. 27–31, 1982.
[11] R. Plutchik, “A general psychoevolutionary theory of emotion,” in Theories of Emotion, Elsevier, 1980, pp. 3–33.
[12] R. Plutchik, Emotion: A Psychoevolutionary Synthesis. Harpercollins College Division, 1980.
[13] J. Boyd, “Sony unleashes new Aibo robot dog,” IEEE Spectrum, IEEE, 2017.
[14] Y. Attabi and P. Dumouchel, “Anchor models for emotion recognition from speech,” IEEE Trans. Affect. Comput., vol. 4, no. 3, pp. 280–290, 2013, doi: 10.1109/T-AFFC.2013.17.
[15] M. F. Teng, “Emotional development and construction of teacher identity: Narrative interactions about the pre-service teachers’ practicum experiences,” Aust. J. Teach. Educ., vol. 42, no. 11, pp. 117–134, 2017.
[16] R. Plutchik, “A psychoevolutionary theory of emotions,” Sage Publications, 1982.
[17] Y. Qian and A. Mita, “Acceleration-based damage indicators for building structures using neural network emulators,” Struct. Control Health Monit., vol. 15, no. 6, pp. 901–920, 2008.
[18] D. King, S. M. Ritchie, M. Sandhu, S. Henderson, and B. Boland, “Temporality of emotion: Antecedent and successive variants of frustration when learning chemistry,” Sci. Educ., vol. 101, no. 4, pp. 639–672, 2017.
[19] I. Varga, R. D. De Lacovo, and P. Usai, “Standardization of the AMR wideband speech codec in 3GPP and ITU-T,” IEEE Commun. Mag., vol. 44, no. 5, pp. 66–73, 2006.
[20] K. Pisanski et al., “Vocal indicators of body size in men and women: A meta-analysis,” Anim. Behav., vol. 95, pp. 89–99, 2014, doi: 10.1016/j.anbehav.2014.06.011.
[21] A. M. Kondoz, Digital Speech: Coding for Low Bit Rate Communication Systems. John Wiley & Sons, 2005.
[22] M. Agiwal, A. Roy, and N. Saxena, “Next generation 5G wireless networks: A comprehensive survey,” IEEE Commun. Surv. & Tutorials, vol. 18, no. 3, pp. 1617–1655, 2016.
[23] A. Nishimura, “Data hiding in pitch delay data of the adaptive multi-rate narrow-band speech codec,” in Proc. 2009 Fifth Int. Conf. Intelligent Information Hiding and Multimedia Signal Processing, 2009, pp. 483–486.
[24] J. G. Beerends et al., “Perceptual objective listening quality assessment (POLQA), the third generation ITU-T standard for end-to-end speech quality measurement part I—Temporal alignment,” J. Audio Eng. Soc., vol. 61, no. 6, pp. 366–384, 2013.
[25] P. Coverdale, S. Moller, A. Raake, and A. Takahashi, “Multimedia quality assessment standards in ITU-T SG12,” IEEE Signal Process. Mag., vol. 28, no. 6, pp. 91–97, 2011.
[26] C. Spearman, “The proof and measurement of association between two things,” 1961.
[27] C. Spearman, “The proof and measurement of association between two things,” Am. J. Psychol., vol. 100, no. 3/4, pp. 441–471, 1987.
[28] A. F. Lotz, I. Siegert, M. Maruschke, and A. Wendemuth, “Audio compression and its impact on emotion recognition in affective computing,” Elektron. Sprachsignalverarbeitung 2017, pp. 1–8, 2017.