
Proceedings of the Southwest State University


Applying Multitask Deep Learning to Emotion Recognition in Speech

https://doi.org/10.21869/2223-1560-2021-25-1-82-109

Abstract

Purpose of research. Emotions play one of the key roles in the regulation of human behaviour. Solving the problem of automatic emotion recognition would improve the effectiveness of a whole range of digital systems, such as security systems, human-machine interfaces and e-commerce systems. At the same time, modern approaches to recognizing emotions in speech remain of limited accuracy. This work studies automatic recognition of emotions in speech using machine learning methods.

Methods. The article describes and tests an approach to automatic recognition of emotions in speech based on multitask learning of deep convolutional neural networks of the AlexNet and VGG architectures, with automatic selection of the weight coefficient for each task when calculating the final loss value during training. All models were trained on a sample of the IEMOCAP dataset covering four emotional categories: ‘anger’, ‘happiness’, ‘neutral emotion’ and ‘sadness’. Log-mel spectrograms of utterances, processed by a specialized algorithm, are used as input data.
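
The abstract does not specify the exact weighting scheme, so the following is only a minimal PyTorch sketch of one widely used approach to automatic loss weighting in multitask learning: uncertainty-based weighting (Kendall et al., 2018), in which a learnable log-variance per task scales that task's contribution to the total loss. The class name, the two-task setup and the initialization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Illustrative sketch: combine per-task losses with learnable
    log-variance weights (uncertainty weighting); not necessarily the
    weighting scheme used by the authors."""

    def __init__(self, num_tasks: int):
        super().__init__()
        # One learnable log(sigma^2) per task, initialized to zero.
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = torch.zeros((), device=self.log_vars.device)
        for i, loss in enumerate(task_losses):
            precision = torch.exp(-self.log_vars[i])
            # Tasks with high estimated noise get smaller weights;
            # the additive log-variance term keeps the noise estimate bounded.
            total = total + precision * loss + self.log_vars[i]
        return total

# Usage sketch: two task losses (e.g. emotion classification plus an
# auxiliary task) combined into a single value for backpropagation.
criterion = UncertaintyWeightedLoss(num_tasks=2)
loss_emotion = torch.tensor(1.3, requires_grad=True)
loss_auxiliary = torch.tensor(0.7, requires_grad=True)
total_loss = criterion([loss_emotion, loss_auxiliary])
total_loss.backward()
```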

Results. The considered models were evaluated with numerical metrics: accuracy (the share of correctly recognized instances), precision, recall and F-measure. On all of these metrics, the proposed model improves the quality of emotion recognition compared with the two baseline single-task models as well as with known solutions. This result is achieved by automatically weighting the loss values of the individual tasks when forming the final error during training.
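
For reference, the listed metrics can be computed as in the Python sketch below; the labels are toy values and the macro averaging over the four classes is an assumption, not the authors' evaluation code.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy ground-truth and predicted labels for four emotion classes:
# 0 = anger, 1 = happiness, 2 = neutral, 3 = sadness (illustrative only).
y_true = [0, 1, 2, 3, 1, 0, 3, 2]
y_pred = [0, 1, 2, 2, 1, 0, 3, 3]

print("accuracy :", accuracy_score(y_true, y_pred))  # share of correctly recognized instances
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
print("f-measure:", f1_score(y_true, y_pred, average="macro", zero_division=0))
```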

Conclusion. The improvement in the quality of emotion recognition over known solutions confirms the feasibility of applying multitask learning to increase the accuracy of emotion recognition models. The developed approach achieves a uniform and simultaneous reduction of the errors of the individual tasks and is applied to emotion recognition in speech for the first time.

About the Authors

A. V. Ryabinov
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS); St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
Russian Federation

Artem V. Ryabinov, Software Engineer, Laboratory of Autonomous Robotic Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178


Competing Interests:

The authors declare the absence of obvious and potential conflicts of interest related to the publication of this article.



M. Yu. Uzdiaev
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS); St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
Russian Federation

Mikhail Yu. Uzdiaev, Junior Researcher, Laboratory of Big Data in Socio-Cyberphysical Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178


Competing Interests:

The authors declare the absence of obvious and potential conflicts of interest related to the publication of this article.



I. V. Vatamaniuk
St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS); St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences
Russian Federation

Irina V. Vatamaniuk, Junior Researcher, Laboratory of Autonomous Robotic Systems, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences

39, 14th Line, St. Petersburg 199178


Competing Interests:

The authors declare the absence of obvious and potential conflicts of interest related to the publication of this article.




For citations:


Ryabinov A.V., Uzdiaev M.Yu., Vatamaniuk I.V. Applying Multitask Deep Learning to Emotion Recognition in Speech. Proceedings of the Southwest State University. 2021;25(1):82-109. (In Russ.) https://doi.org/10.21869/2223-1560-2021-25-1-82-109



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2223-1560 (Print)
ISSN 2686-6757 (Online)