Comparison of speaker dependent and speaker independent emotion recognition
Jan Rybka; Artur Janicki
International Journal of Applied Mathematics and Computer Science, Volume 23 (2013), pp. 797-808 / Harvested from The Polish Digital Mathematics Library

This paper describes a study of emotion recognition based on speech analysis. The introduction to the theory reviews the emotion inventories used in various studies of emotion recognition, as well as the speech corpora applied, methods of speech parametrization, and the most commonly employed classification algorithms. In the current study the EMO-DB speech corpus and three selected classifiers, k-Nearest Neighbors (k-NN), an Artificial Neural Network (ANN) and Support Vector Machines (SVMs), were used in experiments. SVMs provided the best classification accuracy, 75.44%, in the speaker dependent mode, that is, when speech samples from the same speaker were included in the training corpus. Various speaker dependent and speaker independent configurations were analyzed and compared. Emotion recognition in speaker dependent conditions usually yielded higher accuracy than a similar but speaker independent configuration. The improvement was especially evident when the baseline recognition rate for a given speaker was low. Happiness and anger, as well as boredom and neutrality, proved to be the pairs of emotions most often confused.
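The speaker dependent vs. speaker independent distinction in the abstract comes down to how the data is split: in the dependent setting, utterances from the test speaker may appear in training; in the independent setting, all utterances of the test speaker are held out. A minimal sketch of both evaluation protocols with an SVM classifier, using scikit-learn and synthetic feature vectors standing in for EMO-DB acoustic features (feature dimensions, speaker counts, and emotion labels below are illustrative assumptions, not the paper's setup):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_speakers, utt_per_speaker, n_features, n_emotions = 10, 30, 24, 7

# Synthetic stand-in for per-utterance acoustic feature vectors.
X = rng.normal(size=(n_speakers * utt_per_speaker, n_features))
y = rng.integers(0, n_emotions, size=len(X))                # emotion labels
groups = np.repeat(np.arange(n_speakers), utt_per_speaker)  # speaker IDs
X += y[:, None] * 0.5  # make the labels partly recoverable from the features

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Speaker dependent: a plain k-fold split, so utterances from every
# speaker can occur in both the training and the test folds.
sd_scores = cross_val_score(clf, X, y, cv=5)

# Speaker independent: leave-one-speaker-out, so the test speaker's
# utterances never appear in training.
si_scores = cross_val_score(clf, X, y, groups=groups, cv=LeaveOneGroupOut())

print(f"speaker-dependent accuracy:   {sd_scores.mean():.3f}")
print(f"speaker-independent accuracy: {si_scores.mean():.3f}")
```

On real speech features the first protocol typically scores higher, which is the pattern the study reports; with random synthetic data the gap carries no meaning.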

Published: 2013-01-01
EUDML-ID : urn:eudml:doc:262324
@article{bwmeta1.element.bwnjournal-article-amcv23z4p797bwm,
     author = {Jan Rybka and Artur Janicki},
     title = {Comparison of speaker dependent and speaker independent emotion recognition},
     journal = {International Journal of Applied Mathematics and Computer Science},
     volume = {23},
     year = {2013},
     pages = {797-808},
     language = {en},
     url = {http://dml.mathdoc.fr/item/bwmeta1.element.bwnjournal-article-amcv23z4p797bwm}
}
Jan Rybka; Artur Janicki. Comparison of speaker dependent and speaker independent emotion recognition. International Journal of Applied Mathematics and Computer Science, Volume 23 (2013), pp. 797-808. http://gdmltest.u-ga.fr/item/bwmeta1.element.bwnjournal-article-amcv23z4p797bwm/

[000] Ayadi, M.E., Kamel, M.S. and Karray, F. (2007). Speech emotion recognition using Gaussian mixture vector autoregressive models, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu, HI, USA, Vol. 4, pp. IV-957-IV-960.

[001] Ayadi, M.E., Kamel, M.S. and Karray, F. (2011). Survey on speech emotion recognition: Features, classification schemes, and databases, Pattern Recognition 44(3): 572-587. | Zbl 1207.68275

[002] Batliner, A., Steidl, S., Hacker, C., Noth, E. and Niemann, H. (2005). Tales of tuning - prototyping for automatic classification of emotional user states, Interspeech 2005, Lisbon, Portugal, pp. 489-492.

[003] Brooks, M. (2012). Voicebox: Speech processing toolbox for Matlab, http://www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html.

[004] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W. and Weiss, B. (2005). A database of German emotional speech, Interspeech 2005, Lisbon, Portugal, pp. 1517-1520.

[005] Camacho, A. and Harris, J.G. (2008). A sawtooth waveform inspired pitch estimator for speech and music, Journal of the Acoustical Society of America 124: 1638-1652.

[006] Cichosz, J. and Slot, K. (2007). Emotion recognition in speech signal using emotion-extracting binary decision trees, ACII 2007, Lisbon, Portugal.

[007] Clavel, C., Devillers, L., Richard, G., Vasilescu, I. and Ehrette, T. (2007). Detection and analysis of abnormal situations through fear-type acoustic manifestations, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu, HI, USA, Vol. 4, pp. IV-21-IV-24.

[008] Devillers, L. and Vidrascu, L. (2006). Real-life emotions detection with lexical and paralinguistic cues on human-human call center dialogs, Interspeech 2006, Pittsburgh, PA, USA, pp. 801-804.

[009] Ekman, P. (1972). Universals and cultural differences in facial expressions of emotions, in J. Cole (Ed.), Nebraska Symposium on Motivation, Vol. 19, University of Nebraska Press, Lincoln, NE, pp. 207-282.

[010] Engberg, I.S., Hansen, A.V., Andersen, O. and Dalsgaard, P. (1997). Design, recording and verification of a Danish emotional speech database, Eurospeech 1997, Rhodes, Greece.

[011] Erden, M. and Arslan, L.M. (2011). Automatic detection of anger in human-human call center dialogs, Interspeech 2011, Florence, Italy, pp. 81-84.

[012] Gajsek, R., Mihelic, F. and Dobrisek, S. (2013). Speaker state recognition using an HMM-based feature extraction method, Computer Speech and Language 27(1): 135-150.

[013] Gorska, Z. and Janicki, A. (2012). Recognition of extraversion level based on handwriting and support vector machines, Perceptual and Motor Skills 114(3): 857-869.

[014] Grimm, M., Kroschel, K. and Narayanan, S. (2007). Support vector regression for automatic recognition of spontaneous emotions in speech, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu, HI, USA, Vol. 4, pp. IV-1085-IV-1088.

[015] Hassan, A. and Damper, R.I. (2010). Multi-class and hierarchical SVMs for emotion recognition, Interspeech 2010, Makuhari, Japan, pp. 2354-2357.

[016] He, L., Lech, M., Memon, S. and Allen, N. (2008). Recognition of stress in speech using wavelet analysis and Teager energy operator, Interspeech 2008, Brisbane, Australia, pp. 605-608.

[017] Hirschberg, J., Benus, S., Brenier, J.M., Enos, F., Friedman, S., Gilman, S., Gir, C., Graciarena, M., Kathol, A. and Michaelis, L. (2005). Distinguishing deceptive from non-deceptive speech, Interspeech 2005, Lisbon, Portugal, pp. 1833-1836.

[018] Iliou, T. and Anagnostopoulos, C.-N. (2010). Classification on speech emotion recognition: A comparative study, International Journal on Advances in Life Sciences 2(1-2): 18-28.

[019] Janicki, A. (2012). On the Impact of Non-speech Sounds on Speaker Recognition, Text, Speech and Dialogue, Vol. 7499, Springer, Berlin/Heidelberg, pp. 566-572.

[020] Janicki, A. and Turkot, M. (2008). Speaker emotion recognition with the use of support vector machines, Telecommunication Review and Telecommunication News (8-9): 994-1005, (in Polish).

[021] Jeleń, Ł., Fevens, T. and Krzyżak, A. (2008). Classification of breast cancer malignancy using cytological images of fine needle aspiration biopsies, International Journal of Applied Mathematics and Computer Science 18(1): 75-83, DOI: 10.2478/v10006-008-0007-x.

[022] Kaminska, D. and Pelikant, A. (2012). Recognition of human emotion from a speech signal based on Plutchik's model, International Journal of Electronics and Telecommunications 58(2): 165-170.

[023] Kang, B.S., Han, C.H., Lee, S.T., Youn, D.H. and Lee, C. (2000). Speaker dependent emotion recognition using speech signals, ICSLP 2000, Beijing, China.

[024] Kowalczuk, Z. and Czubenko, M. (2011). Intelligent decision-making system for autonomous robots, International Journal of Applied Mathematics and Computer Science 21(4): 671-684, DOI: 10.2478/v10006-011-0053-7. | Zbl 1283.93203

[025] Liberman, M., Davis, K., Grossman, M., Martey, N. and Bell, J. (2002). Emotional Prosody Speech and Transcripts, Linguistic Data Consortium, Philadelphia, PA.

[026] Liscombe, J., Hirschberg, J. and Venditti, J.J. (2005). Detecting certainness in spoken tutorial dialogues, Interspeech 2005, Lisbon, Portugal.

[027] Liu, G., Lei, Y. and Hansen, J.H.L. (2010). A novel feature extraction strategy for multi-stream robust emotion identification, Interspeech 2010, Makuhari, Japan, pp. 482-485.

[028] Lugger, M. and Yang, B. (2007). The relevance of voice quality features in speaker independent emotion recognition, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2007), Honolulu, HI, USA, Vol. 4, pp. IV-17-IV-20.

[029] Lugger, M., Yang, B. and Wokurek, W. (2006). Robust estimation of voice quality parameters under real-world disturbances, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2006), Toulouse, France, Vol. 1, p. I.

[030] Mehrabian, A. and Wiener, M. (1967). Decoding of inconsistent communications, Journal of Personality and Social Psychology 6(1): 109-114.

[031] Neiberg, D., Laukka, P. and Ananthakrishnan, G. (2010). Classification of affective speech using normalized time-frequency cepstra, 5th International Conference on Speech Prosody (Speech Prosody 2010), Chicago, IL, USA, pp. 1-4.

[032] Patan, K. and Korbicz, J. (2012). Nonlinear model predictive control of a boiler unit: A fault tolerant control study, International Journal of Applied Mathematics and Computer Science 22(1): 225-237, DOI: 10.2478/v10006-012-0017-6. | Zbl 1273.93071

[033] Scherer, K.R. (2003). Vocal communication of emotion: A review of research paradigms, Speech Communication 40(1-2): 227-256. | Zbl 1006.68948

[034] Schuller, B., Koehler, N., Moeller, R. and Rigoll, G. (2006). Recognition of interest in human conversational speech, Interspeech 2006, Pittsburgh, PA, USA, pp. 793-796.

[035] Schuller, B., Vlasenko, B., Eyben, F., Rigoll, G. and Wendemuth, A. (2009). Acoustic emotion recognition: A benchmark comparison of performances, IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2009), Merano, Italy, pp. 552-557.

[036] Seppi, D., Batliner, A., Schuller, B., Steidl, S., Vogt, T., Wagner, J., Devillers, L., Vidrascu, L., Amir, N. and Aharonson, V. (2008). Patterns, prototypes, performance: Classifying emotional user states, Interspeech 2008, Brisbane, Australia, pp. 601-604.

[037] Vapnik, V.N. (1982). Estimation of Dependences Based on Empirical Data, Springer-Verlag, New York, NY, (translation of Vosstanovlenie zavisimostei po empiricheskim dannym by Samuel Kotz). | Zbl 0499.62005

[038] Xiao, Z., Dellandrea, E., Dou, W. and Chen, L. (2006). Two-stage classification of emotional speech, International Conference on Digital Telecommunications (ICDT'06), Cap Esterel, Côte d'Azur, France, p. 32.

[039] Yacoub, S., Simske, S., Lin, X. and Burns, J. (2003). Recognition of emotions in interactive voice response systems, Eurospeech 2003, Geneva, Switzerland, pp. 1-4.

[040] Yu, C., Aoki, P. M. and Woodruff, A. (2004). Detecting user engagement in everyday conversations, 8th International Conference on Spoken Language Processing (ICSLP 2004), Jeju, Korea, pp. 1-6.