Abstract:
Speech emotion recognition is an important research area in human-computer interaction, and extracting the most representative emotional features from speech remains an active research topic. This paper proposes a novel framework, the Multi-scale Spectral Feature Attention Fusion Network (MSFAFN), which improves emotion recognition by combining audio features from multiple levels. The network consists mainly of feature extraction blocks and feature learning blocks. The feature extraction block extracts feature maps through three parallel paths with different convolution kernel sizes, then reweights and fuses them with an attention mechanism, so that the network learns features of different scales and directions and represents emotion-related information more effectively. The feature learning block is composed of multi-layer convolutional neural networks that learn features at different time scales by means of sliding windows. Together, the two blocks better capture the spectral and temporal characteristics of speech. To further improve the generalization and classification ability of the model, training is jointly supervised by two loss functions, which improves classification accuracy and stability on complex emotion datasets. Experiments show that MSFAFN achieves accuracies of 95.66% and 95.79% on the RAVDESS and Emo-DB emotion datasets, respectively.
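The multi-scale extraction and attention fusion described above can be illustrated with a toy, dependency-free sketch. It is not the MSFAFN implementation: the kernel sizes (3, 5, 7), the fixed averaging kernels, the one-dimensional input, and the energy-based attention score are all assumptions chosen for illustration, and the function names are hypothetical. In the actual network the kernels and attention weights would be learned.

```python
import math


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding,
    so the output has the same length as the input."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention_fusion(signal, kernel_sizes=(3, 5, 7)):
    """Run three parallel paths with different kernel sizes, then fuse
    their outputs with softmax attention weights (one weight per path)."""
    # Assumption: averaging kernels stand in for learned convolution kernels.
    branches = [conv1d_same(signal, [1.0 / k] * k) for k in kernel_sizes]
    # Assumption: mean absolute activation serves as the attention score;
    # a trained network would compute this score with learned parameters.
    scores = [sum(abs(v) for v in b) / len(b) for b in branches]
    weights = softmax(scores)
    fused = [
        sum(w * b[i] for w, b in zip(weights, branches))
        for i in range(len(signal))
    ]
    return fused, weights


if __name__ == "__main__":
    frame_energies = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.0, 0.0]
    fused, weights = multi_scale_attention_fusion(frame_energies)
    print("attention weights:", weights)
    print("fused features:", fused)
```

The softmax keeps the path weights positive and summing to one, so the fused output is a convex combination of the three scales. This is one simple way to read "reweights and fuses".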