• Speech emotion recognition based on multi-scale feature attention fusion

    • Abstract: Speech emotion recognition is an important research area in human-computer interaction, and extracting the most representative emotional features from speech is one of its central problems. To address the limited feature representation capability of current speech emotion recognition systems, a novel framework, the Multi-scale Spectral Feature Attention Fusion Network (MSFAFN), is proposed, which aims to improve emotion recognition by combining audio features from multiple levels. The network consists mainly of a feature extraction block and a feature learning block. The feature extraction block extracts feature maps through three parallel paths with different convolution kernel sizes; an attention mechanism then reweights and fuses these features, so that the network can learn features at different scales and orientations and better represent emotion-related information. The feature learning block is a multi-layer convolutional neural network that learns features at different time scales in a sliding-window manner. Working together, the two blocks better capture the spectral and temporal characteristics of speech. To further improve the model's generalization and class discrimination, two loss functions are applied jointly as supervision during training, which improves classification accuracy and stability on complex emotion datasets. Experiments show that MSFAFN achieves accuracies of 95.66% and 95.79% on the RAVDESS and Emo-DB emotion datasets, respectively.
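      The multi-scale extraction and attention fusion step described in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it works on a 1-D sequence rather than a spectrogram, uses fixed averaging kernels of odd length instead of learned convolution weights, and scores each branch by its mean absolute activation as a stand-in for a learned attention module. All function names are hypothetical.

```python
import math


def conv1d_same(signal, kernel):
    """Slide an odd-length kernel over the signal with zero padding,
    so the output has the same length as the input."""
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_scale_attention_fusion(signal, kernels):
    """Run one branch per kernel size, then fuse the branches with
    softmax attention weights (assumption: the score of a branch is
    its mean absolute activation, not a learned quantity)."""
    branches = [conv1d_same(signal, k) for k in kernels]
    scores = [sum(abs(v) for v in b) / len(b) for b in branches]
    weights = softmax(scores)
    fused = [
        sum(w * b[i] for w, b in zip(weights, branches))
        for i in range(len(signal))
    ]
    return fused, weights


# Three parallel paths with kernel sizes 3, 5 and 7 (illustrative values).
kernels = [[1 / 3] * 3, [1 / 5] * 5, [1 / 7] * 7]
frame = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]
fused, weights = multi_scale_attention_fusion(frame, kernels)
```

      Here `weights` holds one attention weight per path and sums to 1, and `fused` has the same length as the input. In a real network the kernels and the attention scores would be learned, and the convolutions would be two-dimensional over time and frequency.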

       
