Abstract:
Speech Emotion Recognition (SER) has long been an important topic in human-computer interaction. Most existing methods rely on hand-crafted features, which may discard emotion-related information contained in raw speech signals. In recent years, speech Self-supervised Learning (SSL) models such as Wav2vec 2.0 (W2V2) have emerged and been employed to extract general speech representations for downstream SER tasks. However, the full set of parameters introduced by SSL models is largely unnecessary for SER. In this paper, an SER model is proposed based on the shallow structure of W2V2 and the attention mechanism. The W2V2-based module, built from the first seven Conv1d blocks of W2V2, extracts local feature representations from raw speech signals. The attention-based module then captures global contextual emotional information from these local representations; within it, three multi-head self-attention blocks are cascaded for multi-level feature fusion. Experimental results show that the proposed model outperforms the baselines on the IEMOCAP and EMODB datasets. ©2024 IEEE.
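To make the described architecture concrete, below is a minimal PyTorch sketch of the two modules the abstract outlines. The convolution settings follow Wav2vec 2.0's published feature-encoder defaults (512 channels; kernels 10/3/3/3/3/2/2; strides 5/2/2/2/2/2/2); the fusion-by-concatenation, mean pooling, head size, and four-class output are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SERSketch(nn.Module):
    """Sketch: W2V2-style Conv1d front-end + three cascaded self-attention blocks."""

    def __init__(self, d_model=512, n_heads=8, n_classes=4):
        super().__init__()
        # Seven Conv1d blocks mirroring the Wav2vec 2.0 feature encoder.
        kernels, strides = [10, 3, 3, 3, 3, 2, 2], [5, 2, 2, 2, 2, 2, 2]
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, d_model, k, stride=s), nn.GELU()]
            in_ch = d_model
        self.encoder = nn.Sequential(*layers)
        # Three cascaded multi-head self-attention blocks.
        self.attn_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(3)
        )
        # Fusing the three block outputs by concatenation is an assumption;
        # the abstract only states "multi-level feature fusion".
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.encoder(wav.unsqueeze(1))       # (batch, 512, frames)
        x = x.transpose(1, 2)                    # (batch, frames, 512)
        levels = []
        for block in self.attn_blocks:
            x = block(x)
            levels.append(x.mean(dim=1))         # temporal average per level
        return self.classifier(torch.cat(levels, dim=-1))

model = SERSketch()
logits = model(torch.randn(2, 16000))            # two 1-second clips at 16 kHz
```

Keeping only the convolutional front-end drops W2V2's 12-layer transformer stack, which is where most of its parameters sit; the three lightweight self-attention blocks then supply the global context that the local convolutional features lack.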
Year: 2024
Page: 398-402
Language: English