Abstract:
Speech Emotion Recognition (SER) has long been an important topic in human-computer interaction. Most existing methods rely on hand-crafted features, which may discard emotion-related information contained in raw speech signals. In recent years, speech Self-supervised Learning (SSL) models such as Wav2vec 2.0 (W2V2) have emerged and been employed to extract general speech representations for downstream SER tasks. However, the full set of parameters introduced by SSL models is largely unnecessary for SER. In this paper, an SER model is proposed based on the shallow structure of W2V2 and the attention mechanism. The W2V2-based module, built from the first seven Conv1d blocks of W2V2, extracts local feature representations from raw speech signals. The attention-based module then captures global contextual emotional information from these local representations; within it, three multi-head self-attention blocks are cascaded for multi-level feature fusion. Experimental results show that the proposed model outperforms the baselines on the IEMOCAP and EMODB datasets. ©2024 IEEE.
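To make the described architecture concrete, below is a minimal PyTorch sketch of the two modules the abstract outlines. The convolution settings follow Wav2vec 2.0's published feature-encoder defaults (512 channels; kernels 10/3/3/3/3/2/2; strides 5/2/2/2/2/2/2); the fusion-by-concatenation, mean pooling, head size, and four-class output are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SERSketch(nn.Module):
    """Sketch: W2V2-style Conv1d front-end + three cascaded self-attention blocks."""

    def __init__(self, d_model=512, n_heads=8, n_classes=4):
        super().__init__()
        # Seven Conv1d blocks mirroring the Wav2vec 2.0 feature encoder.
        kernels, strides = [10, 3, 3, 3, 3, 2, 2], [5, 2, 2, 2, 2, 2, 2]
        layers, in_ch = [], 1
        for k, s in zip(kernels, strides):
            layers += [nn.Conv1d(in_ch, d_model, k, stride=s), nn.GELU()]
            in_ch = d_model
        self.encoder = nn.Sequential(*layers)
        # Three cascaded multi-head self-attention blocks.
        self.attn_blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(3)
        )
        # Fusing the three block outputs by concatenation is an assumption;
        # the abstract only states "multi-level feature fusion".
        self.classifier = nn.Linear(3 * d_model, n_classes)

    def forward(self, wav):                      # wav: (batch, samples)
        x = self.encoder(wav.unsqueeze(1))       # (batch, 512, frames)
        x = x.transpose(1, 2)                    # (batch, frames, 512)
        levels = []
        for block in self.attn_blocks:
            x = block(x)
            levels.append(x.mean(dim=1))         # temporal average per level
        return self.classifier(torch.cat(levels, dim=-1))

model = SERSketch()
logits = model(torch.randn(2, 16000))            # two 1-second clips at 16 kHz
```

Keeping only the convolutional front-end drops W2V2's 12-layer transformer stack, which is where most of its parameters sit; the three lightweight self-attention blocks then supply the global context that the local convolutional features lack.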
Year: 2024
Page: 398-402
Language: English