Abstract:
To address the challenges of poor representation capability and low data utilization in end-to-end speech recognition models, this study proposes an end-to-end speech recognition model based on multi-scale feature fusion and multi-view self-supervised learning (MM-ASR), trained under a multi-task learning paradigm. The proposed method emphasizes the importance of inter-layer information within shared encoders, aiming to enhance the model's representation capability via the multi-scale feature fusion module. Moreover, we apply multi-view self-supervised learning to effectively exploit data information. Our approach is rigorously evaluated on the Aishell-1 dataset, and its effectiveness is further validated on the English WSJ corpus. The experimental results demonstrate a noteworthy 4.6% reduction in character error rate, indicating significantly improved speech recognition performance. These findings showcase the effectiveness and potential of the proposed MM-ASR model for end-to-end speech recognition tasks. © The Author(s) 2024.
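The abstract describes fusing inter-layer information from the shared encoder, but gives no formula. A common realization of such multi-scale layer fusion is a learnable softmax-weighted sum of intermediate encoder outputs; the sketch below (names `fuse_layers` and `layer_weights` are hypothetical, not from the paper) illustrates that idea, not the authors' actual implementation.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_layers(layer_outputs, layer_weights):
    """Softmax-weighted combination of encoder layer outputs.

    layer_outputs: list of L arrays, each of shape (T, D)
                   (hidden states of T frames, D features per layer)
    layer_weights: L learnable scalars (hypothetical parameters)
    Returns a fused (T, D) representation.
    """
    w = softmax(np.asarray(layer_weights, dtype=float))  # (L,)
    stacked = np.stack(layer_outputs)                    # (L, T, D)
    return np.tensordot(w, stacked, axes=1)              # (T, D)

# Toy usage: 3 encoder layers, 4 frames, 2 features each.
layers = [np.full((4, 2), float(i)) for i in range(3)]
fused = fuse_layers(layers, [0.0, 0.0, 0.0])  # equal weights -> layer mean
```

With equal weights the fusion reduces to a plain average of the layers; training would adjust the weights so that layers carrying more useful acoustic or linguistic information dominate.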
Source:
Neural Processing Letters
ISSN: 1370-4621
Year: 2024
Issue: 4
Volume: 56
Impact Factor: 3.100 (JCR@2022)
Cited Count:
SCOPUS Cited Count: 2
ESI Highly Cited Papers on the List: 0