Abstract:
Speech separation has advanced significantly in recent years. To separate mixture signals correctly, it is critical to encode them into an appropriate latent space. Existing speech separation methods transform the mixed signals into either a frequency-domain or a time-domain representation. The frequency-domain features (spectrograms) are generated by the short-time Fourier transform (STFT); they are closely related to speech articulation and directly reflect the energy of speech. The time-domain features are learned in a latent embedding space, and separation benefits from the end-to-end structure. However, these methods rely on representations from only one domain, which is insufficient to provide an encoding space in which the sources are completely separable. Therefore, a Triple-Path Recurrent Neural Network (TPRNN) that fuses features from the two domains is proposed. It employs the spectrogram as auxiliary information to improve speech separation performance. Experimental results on the Wall Street Journal (WSJ0) dataset show that this approach improves speech separation performance. © 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
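As a rough illustration of the frequency-domain features mentioned in the abstract, the sketch below computes a magnitude spectrogram with a plain short-time Fourier transform. It is not the paper's implementation: the function name `magnitude_spectrogram`, the frame length, hop size, Hann window, sample rate, and the two-tone test signal are all assumptions chosen for the example.

```python
import cmath
import math


def magnitude_spectrogram(signal, frame_len=64, hop=32):
    """Return |STFT| as a list of frames, each holding frame_len // 2 + 1 bins.

    Hypothetical helper for illustration; parameters are assumed, not from the paper.
    """
    # Periodic Hann window to reduce spectral leakage between bins.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    num_frames = 1 + (len(signal) - frame_len) // hop
    num_bins = frame_len // 2 + 1
    spec = []
    for m in range(num_frames):
        # Slice one overlapping frame and apply the window.
        frame = [signal[m * hop + n] * window[n] for n in range(frame_len)]
        # Naive DFT of the frame, keeping only non-negative frequencies.
        row = [
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n in range(frame_len)))
            for k in range(num_bins)
        ]
        spec.append(row)
    return spec


# Assumed toy "mixture": a 1000 Hz tone plus a weaker 2000 Hz tone at 8 kHz.
sample_rate = 8000
mixture = [
    math.sin(2 * math.pi * 1000 * n / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2000 * n / sample_rate)
    for n in range(1024)
]

spec = magnitude_spectrogram(mixture)

# Sum magnitudes over time to find the bin carrying the most energy.
energy_per_bin = [sum(row[k] for row in spec) for k in range(len(spec[0]))]
strongest_bin = max(range(len(energy_per_bin)), key=energy_per_bin.__getitem__)
strongest_hz = strongest_bin * sample_rate / 64
```

With these assumed settings each bin spans 125 Hz, so the stronger tone lands in a single bin; this is the sense in which a spectrogram directly reflects the energy of the signal. A practical system would use an FFT-based routine instead of the naive DFT shown here.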
ISSN: 1876-1100
Year: 2024
Volume: 1112 LNEE
Page: 239-248
Language: English
WoS CC Cited Count: 0
ESI Highly Cited Papers on the List: 0