Abstract:
Most existing audio-visual fusion mechanisms combine audio and visual cues by direct concatenation or summation, which fails to fully exploit the complementary information in the two modalities. To address this problem, this paper proposes a multimodal target speech extraction algorithm based on a long-short term attention mechanism. First, audio and lip features are extracted by convolutional neural networks and chunked along the time axis with a 50% overlap factor. Second, the short-term and long-term correlations between the sequences are computed by the long-short term attention (LSTA) mechanism. Finally, the obtained target speech mask sequence is multiplied with the speech sequence and passed through the decoder to obtain the estimated target speaker's speech. Experimental results show that, compared with the baseline algorithms, the proposed method achieves better scale-invariant signal-to-noise ratio improvement (SI-SNRi) and perceptual evaluation of speech quality (PESQ) scores, and yields consistent improvements in cross-dataset evaluation. © 2023 IEEE.
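The abstract only sketches the pipeline at a high level. As an illustration, the snippet below shows one plausible reading of the chunking and long-short term attention steps in PyTorch; the chunk size, feature dimensions, module names, and the intra-chunk/inter-chunk interpretation of "short term" and "long term" attention are assumptions for this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn

class LSTABlock(nn.Module):
    """Illustrative long-short term attention block (assumed dual-path style):
    self-attention within each chunk models short-term correlation,
    self-attention across chunks models long-term correlation."""
    def __init__(self, dim, n_heads=4):
        super().__init__()
        self.short_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.long_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_short = nn.LayerNorm(dim)
        self.norm_long = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, n_chunks, chunk_len, dim)
        b, s, k, d = x.shape
        # Short-term correlation: attend within each chunk.
        xs = x.reshape(b * s, k, d)
        xs = self.norm_short(xs + self.short_attn(xs, xs, xs)[0])
        x = xs.reshape(b, s, k, d)
        # Long-term correlation: attend across chunks at each intra-chunk position.
        xl = x.permute(0, 2, 1, 3).reshape(b * k, s, d)
        xl = self.norm_long(xl + self.long_attn(xl, xl, xl)[0])
        return xl.reshape(b, k, s, d).permute(0, 2, 1, 3)

def chunk_with_overlap(feats, chunk_len):
    """Split (batch, time, dim) features into chunks with 50% overlap."""
    hop = chunk_len // 2
    # unfold over the time axis -> (batch, n_chunks, dim, chunk_len), then reorder.
    return feats.unfold(dimension=1, size=chunk_len, step=hop).permute(0, 1, 3, 2)

# Toy usage with made-up sizes: fused audio-visual features -> per-chunk mask estimate.
fused = torch.randn(2, 200, 64)                      # (batch, time, feature) from the CNN front-ends
chunks = chunk_with_overlap(fused, chunk_len=40)     # (batch, n_chunks, 40, 64)
mask = torch.sigmoid(LSTABlock(64)(chunks))          # mask values in [0, 1]
```

In the full system the mask chunks would be overlap-added back onto the time axis, applied to the encoded speech sequence, and passed through the decoder; those steps are omitted from this sketch.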
Year: 2023
Language: English
Cited Count:
WoS CC Cited Count: 0
ESI Highly Cited Papers on the List: 0