Abstract:
As a practical approach to the cocktail party problem, target speaker extraction (TSE) has received increasing attention. TSE is typically explored in two scenarios. The first is a specific scenario, in which the target speaker is known to be present and the signal received by the microphone contains at least two speakers. The second is a universal scenario, in which the presence or absence of the target speaker is uncertain and the received signal may contain one or more speakers. In both scenarios, many TSE studies rely on a speaker embedding derived from an enrollment utterance of the target speaker to guide the extraction. However, relying solely on this embedding may not fully exploit the contextual information within the enrollment. To address this limitation, an approach that directly exploits contextual information in the time-frequency (T-F) domain was recently proposed. This paper improves that approach by integrating our previously proposed coarse-to-fine framework, which consists of a coarse extraction followed by a refining extraction. For the specific scenario, an interaction block with an attention mechanism enables direct interaction between the T-F representations of the enrollment and the received signal, yielding a consistent T-F representation of the enrollment that guides the coarse extraction. The T-F representation of the coarsely extracted signal, which may contain distortions, is then used to guide the refining extraction. A residual T-F representation obtained during the refining extraction mitigates the distortions of the coarse extraction and increases the precision of speaker extraction. In addition, this paper explores an undisturbed version of the universal scenario, in which noise and reverberation are not considered.
An additional processor is integrated and a two-level decision-making scheme is devised to generalize the coarse-to-fine TSE method to this undisturbed universal scenario. The proposed method achieves state-of-the-art performance on benchmark datasets for the specific scenario and is shown to be effective for the undisturbed universal scenario.
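The abstract's interaction block can be illustrated with a minimal cross-attention sketch, in which frames of the received signal's T-F representation attend over enrollment frames to produce mixture-aligned guidance. This is only an illustrative sketch of the general mechanism; the function name, tensor shapes, and single-head scaled dot-product form are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mixture, enrollment):
    """Scaled dot-product cross-attention (illustrative sketch).

    mixture:    (T_mix, D) T-F representation of the received signal
    enrollment: (T_enr, D) T-F representation of the enrollment utterance

    Returns a (T_mix, D) enrollment representation aligned frame-by-frame
    with the mixture, which could serve as guidance for coarse extraction.
    """
    d = mixture.shape[-1]
    scores = mixture @ enrollment.T / np.sqrt(d)  # (T_mix, T_enr) similarities
    weights = softmax(scores, axis=-1)            # attend over enrollment frames
    return weights @ enrollment                   # (T_mix, D) guidance

rng = np.random.default_rng(0)
guidance = cross_attention(rng.standard_normal((50, 64)),
                           rng.standard_normal((30, 64)))
print(guidance.shape)  # (50, 64)
```

In the paper's full pipeline this guidance would condition the coarse extractor, whose (possibly distorted) output in turn guides the refining extraction.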
Source :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
ISSN: 2329-9290
Year: 2024
Volume: 32
Page: 1-15
Impact Factor: 5.400 (JCR@2022)
SCOPUS Cited Count: 1
ESI Highly Cited Papers on the List: 0