
Author:

Yang, X. | Bao, C. | Chen, X.

Indexed by:

EI, Scopus, SCIE

Abstract:

As a practical way to address the cocktail party problem, target speaker extraction (TSE) has received increasing attention recently. TSE is typically explored in two scenarios. The first is a specific scenario, in which the target speaker is present and the signal received by the microphone contains at least two speakers. The second is a universal scenario, in which the presence of the target speaker is uncertain and the received signal may contain one or multiple speakers. In both scenarios, numerous TSE studies rely on a speaker embedding derived from an enrollment of the target speaker to guide the extraction. However, relying solely on this embedding may not fully leverage the contextual information within the enrollment. To address this limitation, a novel approach that directly exploits the contextual information in the time-frequency (T-F) domain was recently proposed. This paper improves that approach by integrating our previously proposed coarse-to-fine framework, which consists of a coarse extraction and a refining extraction. For the specific scenario, an interaction block with an attention mechanism is employed to facilitate direct interaction between the T-F representations of the enrollment and the received signal. This direct interaction yields a consistent T-F representation of the enrollment that serves as guidance for the coarse extraction. Afterwards, the T-F representation of the coarsely extracted signal, which may contain some distortions, is used to guide the refining extraction. The residual T-F representation obtained during the refining extraction mitigates the distortions of the coarse extraction and increases the precision of speaker extraction. In addition, this paper explores an undisturbed version of the universal scenario in which noise and reverberation are not considered. An additional processor is integrated and a two-level decision-making scheme is devised to generalize the coarse-to-fine TSE method to this undisturbed universal scenario. The proposed method achieves state-of-the-art performance on different benchmark datasets for the specific scenario and is shown to be effective for the undisturbed universal scenario.
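
The attention-based interaction between the enrollment and the received signal can be illustrated with a minimal sketch. It is not the authors' interaction block: it assumes plain scaled dot-product cross-attention with no learned projections, and it reads "consistent" as "aligned to the frames of the received signal". The function name `align_enrollment`, the feature dimension, and the random inputs are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_enrollment(mixture, enrollment):
    """Hypothetical cross-attention: mixture frames query enrollment frames.

    mixture:    (T_mix, D) features of the received signal
    enrollment: (T_enr, D) features of the enrollment utterance
    returns:    (T_mix, D) enrollment features re-weighted per mixture frame
    """
    d = mixture.shape[-1]
    scores = mixture @ enrollment.T / np.sqrt(d)  # (T_mix, T_enr) similarities
    weights = softmax(scores, axis=-1)            # each row sums to 1
    return weights @ enrollment                   # convex mix of enrollment frames

# Random stand-ins for T-F features (assumption: 16-dimensional frames).
rng = np.random.default_rng(0)
mixture = rng.standard_normal((50, 16))
enrollment = rng.standard_normal((80, 16))
guide = align_enrollment(mixture, enrollment)
print(guide.shape)  # (50, 16)
```

The output has one guidance vector per frame of the received signal regardless of the enrollment length, which is the property a frame-wise extraction stage would need. A trained model would typically add learned query, key, and value projections and operate on full T-F feature maps.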

Keyword:

Decision making; Speech processing; Data mining; Attention mechanisms; Distortion; Target speaker extraction; contextual information; Refining; speaker embedding; attention mechanism; Visualization

Author Community:

  • [ 1 ] [Yang X.] Institute of Speech and Audio Information Processing, School of Information Science and Technology, Beijing University of Technology, China
  • [ 2 ] [Bao C.] Institute of Speech and Audio Information Processing, School of Information Science and Technology, Beijing University of Technology, China
  • [ 3 ] [Chen X.] Institute of Speech and Audio Information Processing, School of Information Science and Technology, Beijing University of Technology, China


Source:

IEEE/ACM Transactions on Audio, Speech, and Language Processing

ISSN: 2329-9290

Year: 2024

Volume: 32

Page: 1-15

Impact factor: 5.400 (JCR@2022)

SCOPUS Cited Count: 1

ESI Highly Cited Papers on the List: 0
