TARGET SPEAKER EXTRACTION BY DIRECTLY EXPLOITING CONTEXTUAL INFORMATION IN THE TIME-FREQUENCY DOMAIN

Author：

Yang, Xue | Bao, Changchun | Zhou, Jing | Chen, Xianhong

Indexed by：

CPCI-S

Abstract：

In target speaker extraction, many studies rely on a speaker embedding that is obtained from an enrollment of the target speaker and used as guidance. However, using the speaker embedding alone may not fully exploit the contextual information contained in the enrollment. In this paper, we directly exploit this contextual information in the time-frequency (T-F) domain. Specifically, the T-F representations of the enrollment and the mixed signal interact through an attention mechanism to compute weighting matrices. These weighting matrices reflect the similarity among different frames of the T-F representations and are further used to obtain consistent T-F representations of the enrollment. These consistent representations serve as the guidance, allowing better exploitation of the contextual information. Furthermore, the proposed method achieves state-of-the-art performance on the benchmark dataset and is shown to be effective in complex scenarios.
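
The attention step described in the abstract can be illustrated with a minimal NumPy sketch. This is one plausible reading of the abstract, not the authors' implementation: the function name `align_enrollment`, the scaled dot-product similarity, the softmax normalization, and the array shapes are all assumptions made for illustration. Mixture frames act as queries over enrollment frames, so the weighting matrix has one row per mixture frame, and the weighted sum yields an enrollment representation with the same number of frames as the mixture.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def align_enrollment(mixture, enrollment):
    """Hypothetical helper: attend from mixture frames to enrollment frames.

    mixture:    (T_mix, F) T-F features of the mixed signal (assumed shape)
    enrollment: (T_enr, F) T-F features of the enrollment (assumed shape)
    Returns the weighting matrix (T_mix, T_enr) and the enrollment
    representation re-expressed on the mixture's time axis (T_mix, F).
    """
    feature_dim = mixture.shape[-1]
    # Frame-to-frame similarity, scaled as in standard dot-product attention.
    scores = mixture @ enrollment.T / np.sqrt(feature_dim)
    # Each row is a distribution over enrollment frames.
    weights = softmax(scores, axis=-1)
    # Weighted sum of enrollment frames for every mixture frame.
    aligned = weights @ enrollment
    return weights, aligned


# Random stand-ins for real spectrogram features (assumed sizes).
rng = np.random.default_rng(0)
mixture = rng.standard_normal((50, 129))
enrollment = rng.standard_normal((80, 129))
weights, aligned = align_enrollment(mixture, enrollment)
```

In this sketch, `aligned` would play the role of the guidance signal in place of a single fixed-length speaker embedding; a real system would typically compute the similarity on learned projections of the features rather than on the raw T-F values.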

Keyword：

contextual information | attention mechanism | speaker embedding | speech separation | target speaker extraction

Author Community：

[ 1 ] [Yang, Xue]Beijing Univ Technol, Fac Informat Technol, Inst Speech & Audio Informat Proc, Beijing 100124, Peoples R China
[ 2 ] [Bao, Changchun]Beijing Univ Technol, Fac Informat Technol, Inst Speech & Audio Informat Proc, Beijing 100124, Peoples R China
[ 3 ] [Zhou, Jing]Beijing Univ Technol, Fac Informat Technol, Inst Speech & Audio Informat Proc, Beijing 100124, Peoples R China
[ 4 ] [Chen, Xianhong]Beijing Univ Technol, Fac Informat Technol, Inst Speech & Audio Informat Proc, Beijing 100124, Peoples R China

Reprint Author's Address：

[Bao, Changchun]Beijing Univ Technol, Fac Informat Technol, Inst Speech & Audio Informat Proc, Beijing 100124, Peoples R China

Email：

yangx11@emails.bjut.edu.cn |
baochch@bjut.edu.cn |
zhoujing@emails.bjut.edu.cn |
chenxianhong@bjut.edu.cn

Related Articles：

Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation
2024，ACM Transactions on Audio Speech and Language Processing
Monaural Speech Separation Method Based on Recurrent Attention with Parallel Branches
2023，
Research Situation and Prospects of Multi-speaker Separation and Target Speaker Extraction; [多说话人分离与目标说话人提取的研究现状与展望]
2024，Journal of Data Acquisition and Processing
Speaker diarization with variants of self-attention and joint speaker embedding extractor
2023，Journal of Intelligent and Fuzzy Systems

Source ：

2024 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2024)

ISSN： 1520-6149

Year： 2024

Page： 10476-10480

Cited Count：

WoS CC Cited Count： 2

SCOPUS Cited Count： 5

ESI Highly Cited Papers on the List： 0

30 Days PV： 7
