Abstract:
As a practical approach to the cocktail party problem, target speaker extraction (TSE) has received increasing attention. TSE is typically explored in two scenarios. The first is a specific scenario, in which the target speaker is known to be present and the signal received by the microphone contains at least two speakers. The second is a universal scenario, in which the presence or absence of the target speaker is uncertain and the received signal may contain one or more speakers. In both scenarios, many TSE studies rely on a speaker embedding derived from an enrollment utterance of the target speaker to guide the extraction. However, relying solely on this embedding may not fully exploit the contextual information within the enrollment. To address this limitation, an approach that directly exploits contextual information in the time-frequency (T-F) domain was recently proposed. This paper improves that approach by integrating our previously proposed coarse-to-fine framework, which consists of a coarse extraction followed by a refining extraction. For the specific scenario, an interaction block with an attention mechanism enables direct interaction between the T-F representations of the enrollment and the received signal, yielding a consistent T-F representation of the enrollment that guides the coarse extraction. The T-F representation of the coarsely extracted signal, which may contain distortions, is then used to guide the refining extraction. A residual T-F representation obtained during the refining extraction mitigates the distortions of the coarse extraction and increases the precision of speaker extraction. In addition, this paper explores an undisturbed version of the universal scenario, in which noise and reverberation are not considered.
An additional processor is integrated and a two-level decision-making scheme is devised to generalize the coarse-to-fine TSE method to this undisturbed universal scenario. The proposed method achieves state-of-the-art performance on benchmark datasets for the specific scenario and is shown to be effective for the undisturbed universal scenario.
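The abstract's interaction block can be illustrated with a minimal cross-attention sketch, in which frames of the received signal's T-F representation attend over enrollment frames to produce mixture-aligned guidance. This is only an illustrative sketch of the general mechanism; the function name, tensor shapes, and single-head scaled dot-product form are assumptions, not details taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(mixture, enrollment):
    """Scaled dot-product cross-attention (illustrative sketch).

    mixture:    (T_mix, D) T-F representation of the received signal
    enrollment: (T_enr, D) T-F representation of the enrollment utterance

    Returns a (T_mix, D) enrollment representation aligned frame-by-frame
    with the mixture, which could serve as guidance for coarse extraction.
    """
    d = mixture.shape[-1]
    scores = mixture @ enrollment.T / np.sqrt(d)  # (T_mix, T_enr) similarities
    weights = softmax(scores, axis=-1)            # attend over enrollment frames
    return weights @ enrollment                   # (T_mix, D) guidance

rng = np.random.default_rng(0)
guidance = cross_attention(rng.standard_normal((50, 64)),
                           rng.standard_normal((30, 64)))
print(guidance.shape)  # (50, 64)
```

In the paper's full pipeline this guidance would condition the coarse extractor, whose (possibly distorted) output in turn guides the refining extraction.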
Source :
IEEE/ACM Transactions on Audio, Speech, and Language Processing
ISSN: 2329-9290
Year: 2024
Volume: 32
Page: 1-15
Impact Factor: 5.400 (JCR@2022)
SCOPUS Cited Count: 1
ESI Highly Cited Papers on the List: 0