Abstract:
Speaker diarization is the task of automatically distinguishing speakers within an audio recording without any prior information about them. The self-attention mechanism introduced in End-to-End Neural Speaker Diarization (EEND) elegantly handles the problem of overlapping speakers. The Transformer model, built on self-attention, excels at capturing global information and has achieved remarkable results across many tasks. However, individual speaker characteristics are reflected predominantly in local contextual information, which conventional self-attention does not adequately capture. In this study, we propose a hierarchical-encoder model that strengthens the encoders' acquisition of speaker information in two distinct ways: (1) constraining the receptive field of the self-attention mechanism with left-right windows or Gaussian weights to emphasize contextual information; (2) using a pre-trained time-delay neural network based speaker-embedding extractor to compensate for weaknesses in speaker-feature extraction. We evaluate the proposed methods on a simulated two-speaker dataset and a real conversational dataset. The best-performing of the proposed models achieves a diarization error rate of 7.74% on the simulated dataset and 21.92% on MagicData-RAMC after adaptation. These results demonstrate the efficacy of the proposed methods. © 2023 - IOS Press. All rights reserved.
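The abstract describes restricting self-attention's receptive field with either a hard left-right window or soft Gaussian weights. The paper's exact formulation is not given here, so the following is a minimal NumPy sketch under common assumptions: locality is imposed as an additive bias on the attention scores, with `window` (half-width of the left-right window) and `sigma` (Gaussian std-dev) as hypothetical parameter names.

```python
import numpy as np

def attention_bias(T, window=None, sigma=None):
    """Additive bias over T frames, added to attention scores.

    window: half-width of a left-right window; positions outside
            [i - window, i + window] receive -inf (hard locality).
    sigma:  std-dev of a Gaussian penalty -(i - j)^2 / (2 * sigma^2)
            (soft locality, down-weighting distant frames).
    """
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])  # |i - j| distance matrix
    bias = np.zeros((T, T))
    if window is not None:
        bias[dist > window] = -np.inf
    if sigma is not None:
        bias = bias - dist.astype(float) ** 2 / (2.0 * sigma ** 2)
    return bias

def local_self_attention(Q, K, V, window=None, sigma=None):
    """Scaled dot-product self-attention with a locality bias."""
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d) + attention_bias(T, window, sigma)
    scores = scores - scores.max(axis=1, keepdims=True)  # stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V
```

With `window=w`, a frame attends only to its 2w+1 neighbors, so exp(-inf) zeroes out all weights beyond the window; the Gaussian variant instead decays attention smoothly with frame distance, which matches the intuition that speaker identity cues are concentrated in nearby context.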
Source:
Journal of Intelligent and Fuzzy Systems
ISSN: 1064-1246
Year: 2023
Issue: 5
Volume: 45
Page: 9169-9180
2.000 (JCR@2022)
ESI Highly Cited Papers on the List: 0
30 Days PV: 11