AudioVSR: Enhancing Video Speech Recognition with Audio Data

Abstract：

Visual Speech Recognition (VSR) aims to predict spoken content by analyzing lip movements in videos. Recently reported state-of-the-art results in VSR often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are insufficient compared to the audio data. To further enhance the VSR model using the audio data, we employed a generative model for data inflation, integrating the synthetic data with the authentic visual data. Essentially, the generative model incorporates another insight, which enhances the capabilities of the recognition model. For the cross-language issue, previous work has shown poor performance with non-Indo-European languages. We trained a multi-language-family modal fusion model, AudioVSR. Leveraging the concept of modal transfer, we achieved significant results in downstream VSR tasks under conditions of data scarcity. To the best of our knowledge, AudioVSR represents the first work on cross-language-family audio-lip alignment, achieving a new SOTA in the cross-language scenario. © 2024 Association for Computational Linguistics.

Keyword：

Video analysis; Data assimilation; Data integration; Computational linguistics; Speech recognition

Author Community：

[ 1 ] [Yang, Xiaoda] Zhejiang University, China
[ 2 ] [Cheng, Xize] Zhejiang University, China
[ 3 ] [Duan, Jiaqi] Qingdao University, China
[ 4 ] [Qiu, Hongshun] Beijing University of Technology, China
[ 5 ] [Hong, Minjie] Zhejiang University, China
[ 6 ] [Fang, Minghui] Zhejiang University, China
[ 7 ] [Ji, Shengpeng] Zhejiang University, China
[ 8 ] [Zuo, Jialung] Zhejiang University, China
[ 9 ] [Hong, Zhiqing] Zhejiang University, China
[ 10 ] [Zhang, Zhimeng] Zhejiang University, China
[ 11 ] [Jin, Tao] Zhejiang University, China

Year： 2024

Page： 15352-15361

Language： English
