Abstract:
The temporal text localization (TTL) task aims to identify the segment within a long untrimmed video that semantically matches a given textual query. However, most methods require extensive manual annotation of temporal boundaries for each query, which restricts their scalability and practicality in real-world applications. Moreover, modeling temporal context is particularly crucial for the TTL task. In this paper, a Vision Token Rolling Transformer for weakly supervised temporal text localization (VTR-former) is developed. VTR-former does not rely on predefined temporal boundaries during training or testing. By rolling vision tokens and employing advanced transformer-based feature learning modules, it significantly improves the model's ability to capture temporal information and learn feature representations. Experiments on two challenging benchmarks, Charades-STA and ActivityNet Captions, demonstrate that VTR-former outperforms the baseline network and achieves leading performance. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2025.
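The record does not describe the rolling operation in detail, so the sketch below is only one plausible reading of "rolling vision tokens": cyclically shifting per-frame tokens along the temporal axis (here via PyTorch's torch.roll) so that each position also carries a neighboring frame's features before transformer attention. The function name roll_vision_tokens and the additive fusion step are hypothetical illustrations, not the paper's actual design.

```python
import torch

def roll_vision_tokens(tokens: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Cyclically shift vision tokens along the temporal (frame) axis.

    tokens: (batch, num_frames, dim) per-frame vision tokens.
    A positive shift aligns each position with an earlier frame,
    mixing temporal context into the token sequence.
    """
    return torch.roll(tokens, shifts=shift, dims=1)

# Hypothetical usage: combine original and rolled tokens before a
# transformer encoder so attention sees temporally shifted context.
tokens = torch.randn(2, 64, 512)        # 2 clips, 64 frames, 512-d tokens
rolled = roll_vision_tokens(tokens, 1)  # frame t now paired with frame t-1
fused = tokens + rolled                 # simple additive fusion (illustrative)
```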
Source: Communications in Computer and Information Science
ISSN: 1865-0929
Year: 2025
Volume: 2302 CCIS
Page: 203-217
Language: English