
Author:

Xie, B. | Deng, Y. | Shao, Z. | Xu, Q. | Li, Y.

Indexed by:

EI, Scopus, SCIE

Abstract:

Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components: the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. To enable the network to incorporate long-range temporal structure, we introduce a segment modeling strategy (S2TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks: object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
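
For readers skimming this record, the pipeline described in the abstract (event stream → voxel sets → local aggregation → global self-attention → classification) can be sketched in a few lines of PyTorch. This is a minimal illustrative sketch, not the authors' implementation: the voxel features, grid resolution, MNEL and VSAL internals, and S2TM are all replaced by simple stand-ins, and every name in the snippet (events_to_voxel_set, EVSTrSketch, etc.) is hypothetical.

```python
import torch
import torch.nn as nn


def events_to_voxel_set(events, grid=(8, 8, 8)):
    """Toy voxelization: bin normalized (x, y, t, polarity) events into a
    regular grid, keeping (event count, polarity sum) as per-voxel features.
    The paper's actual voxel features and resolution are not given here;
    these choices are assumptions for illustration only."""
    gx, gy, gt = grid
    x = (events[:, 0].clamp(0, 1 - 1e-6) * gx).long()
    y = (events[:, 1].clamp(0, 1 - 1e-6) * gy).long()
    t = (events[:, 2].clamp(0, 1 - 1e-6) * gt).long()
    idx = (x * gy + y) * gt + t                        # flat voxel index
    feats = torch.zeros(gx * gy * gt, 2)
    src = torch.stack([torch.ones(len(events)), events[:, 3]], dim=1)
    feats.index_add_(0, idx, src)                      # accumulate per voxel
    return feats                                       # (num_voxels, 2)


class NeighborEmbedding(nn.Module):
    """Crude stand-in for the paper's MNEL: a pointwise MLP. The real layer
    aggregates multi-scale neighborhood information, omitted here."""
    def __init__(self, in_dim, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x):                              # (B, N, in_dim)
        return self.mlp(x)


class VoxelSelfAttention(nn.Module):
    """Stand-in for the paper's VSAL: vanilla multi-head self-attention
    over the voxel set, with a residual connection and layer norm."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                              # (B, N, dim)
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class EVSTrSketch(nn.Module):
    """Minimal end-to-end pipeline: embed voxels locally, let them interact
    globally, pool, and classify. S2TM (segment modeling over a sequence of
    voxel sets) is approximated by simple mean pooling here."""
    def __init__(self, in_dim=2, dim=64, num_classes=10):
        super().__init__()
        self.embed = NeighborEmbedding(in_dim, dim)
        self.interact = VoxelSelfAttention(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, voxels):                         # (B, N, in_dim)
        x = self.interact(self.embed(voxels))
        return self.head(x.mean(dim=1))                # (B, num_classes)


if __name__ == "__main__":
    events = torch.rand(1000, 4)                       # synthetic (x, y, t, p)
    events[:, 3] = events[:, 3].round() * 2 - 1        # polarity in {-1, +1}
    voxels = events_to_voxel_set(events).unsqueeze(0)  # (1, 512, 2)
    print(EVSTrSketch()(voxels).shape)                 # torch.Size([1, 10])
```

The sketch only mirrors the abstract's data flow; in particular, replacing the pointwise MLP with a genuine multi-scale neighbor aggregator and mean pooling with segment-level temporal modeling is where the paper's actual contributions lie.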

Keyword:

neuromorphic vision; object classification; feature extraction; streams; computational modeling; event camera; spatiotemporal phenomena; transformers; attention mechanism; task analysis; action recognition; cameras

Author Community:

  • [ 1 ] [Xie B.] Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China
  • [ 2 ] [Deng Y.] College of Computer Science, Beijing University of Technology, Beijing, China
  • [ 3 ] [Shao Z.] College of Information Science and Engineering, Hunan Normal University, Changsha, China
  • [ 4 ] [Xu Q.] Department of Electromechanical Engineering, University of Macau, Macao SAR, China
  • [ 5 ] [Li Y.] Department of Mechanical Engineering, City University of Hong Kong, Hong Kong SAR, China

Source:

IEEE Transactions on Circuits and Systems for Video Technology

ISSN: 1051-8215

Year: 2024

Page: 1-1

Impact Factor (JCR@2022): 8.400

Cited Count:

WoS CC Cited Count: 36

SCOPUS Cited Count: 1

ESI Highly Cited Papers on the List: 0

30-Day Page Views: 16
