S3T-Former: A Purely Spike-Driven State-Space Topology Transformer for Skeleton Action Recognition

2026-03-20 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

S3T-Former is a purely spike-driven Transformer architecture for energy-efficient skeleton-based action recognition, addressing the high power consumption of conventional Artificial Neural Networks (ANNs) on edge devices. It introduces a Multi-Stream Anatomical Spiking Embedding (M-ASE) that acts as a kinematic differential operator, transforming multimodal skeleton features into sparse event streams without heavy fusion. To achieve topological and temporal sparsity, it adds Lateral Spiking Topology Routing (LSTR) for on-demand, conditional spike propagation and a Spiking State-Space (S3) Engine that captures long-range temporal dynamics. In experiments on the NTU RGB+D 60, NTU RGB+D 120, and NW-UCLA datasets, S3T-Former achieves competitive accuracy: it outperforms existing Spiking GCNs by up to 6.38% and surpasses several ANNs, while theoretically reducing energy consumption to less than 10% of that of comparable ANNs.

Key takeaway

For computer vision engineers building action recognition systems for resource-constrained edge devices, S3T-Former is a strong candidate. Its purely spike-driven architecture cuts theoretical energy consumption to less than 10% of that of comparable ANNs while remaining competitive with, and in several cases exceeding, ANN accuracy. Consider adopting its principles, particularly M-ASE for efficient data representation and the S3-Engine for long-range temporal reasoning, when building high-performance, low-power systems.

Key insights

S3T-Former is a purely spike-driven Transformer for energy-efficient skeleton action recognition, achieving high accuracy with extreme sparsity.

Principles

Spiking Neural Networks offer energy efficiency over ANNs.
Sparsity in SNNs can be maximized by focusing on dynamic changes.
Long-range temporal dynamics are crucial for action recognition.
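
The energy principle above is often quantified with a simple operation-count model. The sketch below is an illustration under stated assumptions, not the paper's accounting. It assumes the commonly cited 45 nm estimates of roughly 4.6 pJ per 32-bit floating-point multiply-accumulate (MAC) and 0.9 pJ per accumulate (AC). The function names, firing rate, and timestep count are hypothetical.

```python
# Commonly cited 45 nm energy estimates (assumed here, not taken from the paper).
E_MAC_PJ = 4.6  # energy per 32-bit float multiply-accumulate, in picojoules
E_AC_PJ = 0.9   # energy per 32-bit float accumulate, in picojoules


def ann_energy_pj(macs: float) -> float:
    """Energy of a dense ANN layer: every MAC is executed."""
    return macs * E_MAC_PJ


def snn_energy_pj(synaptic_ops: float, firing_rate: float, timesteps: int) -> float:
    """Energy of a spiking layer: only accumulates triggered by spikes cost energy."""
    return synaptic_ops * firing_rate * timesteps * E_AC_PJ


if __name__ == "__main__":
    ops = 1e9  # hypothetical operation count shared by both layers
    ann = ann_energy_pj(ops)
    snn = snn_energy_pj(ops, firing_rate=0.1, timesteps=4)
    print(f"SNN / ANN energy ratio: {snn / ann:.3f}")
```

With these hypothetical inputs the ratio falls below 0.1, which shows how a low firing rate can produce savings of the order the summary reports. The actual figure depends on measured spike rates and the hardware model.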

Method

S3T-Former uses M-ASE for sparse kinematic embedding, ATG-QKV for dynamic attention, LSTR for zero-MAC spatial routing, and an S3-Engine for linear-complexity temporal memory, all within a purely spike-driven Transformer architecture.
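
To make "linear-complexity temporal memory" concrete, here is a minimal sketch of a scalar state-space recurrence that reads and emits binary spikes. It is a simplified analogue, not the S3-Engine itself: the function name `spiking_state_scan`, the decay, gain, and threshold values, and the soft reset are all assumptions made for illustration.

```python
from typing import List


def spiking_state_scan(spikes: List[int], decay: float = 0.9,
                       gain: float = 1.0, threshold: float = 1.5) -> List[int]:
    """Scan h_t = decay * h_{t-1} + gain * x_t over a binary spike train and
    emit an output spike whenever h_t reaches the threshold.

    The sequence is visited once, so cost grows linearly with its length.
    """
    state = 0.0
    out = []
    for x in spikes:
        # x is 0 or 1, so the input term is a conditional add of `gain`.
        state = decay * state + gain * x
        if state >= threshold:
            out.append(1)
            state -= threshold  # soft reset after firing (an assumed choice)
        else:
            out.append(0)
    return out


if __name__ == "__main__":
    # The last spike fires only because the state still remembers the first.
    print(spiking_state_scan([1, 0, 0, 0, 1]))
    print(spiking_state_scan([0, 0, 0, 0, 1]))
```

The two calls differ only in the first input spike, yet only the first produces an output spike. The decaying state carries evidence across the gap without comparing every pair of timesteps, which is the property that motivates a state-space formulation over quadratic attention.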

In practice

Use M-ASE to convert dense skeleton data into sparse event streams.
Implement ATG-QKV to reduce spike rates by focusing on motion gradients.
Employ LSTR for efficient, anatomically-guided spatial spike propagation.
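
The first and third items above can be sketched in a few lines. This is a rough analogue under stated assumptions, not the paper's M-ASE or LSTR: frame differencing stands in for the kinematic differential operator, and neighbour-list accumulation stands in for anatomically guided routing. The names `delta_spike_encode` and `route_spikes`, the threshold, and the three-joint chain are hypothetical.

```python
from typing import List

Frame = List[List[float]]  # one frame: joints x coordinates


def delta_spike_encode(frames: List[Frame], threshold: float = 0.05) -> List[List[int]]:
    """Emit a spike for each joint that moved more than `threshold` along any
    axis between consecutive frames; static joints stay silent."""
    events = []
    for prev, curr in zip(frames, frames[1:]):
        row = []
        for p, c in zip(prev, curr):
            moved = any(abs(ci - pi) > threshold for pi, ci in zip(p, c))
            row.append(1 if moved else 0)
        events.append(row)
    return events


def route_spikes(spikes: List[int], neighbors: List[List[int]]) -> List[int]:
    """Propagate spikes along skeleton edges using additions only: each joint
    counts the spikes arriving from its anatomical neighbours."""
    counts = [0] * len(spikes)
    for src, fired in enumerate(spikes):
        if not fired:
            continue  # silent joints trigger no work
        for dst in neighbors[src]:
            counts[dst] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical chain: shoulder (0) - elbow (1) - wrist (2).
    neighbors = [[1], [0, 2], [1]]
    still = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
    wrist_up = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.2, 0.0]]
    events = delta_spike_encode([still, wrist_up, wrist_up])
    print(events)
    print(route_spikes(events[0], neighbors))
```

Because the spikes are binary and the neighbour lists encode fixed edges, routing needs no multiplications, and a frame in which nothing moves costs nothing beyond the loop over joints.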

Topics

S3T-Former
Spiking Neural Networks
Skeleton Action Recognition
State-Space Models
Neuromorphic Computing

Best for: Computer Vision Engineer, Research Scientist, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.