Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition
Summary
A Spatial-Temporal Decoupled Adapter, combined with Adaptive Soft Balanced Augmentation, achieved an F1 score of 0.43808 and ranked 1st in Track 2 (micro-gesture online recognition) of the 4th EI-MiGA-IJCAI Challenge. The task is to localize and classify subtle gestures in untrimmed videos, which is hard because micro-gestures are extremely short, have low motion amplitude, and offer ambiguous visual cues. The proposed adapter splits video adaptation into independent temporal and spatial branches built from lightweight depthwise convolutions, with the aim of avoiding entanglement between the two kinds of features. Adaptive Soft Balanced Augmentation adjusts data augmentation intensity dynamically according to class rarity and learning difficulty, targeting the long-tail class distribution found in benchmark datasets such as SMG. The SMG dataset comprises 40 untrimmed videos across 16 micro-gesture categories, with 35 subjects for training and 5 for testing. The system uses a VideoMAEv2-g backbone with an adapter bottleneck ratio of 0.25 and temporal and spatial kernel sizes of 3; input frames are processed at 28 fps and resized to 160x160.
Key takeaway
Machine Learning Engineers building micro-gesture recognition systems should consider a Spatial-Temporal Decoupled Adapter. Combined with Adaptive Soft Balanced Augmentation, it produced the top-ranked Track 2 result (F1 0.43808) on SMG, a challenging and imbalanced dataset. Separating spatial and temporal feature learning is intended to help models distinguish subtle gestures, while adjusting augmentation per class targets skewed class distributions. Both techniques are worth evaluating when robustness and accuracy of online recognition matter.
Key insights
Decoupling spatial and temporal adaptation, paired with adaptive augmentation, improves micro-gesture recognition on imbalanced datasets.
Principles
- Independent spatial and temporal modeling enhances fine-grained feature capture.
- Data augmentation should adapt to class rarity and learning difficulty.
- Larger pretrained backbones yield substantial performance benefits.
Method
The method employs a Spatial-Temporal Decoupled Adapter with parallel temporal and spatial branches, each using depthwise convolutions. Adaptive Soft Balanced Augmentation dynamically adjusts augmentation intensity based on effective sample counts and learning difficulty.
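A minimal PyTorch sketch of how such an adapter could be laid out is shown here. It is not the authors' implementation. The bottleneck ratio of 0.25 and the kernel sizes of 3 come from the summary; the class name `DecoupledAdapter`, the GELU activation, merging the branches by summation, the residual connection, and the zero-initialized up-projection are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DecoupledAdapter(nn.Module):
    """Hypothetical bottleneck adapter with parallel temporal and spatial branches."""

    def __init__(self, dim, ratio=0.25, k_t=3, k_s=3):
        super().__init__()
        hidden = int(dim * ratio)  # bottleneck width
        self.down = nn.Linear(dim, hidden)
        # Depthwise convolutions: groups == channels, so each channel is filtered alone.
        self.temporal = nn.Conv3d(hidden, hidden, kernel_size=(k_t, 1, 1),
                                  padding=(k_t // 2, 0, 0), groups=hidden)
        self.spatial = nn.Conv3d(hidden, hidden, kernel_size=(1, k_s, k_s),
                                 padding=(0, k_s // 2, k_s // 2), groups=hidden)
        self.act = nn.GELU()  # assumption: activation is not stated in the summary
        self.up = nn.Linear(hidden, dim)
        # Assumption: zero-init so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x, t, h, w):
        # x: (batch, t*h*w, dim) token sequence from a video transformer block.
        b = x.shape[0]
        z = self.down(x)                                # (b, n, hidden)
        z = z.transpose(1, 2).reshape(b, -1, t, h, w)   # (b, hidden, t, h, w)
        # Each branch sees the same input; summation is an assumed merge rule.
        z = self.temporal(z) + self.spatial(z)
        z = z.flatten(2).transpose(1, 2)                # back to (b, n, hidden)
        return x + self.up(self.act(z))                 # residual connection


adapter = DecoupledAdapter(dim=64)
tokens = torch.randn(2, 4 * 5 * 5, 64)  # 4 frames of 5x5 patches
out = adapter(tokens, t=4, h=5, w=5)
print(out.shape)
```

The temporal kernel spans only the frame axis and the spatial kernel only the patch grid, so neither branch mixes the other's dimension before the two outputs are merged.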
In practice
- Implement decoupled adapters for tasks requiring fine-grained spatio-temporal cues.
- Apply adaptive augmentation strategies to mitigate long-tail class distributions.
- Utilize VideoMAEv2-g or similar large pretrained backbones for video tasks.
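One possible reading of the adaptive augmentation bullet is sketched below in plain Python. The summary only says that intensity depends on effective sample counts and learning difficulty; the exact formula is not given. The sketch therefore assumes the common effective-number definition (1 - beta^n) / (1 - beta), uses one minus per-class accuracy as a stand-in for difficulty, and blends the two linearly. The function names and the parameters `beta`, `mix`, `s_min`, and `s_max` are hypothetical.

```python
def effective_number(n, beta=0.999):
    """Effective sample count (1 - beta^n) / (1 - beta) for a class with n samples."""
    return (1.0 - beta ** n) / (1.0 - beta)


def augmentation_strengths(class_counts, class_accuracy,
                           beta=0.999, mix=0.5, s_min=0.1, s_max=1.0):
    """Map each class to an augmentation strength in [s_min, s_max].

    Assumed rule: rarer classes (low effective count) and harder classes
    (low running accuracy) receive stronger augmentation.
    """
    eff = {c: effective_number(n, beta) for c, n in class_counts.items()}
    lo, hi = min(eff.values()), max(eff.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all classes are equal
    strengths = {}
    for c in class_counts:
        rarity = (hi - eff[c]) / span          # 1.0 for the rarest class
        difficulty = 1.0 - class_accuracy[c]   # proxy for learning difficulty
        score = mix * rarity + (1.0 - mix) * difficulty
        strengths[c] = s_min + (s_max - s_min) * score
    return strengths


# Toy class statistics, invented purely for illustration.
counts = {"head": 1000, "mid": 100, "tail": 10}
accuracy = {"head": 0.9, "mid": 0.6, "tail": 0.3}
for name, s in augmentation_strengths(counts, accuracy).items():
    print(name, round(s, 3))
```

In a training loop the accuracy estimates would be refreshed periodically, so the strengths shift as classes become easier or harder; how often to refresh is a design choice the summary does not cover.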
Topics
- Micro-gesture Recognition
- Temporal Action Detection
- Parameter-Efficient Fine-tuning
- VideoMAE
- Data Augmentation
- Long-tail Distribution
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.