Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

2026-06-08 · Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

A Spatial-Temporal Decoupled Adapter, combined with Adaptive Soft Balanced Augmentation, achieved an F1 score of 0.43808 and ranked 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge for micro-gesture online recognition. The task is to localize and classify subtle gestures in untrimmed videos, which is hard because the gestures are extremely short, have low motion amplitude, and offer ambiguous visual cues. The adapter decouples video adaptation into independent temporal and spatial branches built from lightweight depthwise convolutions, which prevents feature entanglement. Adaptive Soft Balanced Augmentation adjusts augmentation intensity dynamically according to class rarity and learning difficulty, targeting the long-tail distribution common in benchmark datasets such as SMG. The SMG dataset comprises 40 untrimmed videos across 16 micro-gesture categories, with 35 subjects for training and 5 for testing. The system uses a VideoMAEv2-g backbone with an adapter bottleneck ratio of 0.25 and temporal and spatial kernel sizes of 3; frames are processed at 28 fps and resized to 160x160.
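
To give a rough sense of how lightweight such an adapter might be, the sketch below tallies parameters for a single adapter under the reported settings (bottleneck ratio 0.25, temporal and spatial kernel size 3). The layer layout (down-projection, two depthwise branches, up-projection), the names `AdapterConfig`, `adapter_param_count`, and `dense_3d_conv_params`, and the backbone token width of 1408 are assumptions made for illustration; none of them is stated in the summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterConfig:
    hidden_dim: int = 1408          # ASSUMED backbone token width, not from the summary
    bottleneck_ratio: float = 0.25  # reported
    temporal_kernel: int = 3        # reported
    spatial_kernel: int = 3         # reported

    @property
    def bottleneck_dim(self) -> int:
        return int(self.hidden_dim * self.bottleneck_ratio)


def adapter_param_count(cfg: AdapterConfig) -> dict:
    """Weights plus biases for each piece of the ASSUMED adapter layout."""
    D, d = cfg.hidden_dim, cfg.bottleneck_dim
    counts = {
        "down_proj": D * d + d,
        # Depthwise: one small kernel per channel, no cross-channel mixing.
        "temporal_depthwise": d * cfg.temporal_kernel + d,
        "spatial_depthwise": d * cfg.spatial_kernel ** 2 + d,
        "up_proj": d * D + D,
    }
    counts["total"] = sum(counts.values())
    return counts


def dense_3d_conv_params(cfg: AdapterConfig) -> int:
    """Reference point: one dense (non-depthwise) 3D conv on the bottleneck."""
    d = cfg.bottleneck_dim
    return d * d * cfg.temporal_kernel * cfg.spatial_kernel ** 2 + d


if __name__ == "__main__":
    cfg = AdapterConfig()
    for name, n in adapter_param_count(cfg).items():
        print(f"{name:>20}: {n:,}")
    print(f"{'dense 3D conv (ref)':>20}: {dense_3d_conv_params(cfg):,}")
```

Under these assumptions the two depthwise branches contribute only a small fraction of the adapter's parameters; almost all of the cost sits in the two projections.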

Key takeaway

If you build micro-gesture recognition systems, consider implementing a Spatial-Temporal Decoupled Adapter. Combined with Adaptive Soft Balanced Augmentation, it reached an F1 score of 0.43808 and first place in Track 2 of the challenge on the imbalanced SMG benchmark. Separating spatial from temporal feature learning helps your models distinguish subtle gestures, and scaling augmentation by class rarity and learning difficulty counteracts skewed class distributions. Explore both techniques to improve the robustness and accuracy of your online recognition solutions.

Key insights

Decoupling spatial and temporal adaptation, paired with adaptive augmentation, produced the top-ranked result (F1 0.43808) for micro-gesture recognition on an imbalanced dataset.

Principles

Independent spatial and temporal modeling enhances fine-grained feature capture.
Data augmentation should adapt to class rarity and learning difficulty.
Larger pretrained backbones yield substantial performance benefits.

Method

The method employs a Spatial-Temporal Decoupled Adapter with parallel temporal and spatial branches, each using depthwise convolutions. Adaptive Soft Balanced Augmentation dynamically adjusts augmentation intensity based on effective sample counts and learning difficulty.
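
The parallel-branch idea described above can be illustrated with a dependency-free sketch. It is written with plain nested lists for readability; a real implementation would use a tensor library. Everything beyond "parallel temporal and spatial depthwise branches inside a bottleneck" is an assumption here: the token layout `x[t][h][w]`, the GELU activation, fusing the branches by summation, the residual connection, the zero-initialised up-projection, and all function names.

```python
import math
import random


def gelu(v):
    # Exact GELU; the choice of activation is an assumption.
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def linear(vec, weight, bias):
    # Dense layer applied to one token; weight is laid out as [n_out][n_in].
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def dims(z):
    return len(z), len(z[0]), len(z[0][0]), len(z[0][0][0])


def depthwise_temporal(z, taps):
    # z[t][h][w][c]; taps[c] is a 1D kernel slid along t only (zero padding).
    T, H, W, C = dims(z)
    k = len(taps[0])
    r = k // 2
    return [[[[sum(taps[c][j] * z[t + j - r][h][w][c]
                   for j in range(k) if 0 <= t + j - r < T)
               for c in range(C)] for w in range(W)] for h in range(H)] for t in range(T)]


def depthwise_spatial(z, kernels):
    # kernels[c] is a k x k kernel slid over (h, w) within each frame (zero padding).
    T, H, W, C = dims(z)
    k = len(kernels[0])
    r = k // 2
    return [[[[sum(kernels[c][i][j] * z[t][h + i - r][w + j - r][c]
                   for i in range(k) for j in range(k)
                   if 0 <= h + i - r < H and 0 <= w + j - r < W)
               for c in range(C)] for w in range(W)] for h in range(H)] for t in range(T)]


def init_params(hidden_dim, ratio=0.25, k=3, seed=0):
    rng = random.Random(seed)
    d = max(1, int(hidden_dim * ratio))

    def g():
        return rng.gauss(0.0, 0.02)

    return {
        "w_down": [[g() for _ in range(hidden_dim)] for _ in range(d)],
        "b_down": [0.0] * d,
        "taps_t": [[g() for _ in range(k)] for _ in range(d)],
        "kernels_s": [[[g() for _ in range(k)] for _ in range(k)] for _ in range(d)],
        # ASSUMPTION: zero up-projection, so the adapter starts as an identity map.
        "w_up": [[0.0] * d for _ in range(hidden_dim)],
        "b_up": [0.0] * hidden_dim,
    }


def adapter_forward(x, p):
    # x[t][h][w] is one token of length hidden_dim; output has the same shape.
    z = [[[[gelu(v) for v in linear(tok, p["w_down"], p["b_down"])]
           for tok in row] for row in frame] for frame in x]
    zt = depthwise_temporal(z, p["taps_t"])    # sees only neighbours in time
    zs = depthwise_spatial(z, p["kernels_s"])  # sees only neighbours in space
    out = []
    for t, frame in enumerate(x):
        new_frame = []
        for h, row in enumerate(frame):
            new_row = []
            for w, tok in enumerate(row):
                fused = [a + b for a, b in zip(zt[t][h][w], zs[t][h][w])]
                delta = linear(fused, p["w_up"], p["b_up"])
                new_row.append([xi + di for xi, di in zip(tok, delta)])  # residual
            new_frame.append(new_row)
        out.append(new_frame)
    return out
```

Because each branch reads the same bottleneck features and neither consumes the other's output, temporal and spatial filters are learned independently in this sketch, which is one way to read "decoupled". With the zero-initialised up-projection, `adapter_forward(x, init_params(D))` returns `x` unchanged at the start of training.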

In practice

Implement decoupled adapters for tasks requiring fine-grained spatio-temporal cues.
Apply adaptive augmentation strategies to mitigate long-tail class distributions.
Use VideoMAEv2-g or a similarly large pretrained backbone for video tasks.
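
The adaptive augmentation item in the list above could be approached along the following lines. This is a minimal sketch, not the paper's formula: it assumes the "effective number of samples" form (1 - beta^n) / (1 - beta) for rarity, uses one minus a per-class running accuracy as a stand-in for learning difficulty, and blends the two linearly. The function names, the `alpha`, `lo`, and `hi` parameters, and the class labels in the usage note are hypothetical.

```python
def effective_number(n, beta=0.999):
    # Effective number of samples for a class with n examples.
    return (1.0 - beta ** n) / (1.0 - beta)


def augmentation_intensity(class_counts, class_accuracy,
                           beta=0.999, alpha=0.5, lo=0.1, hi=0.9):
    """Map each class to an augmentation strength in [lo, hi].

    class_counts:   {class_name: number of training samples (>= 1)}
    class_accuracy: {class_name: running accuracy in [0, 1]}
    alpha:          weight on rarity; (1 - alpha) goes to difficulty (ASSUMED blend)
    """
    eff = {c: effective_number(n, beta) for c, n in class_counts.items()}
    e_min = min(eff.values())
    intensity = {}
    for c in class_counts:
        rarity = e_min / eff[c]                # 1.0 for the rarest class
        difficulty = 1.0 - class_accuracy[c]   # 1.0 for a class never predicted correctly
        score = alpha * rarity + (1.0 - alpha) * difficulty
        intensity[c] = lo + (hi - lo) * score  # bounded, changes smoothly with both terms
    return intensity
```

For example, `augmentation_intensity({"head": 5000, "tail": 20}, {"head": 0.9, "tail": 0.3})` assigns the rare, poorly learned class a much stronger augmentation than the frequent, well-learned one. The resulting value could then scale whatever augmentations the pipeline applies; recomputing it each epoch from updated accuracies is what would make it dynamic.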

Topics

Micro-gesture Recognition
Temporal Action Detection
Parameter-Efficient Fine-tuning
VideoMAE
Data Augmentation
Long-tail Distribution

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.