Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

· Source: cs.CV updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, extended

Summary

A Spatial-Temporal Decoupled Adapter, combined with Adaptive Soft Balanced Augmentation, achieved an F1 score of 0.43808 and ranked 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge for micro-gesture online recognition. The task is to localize and classify subtle gestures in untrimmed videos, which is hard because micro-gestures are extremely short, have low motion amplitude, and offer ambiguous visual cues. The proposed adapter splits video adaptation into independent temporal and spatial branches built from lightweight depthwise convolutions, which prevents feature entanglement. Adaptive Soft Balanced Augmentation dynamically adjusts data augmentation intensity according to class rarity and learning difficulty, targeting the long-tail class distribution found in benchmark datasets such as SMG. The SMG dataset comprises 40 untrimmed videos across 16 micro-gesture categories, with 35 subjects for training and 5 for testing. The system uses a VideoMAEv2-g backbone with an adapter bottleneck ratio of 0.25 and temporal and spatial kernel sizes of 3, and it processes frames at 28 fps, resized to 160x160.
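To make the adapter description concrete, here is a minimal PyTorch sketch of one way such a module could be laid out. Only the bottleneck ratio of 0.25, the kernel sizes of 3, and the use of parallel depthwise temporal and spatial branches come from the summary. The class name `DecoupledAdapter`, the down/up projections, the GELU activation, the summation of the two branches, the residual connection, and the zero-initialized output projection are all assumptions made for illustration, not the authors' implementation.

```python
import torch
from torch import nn


class DecoupledAdapter(nn.Module):
    """Hypothetical adapter: bottleneck + parallel depthwise temporal/spatial convs."""

    def __init__(self, dim, ratio=0.25, k_t=3, k_s=3):
        super().__init__()
        hidden = int(dim * ratio)  # bottleneck width (ratio 0.25 per the summary)
        self.down = nn.Linear(dim, hidden)
        # Temporal branch: depthwise conv that mixes along the time axis only.
        self.temporal = nn.Conv3d(
            hidden, hidden, kernel_size=(k_t, 1, 1),
            padding=(k_t // 2, 0, 0), groups=hidden,
        )
        # Spatial branch: depthwise conv that mixes along height and width only.
        self.spatial = nn.Conv3d(
            hidden, hidden, kernel_size=(1, k_s, k_s),
            padding=(0, k_s // 2, k_s // 2), groups=hidden,
        )
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)
        # Assumption: zero-init the output projection so the adapter starts
        # as an identity mapping and leaves the backbone untouched at step 0.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x, grid):
        # x: (batch, T*H*W, dim) token sequence; grid: (T, H, W) token layout.
        t, h, w = grid
        b = x.shape[0]
        z = self.down(x)                               # (b, T*H*W, hidden)
        z = z.transpose(1, 2).reshape(b, -1, t, h, w)  # (b, hidden, T, H, W)
        z = self.temporal(z) + self.spatial(z)         # parallel branches, summed
        z = z.flatten(2).transpose(1, 2)               # back to (b, T*H*W, hidden)
        return x + self.up(self.act(z))                # residual connection


# Toy sizes chosen for illustration only, not the backbone's real dimensions.
adapter = DecoupledAdapter(dim=64)
tokens = torch.randn(2, 4 * 5 * 5, 64)
out = adapter(tokens, grid=(4, 5, 5))
print(out.shape)
```

Because each branch is depthwise (`groups=hidden`), the temporal kernel never mixes across spatial positions and the spatial kernel never mixes across frames, which is one plausible reading of "decoupled" here.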

Key takeaway

Machine learning engineers building micro-gesture recognition systems should consider a Spatial-Temporal Decoupled Adapter. Combined with Adaptive Soft Balanced Augmentation, it ranked 1st on the challenging, imbalanced SMG benchmark with an F1 score of 0.43808. Separating spatial and temporal feature learning helps a model distinguish subtle gestures, and adjusting augmentation dynamically per class helps it cope with skewed class distributions. Both techniques are worth exploring to improve the robustness and accuracy of online recognition systems.

Key insights

Decoupling spatial and temporal adaptation, paired with adaptive augmentation, improves micro-gesture recognition on imbalanced datasets.

Principles

Method

The method employs a Spatial-Temporal Decoupled Adapter with parallel temporal and spatial branches, each using depthwise convolutions. Adaptive Soft Balanced Augmentation dynamically adjusts augmentation intensity based on effective sample counts and learning difficulty.
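The augmentation rule described above can be sketched in a few lines of plain Python. The summary does not give the actual formula, so everything below is an assumption for illustration: the effective sample count uses the common class-balanced form (1 - beta^n) / (1 - beta), learning difficulty is approximated as one minus a per-class running accuracy, and the two are blended linearly with a hypothetical weight `alpha` into a strength between `lo` and `hi`. The class labels and counts are made up.

```python
def effective_number(count, beta=0.999):
    """Effective sample count (1 - beta^n) / (1 - beta); saturates for large n."""
    return (1.0 - beta ** count) / (1.0 - beta)


def augmentation_strength(counts, class_accuracy, beta=0.999,
                          alpha=0.5, lo=0.1, hi=1.0):
    """Map each class to an augmentation strength in [lo, hi].

    Assumed rule: rarer classes and classes the model currently gets wrong
    receive stronger augmentation.
    """
    eff = {c: effective_number(n, beta) for c, n in counts.items()}
    e_min, e_max = min(eff.values()), max(eff.values())
    span = (e_max - e_min) or 1.0  # avoid division by zero if all classes match
    strengths = {}
    for c in counts:
        rarity = (e_max - eff[c]) / span      # 1.0 for rarest, 0.0 for most common
        difficulty = 1.0 - class_accuracy[c]  # 1.0 when the class is always missed
        score = alpha * rarity + (1.0 - alpha) * difficulty
        strengths[c] = lo + (hi - lo) * score
    return strengths


# Hypothetical long-tailed class counts and running per-class accuracies.
counts = {"head": 5000, "mid": 500, "tail": 20}
accuracy = {"head": 0.9, "mid": 0.6, "tail": 0.3}
for name, s in augmentation_strength(counts, accuracy).items():
    print(f"{name}: {s:.3f}")
```

Recomputing the strengths as the per-class accuracy changes during training is what would make such a scheme dynamic; how often to refresh them is a further design choice the summary does not specify.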

Topics

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CV updates on arXiv.org.