A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition
Summary
This work presents a multi-modal framework for micro-gesture recognition in untrimmed videos. It targets three challenges: low signal-to-noise ratio, a severely long-tailed class distribution, and cross-subject domain shift. Designed for Track 1 of the 4th MiGA-IJCAI Challenge, the framework integrates a saliency-guided pipeline that combines 68-keypoint skeleton coordinates, 3D heatmap volumes, and high-resolution RGB features. To protect tail classes, it pairs a square-root smoothed weighting mechanism with an Orthogonal Semantic Embedding Loss. A Cross-Modal Pseudo-Labeling (CMPL) strategy improves cross-subject generalization by making each single-modality model more robust, and a temperature-scaled soft-voting mechanism mitigates overconfidence during late fusion. The framework achieved an F1-score of 68.13%, securing 4th place.
Key takeaway
Computer vision engineers building micro-gesture recognition systems, especially for cross-subject evaluation, should consider combining multiple modalities. Cross-Modal Pseudo-Labeling can serve as a domain adaptation step across subjects, while semantic alignment and tail-aware weighting address low signal-to-noise ratios and long-tailed class distributions. The reported effect is more robust single-modality models and better overall recognition.
Key insights
A multi-modal framework uses pseudo-labeling and semantic alignment to improve micro-gesture recognition across subjects.
Principles
- Micro-gestures are hard to recognize because of low signal-to-noise ratio (SNR) and cross-subject domain shift.
- Multi-modal fusion helps capture fine-grained representations.
- Pseudo-labeling bridges cross-subject generalization gaps.
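The pseudo-labeling principle can be illustrated with a minimal sketch. The summary does not describe how CMPL selects or applies its pseudo-labels, so everything here is an assumption: the function name `select_pseudo_labels`, the 0.9 confidence threshold, and the idea of filtering fused multi-modal predictions on unlabeled clips from unseen subjects are hypothetical choices for illustration, not the authors' procedure.

```python
def select_pseudo_labels(fused_probs, threshold=0.9):
    """Keep only confident predictions as pseudo-labels.

    fused_probs: list of per-clip class-probability lists (assumed to come
    from a multi-modal ensemble run on unlabeled target-subject clips).
    Returns a list of (clip_index, class_index) pairs.
    """
    selected = []
    for i, probs in enumerate(fused_probs):
        # Index of the most probable class for this clip.
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            selected.append((i, top))
    return selected


# Hypothetical fused predictions for three unlabeled clips.
fused_probs = [
    [0.95, 0.03, 0.02],  # confident: kept
    [0.40, 0.35, 0.25],  # ambiguous: discarded
    [0.05, 0.92, 0.03],  # confident: kept
]
pseudo = select_pseudo_labels(fused_probs, threshold=0.9)
```

In a setup like this, the retained pairs would be added to the training set of each single-modality model, which is one plausible way for a stronger fused prediction to supervise a weaker individual branch.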
Method
Integrate 68-keypoint skeleton, 3D heatmap, and RGB features; apply square-root smoothed weighting with Orthogonal Semantic Embedding Loss; use Cross-Modal Pseudo-Labeling for domain adaptation; then employ temperature-scaled soft-voting for late fusion.
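To make the square-root smoothed weighting step concrete, here is a minimal sketch. The exact formula is not given in this summary, so the weights below assume a common form: each class weight is proportional to 1/sqrt(class count), then normalized so the weights average to 1. The function name and the normalization are assumptions for illustration.

```python
import math
from collections import Counter


def sqrt_smoothed_class_weights(labels):
    """Per-class loss weights proportional to 1/sqrt(count), mean-normalized.

    Compared with plain inverse-frequency weighting, the square root
    dampens the gap between head and tail classes.
    """
    counts = Counter(labels)
    raw = {c: 1.0 / math.sqrt(n) for c, n in counts.items()}
    mean_raw = sum(raw.values()) / len(raw)
    return {c: w / mean_raw for c, w in raw.items()}


# Hypothetical long-tailed label set: 100 head samples, 4 tail samples.
labels = ["head"] * 100 + ["tail"] * 4
weights = sqrt_smoothed_class_weights(labels)
tail_to_head = weights["tail"] / weights["head"]
```

With a 25:1 count imbalance, inverse-frequency weighting would up-weight the tail class 25 times relative to the head class, whereas the square-root form up-weights it only 5 times. This keeps tail classes from being ignored without letting a handful of rare samples dominate the loss.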
In practice
- Combine skeleton, heatmap, and RGB for robust feature extraction.
- Implement pseudo-labeling for unsupervised domain adaptation.
- Use soft-voting to mitigate overconfidence in fusion.
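The soft-voting step above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the temperature values, the equal modality weights, and the example logits are all hypothetical. Each modality's logits are divided by a temperature before the softmax, and the resulting probabilities are averaged.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0, modality_weights=None):
    """Weighted average of temperature-scaled softmax outputs.

    per_modality_logits: one logit list per modality, all of equal length.
    modality_weights: optional weights summing to 1 (defaults to uniform).
    """
    n = len(per_modality_logits)
    if modality_weights is None:
        modality_weights = [1.0 / n] * n
    fused = [0.0] * len(per_modality_logits[0])
    for w, logits in zip(modality_weights, per_modality_logits):
        for k, p in enumerate(softmax(logits, temperature)):
            fused[k] += w * p
    return fused


# Hypothetical logits: the skeleton branch is very confident in class 0,
# while the heatmap and RGB branches moderately prefer class 1.
logits = [
    [8.0, 0.0, 0.0],  # skeleton
    [0.0, 3.0, 0.0],  # heatmap
    [0.0, 3.0, 0.0],  # RGB
]
fused_t1 = soft_vote(logits, temperature=1.0)
fused_t2 = soft_vote(logits, temperature=2.0)
```

A temperature above 1 flattens each modality's distribution, so the fused probabilities are less peaked and a single overconfident branch contributes less extreme votes. The temperature is a tunable hyperparameter, typically chosen on held-out data.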
Topics
- Micro-Gesture Recognition
- Multi-Modal Learning
- Unsupervised Domain Adaptation
- Pseudo-Labeling
- Semantic Alignment
- Computer Vision
Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.