A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Computer Vision · Depth: Expert, quick

Summary

This multi-modal framework targets micro-gesture recognition in untrimmed videos, where three problems dominate: low signal-to-noise ratio, a severely long-tailed class distribution, and cross-subject domain shift. Built for Track 1 of the 4th MiGA-IJCAI Challenge, it uses a saliency-guided pipeline that combines 68-keypoint skeleton coordinates, 3D heatmap volumes, and high-resolution RGB features. To protect tail classes, it pairs a square-root smoothed weighting mechanism with an Orthogonal Semantic Embedding Loss. A Cross-Modal Pseudo-Labeling (CMPL) strategy improves cross-subject generalization by making each single-modality model more robust, and a temperature-scaled soft-voting mechanism mitigates overconfidence during late fusion. The framework achieved an F1-score of 68.13%, placing 4th.
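The summary does not give the exact form of the square-root smoothed weighting, so the following is only a minimal sketch under one common assumption: each class weight is proportional to 1/sqrt(count), normalized so the weights average to 1. The function names and the toy label list are hypothetical. The sketch compares this against plain inverse-frequency weighting to show why the square root is called "smoothed": tail classes are still up-weighted, but far less aggressively.

```python
import math
from collections import Counter


def _normalize_to_mean_one(raw):
    """Rescale a {class: weight} dict so the weights average to 1."""
    mean = sum(raw.values()) / len(raw)
    return {cls: w / mean for cls, w in raw.items()}


def sqrt_smoothed_weights(labels):
    """Assumed form: weight_c proportional to 1 / sqrt(count_c)."""
    counts = Counter(labels)
    return _normalize_to_mean_one(
        {cls: 1.0 / math.sqrt(n) for cls, n in counts.items()}
    )


def inverse_frequency_weights(labels):
    """Baseline for comparison: weight_c proportional to 1 / count_c."""
    counts = Counter(labels)
    return _normalize_to_mean_one({cls: 1.0 / n for cls, n in counts.items()})


# Toy long-tailed label set (hypothetical): 100 head samples, 4 tail samples.
labels = ["head"] * 100 + ["tail"] * 4
sqrt_w = sqrt_smoothed_weights(labels)
inv_w = inverse_frequency_weights(labels)

# Ratio of tail weight to head weight under each scheme.
sqrt_ratio = sqrt_w["tail"] / sqrt_w["head"]
inv_ratio = inv_w["tail"] / inv_w["head"]
print(f"sqrt-smoothed tail/head ratio: {sqrt_ratio:.2f}")
print(f"inverse-frequency tail/head ratio: {inv_ratio:.2f}")
```

With a 25:1 imbalance, the square-root scheme weights the tail class about 5 times the head class, compared with about 25 times for inverse frequency. That keeps rare-class gradients from swamping training when tail samples are few and noisy.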

Key takeaway

Computer vision engineers building micro-gesture recognition systems, especially ones evaluated across subjects, should consider combining multiple modalities. Pair Cross-Modal Pseudo-Labeling for domain adaptation with semantic alignment to handle low signal-to-noise ratios and long-tailed class distributions. The source reports that this combination improves the robustness of each single-modality model as well as overall recognition performance.
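The source does not spell out how Cross-Modal Pseudo-Labeling selects or assigns labels, so the sketch below is not the authors' procedure. It illustrates a generic, confidence-thresholded variant: per-modality class probabilities on unlabeled target-subject clips are averaged, and a clip receives a pseudo-label only when the averaged confidence clears a threshold. The function name, the `threshold` value, and the toy probabilities are all assumptions made for illustration.

```python
def cross_modal_pseudo_labels(probs_by_modality, threshold=0.9):
    """Return (sample_index, label) pairs for confidently agreed samples.

    probs_by_modality maps a modality name to a list of per-sample
    class-probability lists. All modalities must cover the same samples.
    Assumed rule: average the probabilities across modalities and keep a
    sample only if the top averaged probability is >= threshold.
    """
    modalities = list(probs_by_modality)
    n_samples = len(probs_by_modality[modalities[0]])
    selected = []
    for i in range(n_samples):
        per_modality = [probs_by_modality[m][i] for m in modalities]
        n_classes = len(per_modality[0])
        averaged = [
            sum(p[k] for p in per_modality) / len(per_modality)
            for k in range(n_classes)
        ]
        label = max(range(n_classes), key=averaged.__getitem__)
        if averaged[label] >= threshold:
            selected.append((i, label))
    return selected


# Toy predictions (hypothetical) for two unlabeled clips and two classes.
probs = {
    "skeleton": [[0.95, 0.05], [0.60, 0.40]],
    "rgb": [[0.90, 0.10], [0.30, 0.70]],
}
pseudo = cross_modal_pseudo_labels(probs, threshold=0.9)
print(pseudo)
```

In this toy case only the first clip is kept, because the two modalities disagree on the second. The retained pairs would then be added to the training set of each single-modality model, which is one way the other modalities can supply supervision for subjects never seen with ground-truth labels.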

Key insights

A multi-modal framework uses pseudo-labeling and semantic alignment to improve micro-gesture recognition across subjects.

Principles

Method

Integrate 68-keypoint skeleton, 3D heatmap, and RGB features; apply square-root smoothed weighting with Orthogonal Semantic Embedding Loss; use Cross-Modal Pseudo-Labeling for domain adaptation; then employ temperature-scaled soft-voting for late fusion.
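The final fusion step can be sketched as follows. This is a minimal illustration of temperature-scaled soft voting in general, not the authors' implementation: each modality's logits are divided by a temperature before the softmax, and the resulting probabilities are averaged. The modality names, logits, and temperature values are hypothetical, and the source does not say how its temperatures were chosen.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(logits_by_modality, temperatures):
    """Average temperature-scaled class probabilities across modalities."""
    n_modalities = len(logits_by_modality)
    n_classes = len(next(iter(logits_by_modality.values())))
    fused = [0.0] * n_classes
    for name, logits in logits_by_modality.items():
        probs = softmax(logits, temperatures[name])
        for k, p in enumerate(probs):
            fused[k] += p / n_modalities
    return fused


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Hypothetical logits: RGB is extremely confident in class 0, while the
# two pose-based modalities mildly agree on class 1.
logits = {
    "rgb": [10.0, 0.0, 0.0],
    "skeleton": [0.0, 1.0, 0.0],
    "heatmap": [0.0, 1.0, 0.0],
}

unscaled = soft_vote(logits, {"rgb": 1.0, "skeleton": 1.0, "heatmap": 1.0})
scaled = soft_vote(logits, {"rgb": 5.0, "skeleton": 1.0, "heatmap": 1.0})
print("winner without temperature scaling:", argmax(unscaled))
print("winner with rgb temperature 5.0:", argmax(scaled))
```

Without scaling, the single overconfident modality outvotes the two that agree with each other. Softening its distribution with a higher temperature lets the majority prevail, which is the overconfidence effect the late-fusion step is meant to mitigate.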

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.