SkillSpotter: Pose-Aware Multi-View Skilled Action Detection and Grading in Ego-Exo Videos

2026-06-30 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

SkillSpotter is a pose-aware multi-view architecture for detecting and grading skilled actions in ego-exo videos. It addresses the problem of assessing how well an action is executed rather than only identifying which action occurs, with the goal of supporting personalized, real-time coaching in domains such as sports or cooking. Existing methods grade near-randomly, as the Ego-Exo4D proficiency benchmark highlights. SkillSpotter adds three task-specific modules: adaptive temporal suppression to handle varying action density, gated 3D body pose fusion to integrate body kinematics, and bidirectional cross-view attention to combine ego and exo views. Over the best baseline, it raises class-specific mAP from 12.40 to 21.82 (+76% relative) and balanced accuracy from 55.99% to 60.40% (+4.41 percentage points). The modules are transferable, and the method generalizes to HoloAssist.
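The reported gains can be checked with a few lines of arithmetic. The sketch below recomputes the relative mAP gain and the balanced-accuracy gain in percentage points, and shows the standard definition of balanced accuracy (mean per-class recall) on a toy two-class example. The function names and the toy labels are illustrative assumptions; the source does not state how many proficiency classes the benchmark uses or exactly how the authors compute the metric.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def balanced_accuracy(y_true, y_pred) -> float:
    """Standard balanced accuracy: the mean of per-class recall."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(1 for i in idx if y_pred[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Figures quoted in the summary above.
map_gain = relative_gain(12.40, 21.82)  # relative gain in percent
bacc_gain_pts = 60.40 - 55.99           # absolute gain in percentage points

# Toy labels (hypothetical): a grader that always answers "good".
toy_true = ["good", "good", "good", "bad"]
toy_pred = ["good", "good", "good", "good"]
toy_bacc = balanced_accuracy(toy_true, toy_pred)
```

In the toy example the always-"good" grader has 75% plain accuracy but only 50% balanced accuracy, which is why balanced accuracy is a natural way to expose near-random grading on imbalanced labels.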

Key takeaway

For computer vision engineers building real-time coaching or skill assessment systems, SkillSpotter shows a way to move from action detection to proficiency grading. Consider pose-aware multi-view architectures, in particular gated 3D body pose fusion and bidirectional cross-view attention, when grading accuracy matters. The reported gains in class-specific mAP and balanced accuracy over the best baseline suggest a viable path toward more effective personalized feedback tools.

Key insights

SkillSpotter jointly detects and grades skilled actions in ego-exo videos by fusing visual and pose data across multiple views.

Principles

Combine ego and exo views for comprehensive understanding.
Body kinematics complement visual features for skill assessment.
Adapt temporal processing to varying action densities.
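The third principle, adapting temporal processing to action density, can be illustrated with a density-aware variant of Gaussian Soft-NMS over temporal segments. This is a minimal sketch under stated assumptions, not the authors' module: the source does not describe how SkillSpotter's adaptive temporal suppression works, and the density measure (candidates per second), the rule `sigma = base_sigma * (1 + density)`, and all parameter names are hypothetical.

```python
import math


def temporal_iou(a, b) -> float:
    """Intersection over union of two (start, end, ...) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def adaptive_soft_nms(segments, video_len, base_sigma=0.5, min_score=0.05):
    """Gaussian Soft-NMS whose strength depends on candidate density.

    segments: list of (start, end, score) tuples.
    Assumption: denser videos get a larger sigma, hence weaker suppression,
    so that closely spaced true actions are less likely to be removed.
    """
    if not segments:
        return []
    density = len(segments) / video_len      # hypothetical density measure
    sigma = base_sigma * (1.0 + density)     # hypothetical adaptation rule
    pool = [list(s) for s in segments]
    kept = []
    while pool:
        best = max(range(len(pool)), key=lambda i: pool[i][2])
        cur = pool.pop(best)
        kept.append(tuple(cur))
        # Decay the scores of the remaining segments by their overlap with cur.
        for s in pool:
            s[2] *= math.exp(-temporal_iou(cur, s) ** 2 / sigma)
        pool = [s for s in pool if s[2] >= min_score]
    return kept


candidates = [(0.0, 2.0, 0.9), (0.1, 2.1, 0.8), (5.0, 6.0, 0.7)]
sparse = adaptive_soft_nms(candidates, video_len=10.0)  # low density
dense = adaptive_soft_nms(candidates, video_len=1.0)    # high density
```

With the same candidates, the overlapping second segment keeps a higher score in the dense setting than in the sparse one, which is the behavior the adaptation rule is meant to produce.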

Method

SkillSpotter employs adaptive temporal suppression, gated 3D body pose fusion, and bidirectional cross-view attention to process multi-view ego-exo video for joint action detection and grading.
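To make the cross-view step concrete, here is a minimal sketch of bidirectional cross-attention between per-frame ego and exo features, using plain scaled dot-product attention with a residual connection. It is an illustration under simplifying assumptions rather than the paper's layer: learned query, key, and value projections, multiple heads, and normalization are omitted, and the function names are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query frame attends over all context frames (scaled dot-product)."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out


def bidirectional_cross_view(ego, exo):
    """Ego attends to exo and exo attends to ego, each added as a residual."""
    ego_ctx = cross_attend(ego, exo)
    exo_ctx = cross_attend(exo, ego)
    ego_out = [[a + b for a, b in zip(f, c)] for f, c in zip(ego, ego_ctx)]
    exo_out = [[a + b for a, b in zip(f, c)] for f, c in zip(exo, exo_ctx)]
    return ego_out, exo_out


# Toy features (hypothetical): 2 ego frames and 3 exo frames, dimension 2.
ego_feats = [[1.0, 0.0], [0.0, 1.0]]
exo_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ego_fused, exo_fused = bidirectional_cross_view(ego_feats, exo_feats)
```

Both streams keep their own length and feature dimension, so the fused features can feed an existing temporal detection head without reshaping. The two views need not have the same number of frames.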

In practice

Develop AR coaching systems for sports or music.
Integrate 3D pose data into action recognition models.
Enhance existing temporal action detection architectures.
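For the second item in the list above, one common way to add pose features to a visual stream is an element-wise sigmoid gate that decides how much pose information to admit per feature channel. The sketch below is a generic gated-fusion pattern under stated assumptions, not SkillSpotter's exact module: it assumes pose features are already projected to the visual feature dimension, and `gate_w` and `gate_b` stand in for learned parameters.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_pose_fusion(visual, pose, gate_w, gate_b):
    """Fuse one frame's visual and pose features with an element-wise gate.

    visual, pose: lists of length D (pose assumed pre-projected to D).
    gate_w: D rows of length 2 * D; gate_b: list of length D.
    Returns visual + gate * pose, with gate in (0, 1) per channel.
    """
    joint = list(visual) + list(pose)  # concatenate the two feature vectors
    gate = [sigmoid(sum(w * x for w, x in zip(row, joint)) + b)
            for row, b in zip(gate_w, gate_b)]
    return [v + g * p for v, g, p in zip(visual, gate, pose)]


# Toy inputs (hypothetical values), feature dimension D = 2.
visual_feat = [0.2, -0.4]
pose_feat = [1.0, 2.0]
zero_w = [[0.0] * 4 for _ in range(2)]

half_open = gated_pose_fusion(visual_feat, pose_feat, zero_w, [0.0, 0.0])
closed = gated_pose_fusion(visual_feat, pose_feat, zero_w, [-50.0, -50.0])
```

With zero weights and zero bias the gate sits at 0.5, so half of the pose signal is added. With a strongly negative bias the gate closes and the output falls back to the visual features, which is useful when pose estimates are unreliable.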

Topics

Ego-Exo Videos
Skilled Action Detection
Pose Estimation
Multi-View Learning
Action Grading
Computer Vision

Code references

eth-siplab/SkillSpotter

Best for: Research Scientist, AI Scientist, Computer Vision Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.