UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning
Summary
UNIEGO is a unified egocentric encoder developed through a hierarchical multi-teacher distillation framework to overcome the limitations of narrow egocentric video understanding. This framework trains UNIEGO using nine diverse teachers, encompassing ego-exo viewpoints, RGB, depth, and skeleton modalities, alongside four distinct foundation models. A critical component involves representation-specific Proxy models, which mediate knowledge transfer by translating heterogeneous teacher knowledge into a homogeneous egocentric space, thereby preventing conflicting gradients. The process further incorporates Selective Proxy Distillation (SPD), which adaptively selects only correct and confident proxies for each training sample, suppressing erroneous signals. UNIEGO's initialization as a learned convex combination of proxy parameters stabilizes this distillation. The model achieves state-of-the-art performance in action recognition, video retrieval, and action segmentation across three challenging ego-exo benchmarks.
Key takeaway
Machine Learning Engineers developing robust egocentric video understanding models should consider a proxy-mediated multi-teacher distillation approach. This method unifies diverse knowledge from multiple viewpoints and modalities and avoids the conflicting gradients that direct distillation can cause. Adaptive proxy selection and initialization from proxy parameters are worth exploring to stabilize training and improve performance on tasks such as action recognition and video retrieval.
Key insights
Proxies mediate diverse teacher knowledge for unified egocentric video representation learning, achieving state-of-the-art performance.
Principles
- Egocentric representations need complementary knowledge.
- Direct multi-teacher distillation can cause conflicting gradients.
- Adaptive selection of reliable supervision improves distillation.
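The second principle can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes a two-dimensional student feature and two hypothetical teachers (`teacher_a`, `teacher_b`) whose targets lie in different directions, and it uses a plain mean-squared-error loss to show how their gradients can point against each other.

```python
import math

def mse_grad(student, target):
    """Gradient of mean((student - target)^2) with respect to the student vector."""
    n = len(student)
    return [2.0 * (s - t) / n for s, t in zip(student, target)]

def cosine(a, b):
    """Cosine similarity between two gradient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical values: one student feature, two teachers from different feature spaces.
student = [0.0, 0.0]
teacher_a = [1.0, 0.5]
teacher_b = [-1.0, -0.25]

grad_a = mse_grad(student, teacher_a)
grad_b = mse_grad(student, teacher_b)

# A negative cosine means the two updates partly cancel when summed.
conflict = cosine(grad_a, grad_b)
print(f"cosine between teacher gradients: {conflict:.3f}")
```

A negative cosine similarity here is one simple way to detect conflict; the paper may quantify it differently.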
Method
A hierarchical multi-teacher distillation framework uses representation-specific Proxy models to translate heterogeneous teacher knowledge into a homogeneous egocentric space. Selective Proxy Distillation (SPD) then adaptively selects only the correct and confident proxies for each training sample, and UNIEGO is initialized as a learned convex combination of proxy parameters.
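A minimal sketch of the per-sample selection idea follows. The summary does not specify the actual criterion, so everything here is an assumption: each proxy is taken to output class logits, a proxy counts as "correct" when its top prediction matches the label, it counts as "confident" when its top probability reaches a hypothetical `min_confidence` threshold, and the student loss is the mean KL divergence to the selected proxies.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_proxies(proxy_logits, label, min_confidence=0.5):
    """Indices of proxies that are both correct and confident for one sample.

    The threshold rule is an illustrative assumption, not the paper's criterion.
    """
    selected = []
    for i, logits in enumerate(proxy_logits):
        probs = softmax(logits)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(i)
    return selected

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def spd_loss(student_logits, proxy_logits, label, min_confidence=0.5):
    """Mean KL from the selected proxies to the student; 0.0 if none qualify."""
    selected = select_proxies(proxy_logits, label, min_confidence)
    if not selected:
        return 0.0, selected
    student_probs = softmax(student_logits)
    losses = [kl_divergence(softmax(proxy_logits[i]), student_probs) for i in selected]
    return sum(losses) / len(losses), selected

# Hypothetical logits from three proxies for one sample whose true class is 0.
proxy_logits = [
    [4.0, 0.5, 0.1],  # correct and confident
    [0.2, 3.0, 0.1],  # confident but wrong
    [0.4, 0.3, 0.3],  # correct but not confident
]
loss, selected = spd_loss([1.0, 0.8, 0.2], proxy_logits, label=0)
print(selected, round(loss, 4))
```

In this example only the first proxy contributes to the loss; the wrong proxy and the uncertain proxy are both excluded for this sample.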
In practice
- Use proxy models to harmonize heterogeneous data sources.
- Implement adaptive selection for reliable multi-teacher supervision.
- Initialize the student as a convex combination of proxy parameters for stability.
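The convex-combination initialization named in the Method section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: all proxies are assumed to share one architecture so their parameters align name by name, parameters are represented as plain lists of floats, and the mixing weights come from a softmax over hypothetical scalar `scores`, which keeps them non-negative and summing to one. In a real setup those scores would be learned; here they are fixed.

```python
import math

def convex_weights(scores):
    """Softmax over scores: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def init_from_proxies(proxy_params, scores):
    """Build student parameters as a weighted average of aligned proxy parameters.

    proxy_params: list of dicts mapping parameter name -> list of floats.
    scores: one scalar per proxy (hypothetical stand-in for learned values).
    """
    weights = convex_weights(scores)
    student = {}
    for name in proxy_params[0]:
        size = len(proxy_params[0][name])
        student[name] = [
            sum(w * p[name][j] for w, p in zip(weights, proxy_params))
            for j in range(size)
        ]
    return student, weights

# Two hypothetical proxies with identical parameter shapes.
proxies = [
    {"layer.weight": [1.0, 0.0], "layer.bias": [0.0]},
    {"layer.weight": [0.0, 1.0], "layer.bias": [2.0]},
]
student, weights = init_from_proxies(proxies, scores=[0.0, 0.0])
print(weights, student)
```

Because the weights form a convex combination, every initial student parameter lies between the corresponding proxy values, so the student starts inside the region spanned by the proxies rather than at a random point.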
Topics
- Egocentric Video Understanding
- Multi-Teacher Distillation
- Representation Learning
- Proxy Models
- Action Recognition
- Video Retrieval
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.