UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

2026-06-18 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

UNIEGO is a unified egocentric video encoder trained with a hierarchical multi-teacher distillation framework, designed to overcome the limitations of narrowly scoped egocentric video understanding. The framework trains UNIEGO using nine diverse teachers, encompassing ego-exo viewpoints, RGB, depth, and skeleton modalities, alongside four distinct foundation models. Representation-specific Proxy models are a critical component: they mediate knowledge transfer by translating heterogeneous teacher knowledge into a homogeneous egocentric space, which prevents conflicting gradients. Selective Proxy Distillation (SPD) then adaptively selects only the correct and confident proxies for each training sample, suppressing erroneous signals. Initializing UNIEGO as a learned convex combination of proxy parameters stabilizes the distillation. The model achieves state-of-the-art performance in action recognition, video retrieval, and action segmentation across three challenging ego-exo benchmarks.

Key takeaway

For machine learning engineers building robust egocentric video understanding models, proxy-mediated multi-teacher distillation is worth considering. It unifies knowledge from multiple viewpoints and modalities while avoiding the conflicting gradients that arise in direct distillation. Adaptive proxy selection and initialization from proxy parameters help stabilize training and improve performance on tasks such as action recognition and video retrieval.

Key insights

Proxies mediate diverse teacher knowledge for unified egocentric video representation learning, achieving state-of-the-art performance.

Principles

Egocentric representations need complementary knowledge.
Direct multi-teacher distillation can cause conflicting gradients.
Adaptive selection of reliable supervision improves distillation.
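
The conflicting-gradient principle can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes a two-dimensional feature space, a plain squared-error distillation loss, and two hypothetical teachers (teacher_a, teacher_b) whose targets pull the student in nearly opposite directions. A negative cosine similarity between the two gradients means that following one teacher undoes progress toward the other.

```python
import math

def mse_grad(student, target):
    # Gradient of sum((s - t)^2) with respect to the student features.
    return [2.0 * (s - t) for s, t in zip(student, target)]

def cosine(u, v):
    # Cosine similarity between two gradient vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical values chosen only to make the conflict visible.
student = [0.0, 0.0]
teacher_a = [1.0, 0.2]    # e.g. a target from one representation
teacher_b = [-1.0, 0.3]   # e.g. a target from a very different one

grad_a = mse_grad(student, teacher_a)
grad_b = mse_grad(student, teacher_b)
conflict = cosine(grad_a, grad_b)

# A negative value indicates the two distillation signals disagree.
print(f"cosine(grad_a, grad_b) = {conflict:.3f}")
```

In this reading, mapping teachers into a shared egocentric space first is one way to bring such targets closer together before the student ever sees them.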

Method

A hierarchical multi-teacher distillation framework uses representation-specific Proxy models to homogenize diverse teacher knowledge. Selective Proxy Distillation (SPD) then adaptively selects confident proxies, with UNIEGO initialized as a convex combination of proxy parameters.
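
The selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary only states that SPD keeps proxies that are correct and confident for each sample, so the confidence threshold (min_confidence), the KL-based loss, and the averaging over selected proxies are all assumptions made for this example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete distributions.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_proxies(proxy_probs, label, min_confidence=0.6):
    # Keep a proxy only if its top prediction matches the label (correct)
    # and that prediction's probability clears the threshold (confident).
    # The threshold value is an assumption, not a figure from the paper.
    selected = []
    for idx, probs in enumerate(proxy_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(idx)
    return selected

def selective_distillation_loss(student_probs, proxy_probs, label,
                                min_confidence=0.6):
    # Average the distillation loss over the selected proxies only;
    # samples with no reliable proxy contribute no distillation signal.
    selected = select_proxies(proxy_probs, label, min_confidence)
    if not selected:
        return 0.0, selected
    total = sum(kl_divergence(proxy_probs[i], student_probs) for i in selected)
    return total / len(selected), selected

# Hypothetical per-sample class probabilities from three proxies.
proxy_probs = [
    [0.80, 0.10, 0.10],  # correct and confident: kept
    [0.20, 0.70, 0.10],  # confident but wrong: dropped
    [0.45, 0.35, 0.20],  # correct but not confident: dropped
]
student_probs = [0.40, 0.30, 0.30]
loss, kept = selective_distillation_loss(student_probs, proxy_probs, label=0)
```

Here only the first proxy survives selection, so the student is pulled toward a single reliable distribution rather than an average that includes an erroneous one.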

In practice

Use proxy models to harmonize heterogeneous data sources.
Implement adaptive selection for reliable multi-teacher supervision.
Initialize the student as a combination of proxy parameters for stability.
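
The initialization idea can be sketched in a few lines. The summary describes a learned convex combination of proxy parameters but does not say how the weights are parameterized; the version below assumes softmax-normalized logits (which guarantees non-negative weights summing to one) and assumes all proxies share the same architecture so their parameters can be averaged name by name. The parameter dictionaries are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax; outputs are non-negative and sum to 1,
    # which makes the resulting combination convex.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def convex_combine(proxy_params, logits):
    # proxy_params: list of dicts mapping parameter name -> flat list of floats.
    # logits: one learnable scalar per proxy (assumed parameterization).
    weights = softmax(logits)
    combined = {}
    for name in proxy_params[0]:
        size = len(proxy_params[0][name])
        combined[name] = [
            sum(w * params[name][i] for w, params in zip(weights, proxy_params))
            for i in range(size)
        ]
    return combined, weights

# Two hypothetical proxies with identical parameter shapes.
proxy_params = [
    {"w": [1.0, 3.0], "b": [0.0]},
    {"w": [3.0, 5.0], "b": [2.0]},
]
student_init, weights = convex_combine(proxy_params, logits=[0.0, 0.0])
```

With equal logits the student starts at the midpoint of the two proxies; in training, the logits would be updated so that more useful proxies receive larger weights.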

Topics

Egocentric Video Understanding
Multi-Teacher Distillation
Representation Learning
Proxy Models
Action Recognition
Video Retrieval

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.