UNIEGO: Proxies as Mediators for Unified Egocentric Video Representation Learning

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

UNIEGO is a unified egocentric video encoder trained with a hierarchical multi-teacher distillation framework, designed to overcome the limitations of narrowly scoped egocentric video understanding. This framework trains UNIEGO using nine diverse teachers, encompassing ego-exo viewpoints, RGB, depth, and skeleton modalities, alongside four distinct foundation models. Its central component is a set of representation-specific Proxy models, which mediate knowledge transfer by translating heterogeneous teacher knowledge into a homogeneous egocentric space, thereby preventing conflicting gradients. Selective Proxy Distillation (SPD) then adaptively selects only the correct and confident proxies for each training sample, suppressing erroneous signals. Initializing UNIEGO as a learned convex combination of proxy parameters stabilizes this distillation. The model achieves state-of-the-art performance in action recognition, video retrieval, and action segmentation across three challenging ego-exo benchmarks.
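The convex-combination initialization mentioned above can be illustrated with a minimal sketch. It assumes that all proxies share one architecture (so their parameters align by name and shape) and that the combination weights come from a softmax over learnable logits; both are assumptions for illustration, and the names `convex_combine`, `proxies`, and `layer.weight` are hypothetical rather than taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def convex_combine(proxy_params, logits):
    """Blend proxy parameters with non-negative weights that sum to 1.

    proxy_params: list of dicts mapping a parameter name to a flat list
                  of floats (assumed identical names and sizes per proxy).
    logits:       one learnable scalar per proxy (assumed parameterization).
    """
    weights = softmax(logits)  # softmax guarantees a convex combination
    combined = {}
    for name in proxy_params[0]:
        size = len(proxy_params[0][name])
        combined[name] = [
            sum(w * p[name][i] for w, p in zip(weights, proxy_params))
            for i in range(size)
        ]
    return weights, combined


# Three toy proxies, each with a single two-element parameter.
proxies = [
    {"layer.weight": [1.0, 0.0]},
    {"layer.weight": [0.0, 1.0]},
    {"layer.weight": [1.0, 1.0]},
]
weights, student_init = convex_combine(proxies, logits=[0.0, 0.0, 0.0])
```

With equal logits every proxy contributes one third, so the student starts at the plain average of the proxies. In a real training setup the logits would be optimized, which this sketch does not attempt.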

Key takeaway

Machine learning engineers developing robust egocentric video understanding models should consider a proxy-mediated multi-teacher distillation approach. It unifies knowledge from multiple viewpoints and modalities while avoiding the conflicting gradients inherent in direct distillation. Adaptive proxy selection and initialization from proxy parameters are worth exploring to stabilize training and improve performance on tasks such as action recognition and video retrieval.

Key insights

Proxies mediate diverse teacher knowledge for unified egocentric video representation learning, achieving state-of-the-art performance.

Method

A hierarchical multi-teacher distillation framework uses representation-specific Proxy models to homogenize diverse teacher knowledge. Selective Proxy Distillation (SPD) then adaptively selects confident proxies, with UNIEGO initialized as a convex combination of proxy parameters.
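The per-sample selection step can be sketched as follows. The summary only says that SPD keeps proxies that are correct and confident, so the rest is assumed for illustration: a proxy counts as correct when its top prediction matches the label, as confident when that probability reaches a hypothetical `min_confidence` threshold, and the student is pulled toward the selected proxies with an averaged KL divergence over class probabilities. The paper may define these criteria and the loss differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_proxies(proxy_logits, label, min_confidence=0.5):
    """Indices of proxies whose top class equals `label` with enough confidence."""
    selected = []
    for idx, logits in enumerate(proxy_logits):
        probs = softmax(logits)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted == label and probs[predicted] >= min_confidence:
            selected.append(idx)
    return selected


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def spd_loss(student_logits, proxy_logits, label, min_confidence=0.5):
    """Average KL from each selected proxy to the student for one sample."""
    selected = select_proxies(proxy_logits, label, min_confidence)
    if not selected:
        return 0.0, selected  # no trustworthy proxy: skip distillation here
    student_probs = softmax(student_logits)
    total = sum(
        kl_divergence(softmax(proxy_logits[i]), student_probs) for i in selected
    )
    return total / len(selected), selected


# One sample with true label 0 and three toy proxies.
proxy_logits = [
    [4.0, 0.0, 0.0],  # correct and confident: kept
    [0.0, 3.0, 0.0],  # confident but wrong: dropped
    [0.4, 0.3, 0.2],  # correct but unsure: dropped
]
loss, selected = spd_loss([0.0, 0.0, 0.0], proxy_logits, label=0)
```

Only the first proxy passes both checks in this toy sample, so the wrong and the uncertain proxies contribute no gradient signal, which is the suppression effect the method description attributes to SPD.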

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.