From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

· Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The paper tackles egocentric action recognition by decoupling visual perception from symbolic reasoning: videos are converted into Temporal Action Graphs. A multi-stage prompting pipeline first generates dense natural language narratives from short temporal windows, then formalizes them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, few-shot graph demonstrations significantly improve accuracy over both zero-shot frame-based and zero-shot graph-based inference. Even in the zero-shot setting, graph-based reasoning is competitive with pixel-based inference. Results across 11 open-weight Vision-Language Models (VLMs), spanning 2B to 235B parameters and six model families, indicate that current VLMs perform better as symbolic reasoners than as direct visual observers. The method offers a scalable, fine-tuning-free alternative that exploits VLMs' latent reasoning strengths by projecting video into the language domain.
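To make the "short temporal windows" step concrete, here is a minimal sketch of how a clip could be split into frame ranges before each range is narrated. The function name `temporal_windows` and the `window` and `stride` parameters are illustrative assumptions; the summary does not state the window length or overlap the authors use.

```python
from typing import List, Tuple


def temporal_windows(num_frames: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Split a clip into half-open [start, end) frame ranges.

    Hypothetical helper: window size and stride are placeholders,
    not values taken from the paper.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans: List[Tuple[int, int]] = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        spans.append((start, end))
        if end == num_frames:
            break  # the last window already reaches the end of the clip
        start += stride
    return spans


if __name__ == "__main__":
    # Each range would be passed to a VLM to produce one narrative sentence.
    for span in temporal_windows(num_frames=10, window=4, stride=4):
        print(span)
```

With `stride` smaller than `window`, consecutive ranges overlap, which is one plausible way to keep narratives continuous across window boundaries.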

Key takeaway

Machine Learning Engineers building egocentric action recognition systems should consider a graph-based symbolic reasoning approach. Converting video into Temporal Action Graphs plays to Vision-Language Models' strengths as symbolic reasoners rather than as direct visual observers. The approach is scalable, requires no fine-tuning, and gains significant accuracy from few-shot demonstrations.

Key insights

Decoupling visual perception from symbolic reasoning via temporal graphs enhances VLM performance in egocentric action recognition.

Method

A multi-stage prompting pipeline generates natural language narratives from video, then formalizes them into structured, open-vocabulary Temporal Action Graphs for VLM processing.
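The sketch below illustrates one way the graph and few-shot stages could fit together: narrated windows become nodes, consecutive nodes are linked by temporal edges, and labeled graphs are serialized as in-context demonstrations ahead of the query graph. Everything here is an assumption made for illustration. The class names `ActionNode` and `TemporalActionGraph`, the verb/object schema, the text serialization, and the prompt wording are hypothetical and are not the authors' actual format.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class ActionNode:
    """One narrated window, reduced to an open-vocabulary verb and its objects."""
    window_index: int
    verb: str
    objects: List[str] = field(default_factory=list)


@dataclass
class TemporalActionGraph:
    """Hypothetical graph: nodes in time order, edges between consecutive nodes."""
    nodes: List[ActionNode] = field(default_factory=list)

    def edges(self) -> List[Tuple[int, int]]:
        return [(i, i + 1) for i in range(len(self.nodes) - 1)]

    def to_text(self) -> str:
        lines = []
        for i, node in enumerate(self.nodes):
            objs = ", ".join(node.objects) if node.objects else "-"
            lines.append(f"[{i}] w{node.window_index}: {node.verb}({objs})")
        for a, b in self.edges():
            lines.append(f"[{a}] -> [{b}]")
        return "\n".join(lines)


def build_few_shot_prompt(
    demos: Sequence[Tuple[TemporalActionGraph, str]],
    query: TemporalActionGraph,
    labels: Sequence[str],
) -> str:
    """Assemble a text-only prompt: labeled demo graphs, then the unlabeled query."""
    parts = [
        "Classify the action described by each graph.",
        "Candidate labels: " + ", ".join(labels),
    ]
    for graph, label in demos:
        parts.append("Graph:\n" + graph.to_text() + "\nAction: " + label)
    parts.append("Graph:\n" + query.to_text() + "\nAction:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = TemporalActionGraph(
        [ActionNode(0, "take", ["knife"]), ActionNode(1, "cut", ["tomato", "knife"])]
    )
    query = TemporalActionGraph(
        [ActionNode(0, "open", ["fridge"]), ActionNode(1, "take", ["milk"])]
    )
    print(build_few_shot_prompt([(demo, "cut tomato")], query, ["cut tomato", "take milk"]))
```

Because the prompt is plain text, the final classification step needs no pixels, which is the sense in which the pipeline treats the VLM as a symbolic reasoner. Passing an empty `demos` list gives the zero-shot graph setting.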

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.