From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

2026-06-13 · Source: Computer Vision and Pattern Recognition · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Expert, quick

Summary

The approach decouples visual perception from symbolic reasoning for egocentric action recognition by converting videos into Temporal Action Graphs. A multi-stage prompting pipeline first generates dense natural-language narratives from short temporal windows, then formalizes them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, few-shot graph demonstrations significantly improve accuracy over both zero-shot frame-based and zero-shot graph-based inference. Even in the zero-shot setting, graph-based reasoning is competitive with pixel-based inference. Results across 11 open-weight Vision-Language Models (VLMs), spanning 2B to 235B parameters from six model families, indicate that current VLMs are stronger symbolic reasoners than direct visual observers. The method offers a scalable, fine-tuning-free alternative that exploits VLMs' latent reasoning strengths by projecting video into the language domain.

Key takeaway

Machine Learning Engineers building egocentric action recognition systems should consider a graph-based symbolic reasoning approach. Converting video into Temporal Action Graphs plays to Vision-Language Models' strengths as symbolic reasoners rather than direct visual observers. The approach is scalable, requires no fine-tuning, and gains significant accuracy from few-shot graph demonstrations.

Key insights

Decoupling visual perception from symbolic reasoning via temporal graphs enhances VLM performance in egocentric action recognition.

Principles

VLMs are stronger symbolic reasoners than visual observers.
Semantic bottlenecks improve VLM action recognition.
Graph representations enable efficient in-context learning.

Method

A multi-stage prompting pipeline generates natural language narratives from video, then formalizes them into structured, open-vocabulary Temporal Action Graphs for VLM processing.
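
To make the idea of a structured graph concrete, here is a minimal sketch of what a Temporal Action Graph could look like as a data structure. The schema is an assumption for illustration only: the class names, the `then` edge relation, and the text serialization are hypothetical and are not taken from the original article, which does not specify the graph format in this summary.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ActionNode:
    """One step narrated from a short temporal window (hypothetical schema)."""
    node_id: int
    verb: str            # open-vocabulary verb, e.g. "pick up"
    objects: list[str]   # open-vocabulary objects, e.g. ["knife"]
    start_s: float       # window start, in seconds
    end_s: float         # window end, in seconds


@dataclass
class TemporalActionGraph:
    """Nodes are steps; edges are (source, relation, target) triples."""
    nodes: list[ActionNode] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)

    def add_step(self, verb, objects, start_s, end_s):
        """Append a step and link it to the previous one with a 'then' edge."""
        node = ActionNode(len(self.nodes), verb, list(objects), start_s, end_s)
        if self.nodes:
            self.edges.append((self.nodes[-1].node_id, "then", node.node_id))
        self.nodes.append(node)
        return node

    def to_text(self):
        """Serialize the graph to plain text that can be placed in a prompt."""
        lines = [
            f"n{n.node_id}: {n.verb}({', '.join(n.objects)}) "
            f"[{n.start_s:.1f}-{n.end_s:.1f}s]"
            for n in self.nodes
        ]
        lines += [f"n{src} -{rel}-> n{dst}" for src, rel, dst in self.edges]
        return "\n".join(lines)


graph = TemporalActionGraph()
graph.add_step("pick up", ["knife"], 0.0, 1.5)
graph.add_step("cut", ["tomato", "knife"], 1.5, 4.0)
print(graph.to_text())
```

In a real pipeline the steps would come from a VLM's narration of each window rather than from hand-written calls; the point of the sketch is only that the downstream reasoner sees compact text, not pixels.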

In practice

Convert egocentric video to temporal action graphs.
Use few-shot graph demonstrations for accuracy gains.
Employ VLMs for symbolic reasoning over visual observation.
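
The few-shot step above can be sketched as plain prompt assembly: labeled graph demonstrations are placed before the query graph, and the model is asked to complete the final label. This is an illustrative sketch, not the article's prompt; the function name, the prompt wording, and the use of a fixed candidate label list are all assumptions, and the call to an actual VLM is deliberately left out.

```python
def build_few_shot_prompt(demos, query_graph, candidate_labels):
    """Assemble an in-context prompt from serialized graphs.

    demos: list of (graph_text, label) pairs used as demonstrations.
    query_graph: serialized graph of the clip to classify.
    candidate_labels: assumed closed set of action labels.
    All names and the prompt layout are hypothetical.
    """
    parts = [
        "Classify the action described by each temporal action graph.",
        "Candidate actions: " + ", ".join(candidate_labels),
        "",
    ]
    for i, (graph_text, label) in enumerate(demos, start=1):
        parts += [f"Example {i}", "Graph:", graph_text, f"Action: {label}", ""]
    # The query ends with an open "Action:" slot for the model to fill in.
    parts += ["Query", "Graph:", query_graph, "Action:"]
    return "\n".join(parts)


demos = [
    ("n0: pick up(knife)\nn1: cut(tomato, knife)\nn0 -then-> n1", "cut tomato"),
]
prompt = build_few_shot_prompt(
    demos,
    query_graph="n0: open(tap)\nn1: rinse(cup)\nn0 -then-> n1",
    candidate_labels=["cut tomato", "wash cup"],
)
print(prompt)
```

Because each demonstration is a few lines of text rather than a stack of frames, several of them fit in one context window, which is one plausible reading of why graph representations make in-context learning efficient.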

Topics

Egocentric Action Recognition
Vision-Language Models
Temporal Action Graphs
Symbolic Reasoning
In-Context Learning
Video-to-Language Conversion

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.