From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models
Summary
This work tackles egocentric action recognition by decoupling visual perception from symbolic reasoning: videos are converted into Temporal Action Graphs. A multi-stage prompting pipeline first generates dense natural language narratives from short temporal windows, then formalizes them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, few-shot graph demonstrations significantly boost accuracy over both zero-shot frame-based and zero-shot graph-based inference. Even in the zero-shot setting, graph-based reasoning is competitive with pixel-based inference. Results across 11 open-weight Vision-Language Models (VLMs), spanning 2B to 235B parameters and six model families, indicate that current VLMs are stronger as symbolic reasoners than as direct visual observers. By projecting video into the language domain, the method offers a scalable, fine-tuning-free alternative that draws on the latent reasoning strengths of VLMs.
Key takeaway
Machine Learning Engineers developing egocentric action recognition systems should consider a graph-based symbolic reasoning approach. Converting video into Temporal Action Graphs uses Vision-Language Models as symbolic reasoners rather than as direct visual observers. The approach is a scalable, fine-tuning-free alternative whose accuracy improves significantly with few-shot graph demonstrations.
Key insights
Decoupling visual perception from symbolic reasoning via temporal graphs enhances VLM performance in egocentric action recognition.
Principles
- VLMs are stronger symbolic reasoners than visual observers.
- Semantic bottlenecks improve VLM action recognition.
- Graph representations enable efficient in-context learning.
Method
A multi-stage prompting pipeline generates natural language narratives from video, then formalizes them into structured, open-vocabulary Temporal Action Graphs for VLM processing.
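The two stages described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the node schema (`verb`, `obj`, `window`), the `"then"` edge relation, the text serialization, and the `narrate` and `formalize` callables (stand-ins for the two model prompts) are all assumptions introduced here, since the summary does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class ActionNode:
    # Hypothetical schema: the source does not specify node fields.
    verb: str
    obj: str
    window: int  # index of the temporal window the action came from


@dataclass
class TemporalActionGraph:
    nodes: List[ActionNode] = field(default_factory=list)
    # Each edge is (source node index, target node index, relation name).
    edges: List[Tuple[int, int, str]] = field(default_factory=list)

    def to_text(self) -> str:
        """Serialize the graph as plain text so it can be placed in a prompt."""
        lines = [f"n{i}: {n.verb}({n.obj}) @w{n.window}"
                 for i, n in enumerate(self.nodes)]
        lines += [f"n{s} -{r}-> n{d}" for s, d, r in self.edges]
        return "\n".join(lines)


def split_windows(frames: Sequence, size: int) -> List[Sequence]:
    """Cut the frame sequence into consecutive, non-overlapping windows."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def video_to_graph(
    frames: Sequence,
    window_size: int,
    narrate: Callable[[Sequence], str],
    formalize: Callable[[str], List[Tuple[str, str]]],
) -> TemporalActionGraph:
    """Stage 1: narrate each window. Stage 2: formalize narratives into nodes."""
    graph = TemporalActionGraph()
    for w, window in enumerate(split_windows(frames, window_size)):
        narrative = narrate(window)              # assumed model call (stubbed)
        for verb, obj in formalize(narrative):   # assumed model call (stubbed)
            graph.nodes.append(ActionNode(verb, obj, w))
            if len(graph.nodes) > 1:
                # Link each action to the one before it in temporal order.
                last = len(graph.nodes) - 1
                graph.edges.append((last - 1, last, "then"))
    return graph


# Usage with toy stubs in place of real model calls.
def fake_narrate(window):
    return "take knife" if window[0] == 0 else "cut onion"


def fake_formalize(narrative):
    verb, obj = narrative.split()
    return [(verb, obj)]


demo_graph = video_to_graph(list(range(4)), 2, fake_narrate, fake_formalize)
print(demo_graph.to_text())
```

Keeping the model calls behind plain callables means the same graph-building code can be exercised with stubs, as here, or with whichever VLM is actually used for narration and formalization.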
In practice
- Convert egocentric video to temporal action graphs.
- Use few-shot graph demonstrations for accuracy gains.
- Employ VLMs for symbolic reasoning rather than direct visual observation.
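As a rough illustration of the few-shot step, the sketch below assembles an in-context prompt from labeled graph demonstrations plus a query graph. The prompt wording, the function name `build_few_shot_prompt`, and the text form of the graphs are hypothetical; the summary does not describe the actual prompt format.

```python
from typing import List, Sequence, Tuple


def build_few_shot_prompt(
    demos: Sequence[Tuple[str, str]],
    query_graph: str,
    candidate_labels: Sequence[str],
) -> str:
    """Build a text prompt from (graph_text, action_label) demonstrations.

    With an empty `demos` sequence this reduces to a zero-shot graph prompt.
    """
    parts: List[str] = [
        "Classify the action described by each temporal action graph.",
        "Candidate actions: " + ", ".join(candidate_labels),
        "",
    ]
    for i, (graph_text, label) in enumerate(demos, start=1):
        parts += [f"Example {i}", "Graph:", graph_text, f"Action: {label}", ""]
    # The query is left open so the model completes the final "Action:" line.
    parts += ["Query", "Graph:", query_graph, "Action:"]
    return "\n".join(parts)


# Usage with toy graph strings (illustrative data only).
demos = [
    ("n0: take(knife) @w0\nn1: cut(onion) @w1\nn0 -then-> n1", "cut onion"),
    ("n0: open(fridge) @w0\nn1: take(milk) @w1\nn0 -then-> n1", "take milk"),
]
prompt = build_few_shot_prompt(
    demos,
    "n0: take(pan) @w0\nn1: pour(oil) @w1\nn0 -then-> n1",
    ["cut onion", "take milk", "pour oil"],
)
print(prompt)
```

Because the demonstrations are short text graphs rather than frames, several of them can fit in one context window, which is what makes this kind of in-context learning practical.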
Topics
- Egocentric Action Recognition
- Vision-Language Models
- Temporal Action Graphs
- Symbolic Reasoning
- In-Context Learning
- Video-to-Language Conversion
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Computer Vision Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computer Vision and Pattern Recognition.