Molmo 2 | Complex video question answering

2025-12-16 · Source: Ai2 · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Intermediate, short

Summary

Molmo 2 is an open-source video question-answering model, demonstrated here by developer Christopher Clark. In the demo, the model answers complex, long-form questions about a video of a professional soccer game. It correctly identifies the blue team as the scorer and describes the player who scored in detail, including jersey details and actions. It also offers coaching advice, pinpointing the white team's critical mistake as failing to clear the ball. When speculating about the tournament, it points to the "Champions" branding and the professional broadcast quality as clues that the match belongs to a European football tournament such as the UEFA Champions League. The model also follows stylistic instructions, describing the goal in the voice of a soccer announcer, with emojis. A key part of Molmo 2's development was creating new open-source data and training code to enable such detailed, instruction-following responses from video models, addressing the common reliance on closed-source training data.

Key takeaway

For AI scientists developing video understanding models, Molmo 2 highlights the critical role of open-source data and training code in achieving sophisticated instruction-following capabilities. Explore open-source dataset creation and model training strategies to advance video Q&A, especially for applications that require detailed analysis and contextual reasoning beyond simple object recognition.

Key insights

Molmo 2 demonstrates advanced open-source video Q&A capabilities through detailed, instruction-following responses, enabled by new open-source training data.

Principles

Open-source data enables complex video Q&A.
Contextual clues support speculative video analysis.

Method

Molmo 2's development focused on creating new open-source data and training code so that the model can give multi-sentence, instruction-following answers about video input, rather than relying on closed-source training data.

In practice

Analyze video for specific player descriptions.
Extract tactical insights from game footage.
Infer context from visual branding and production quality.

Topics

Molmo 2
Video Question Answering
Open-Source Models
Instruction Following
Multimodal AI

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Ai2.