Molmo 2 | Counting objects and actions
Summary
Zama, a developer of Molmo 2, demonstrates the model's video capabilities in counting and pointing to objects and actions, and compares it against Gemini 3. In a Seattle skyline video, Molmo 2 identifies and points to landmarks such as Mount Rainier and the Space Needle, and it handles ambiguous queries like "tallest building" and partially obscured objects like a Ferris wheel. It also accurately counts 46 buildings in a complex scene and five flips performed by a dancer in a motion video. In the direct comparison, Molmo 2 correctly counts five flips, while Gemini 3 counts four. Gemini 3, however, demonstrates superior temporal understanding: in a day-to-night transition video, it correctly locates Mount Rainier "at night" at a later timestamp, whereas Molmo 2's temporal understanding is noted as an area for improvement. Both models struggle with precise object counting in cluttered scenes, with Gemini 3 providing better bounding box identification for larger buildings.
Key takeaway
For AI scientists and computer vision engineers evaluating video analysis models, Molmo 2 offers strong performance in counting and pointing to objects and actions, making it suitable for tasks that require precise enumeration. However, if your application demands sophisticated temporal understanding, such as identifying events at specific times of day within a video, Gemini 3 currently has the advantage. Expect both models to have difficulty with heavily occluded or numerous objects in complex scenes.
Key insights
Molmo 2 excels at counting and pointing to objects and actions in video, while Gemini 3 shows stronger temporal understanding.
Principles
- Spatial queries require robust object recognition.
- Temporal understanding enhances video analysis.
- Occlusion significantly challenges object counting.
Method
Molmo 2 leverages a language model for object knowledge and is trained for spatial and action-based queries, enabling it to point to and count objects and actions in video frames.
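As a rough illustration of counting by pointing, the sketch below treats each predicted point as one counted object or action, so the count for a query is the number of points returned for it. The `Point` structure, its field names, and the sample values are assumptions made for this example; the source does not describe the model's actual output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """One hypothetical pointing result: a timestamp plus a frame position."""
    time_s: float  # when in the video the point was placed (seconds)
    x: float       # horizontal position, normalized to [0, 1]
    y: float       # vertical position, normalized to [0, 1]
    label: str     # what the point refers to, e.g. "flip" or "mountain"


def count_by_pointing(points):
    """Count objects or actions as the number of points per label."""
    return Counter(p.label for p in points)


# Made-up points for illustration only; not real model output.
points = [
    Point(1.2, 0.48, 0.55, "flip"),
    Point(3.0, 0.52, 0.50, "flip"),
    Point(4.7, 0.50, 0.53, "flip"),
    Point(0.0, 0.10, 0.30, "mountain"),
]
counts = count_by_pointing(points)
print(counts["flip"])
```

Under this assumption, a missed or duplicated point changes the count directly, which is one way to read why occluded or densely packed objects are hard to count.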
In practice
- Use Molmo 2 for precise action counting.
- Consider Gemini 3 for temporal video queries.
- Anticipate challenges with occluded object counting.
Topics
- Visual Language Models
- Video Object Counting
- Video Action Counting
- Spatial-Temporal AI
- AI Model Comparison
Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Ai2.