Molmo 2 | Counting objects and actions

2025-12-16 · Source: Ai2 · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, medium

Summary

Zama, a developer of Molmo 2, demonstrates the model's ability to count and point to objects and actions in video, comparing it against Gemini 3. Molmo 2 identifies and points to landmarks such as Mount Rainier and the Space Needle in a Seattle skyline video, and it handles ambiguous queries like "tallest building" and partially obscured objects like a Ferris wheel. It also accurately counts 46 buildings in a complex scene and five flips performed by a dancer in a motion video. In the direct comparison, Molmo 2 correctly counts five flips, while Gemini 3 counts four. Gemini 3, however, demonstrates superior temporal understanding by accurately identifying Mount Rainier "at night" at a later timestamp in a day-to-night transition video, whereas Molmo 2's temporal understanding is noted as an area for improvement. Both models struggle with precise object counting in cluttered scenes, with Gemini 3 providing better bounding box identification for larger buildings.

Key takeaway

For AI scientists and computer vision engineers evaluating video analysis models, Molmo 2 offers strong performance in counting and pointing to objects and actions, making it suitable for tasks that require precise enumeration. However, if your application demands sophisticated temporal understanding, such as identifying events at specific times of day within a video, Gemini 3 currently has the advantage. Expect both models to have difficulty with heavily occluded or very numerous objects in complex scenes.

Key insights

Molmo 2 excels in video object and action counting/pointing, while Gemini 3 shows stronger temporal understanding.

Principles

Spatial queries require robust object recognition.
Temporal understanding enhances video analysis.
Occlusion significantly challenges object counting.

Method

Molmo 2 leverages a language model for object knowledge and is trained for spatial and action-based queries, enabling it to point to and count objects and actions in video frames.

In practice

Use Molmo 2 for precise action counting.
Consider Gemini 3 for temporal video queries.
Anticipate challenges with occluded object counting.

Topics

Visual Language Models
Video Object Counting
Video Action Counting
Spatial-Temporal AI
AI Model Comparison

Best for: Computer Vision Engineer, AI Scientist, Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Ai2.