I Ran the 3B Model That Beat Gemini 3 Pro at Olympiad Math — It Shouldn't Work

2026-06-19 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Intermediate, quick

Summary

VibeThinker-3B, a 3-billion-parameter model, scored 94.3 on AIME 2026, ahead of Gemini 3 Pro's 91.7. The model is MIT-licensed, and its 1.5B predecessor cost only $7,800 to post-train. It delivers first-tier performance on verifiable reasoning tasks such as competition math and competitive coding, and it is small enough to run on a laptop, which sets it apart from larger general-knowledge models. Its release on June 15, 2026 met immediate skepticism from an AI community that has grown wary of "benchmark theater": critics questioned whether such narrow benchmarks reflect real-world usefulness and accused the model of "benchmaxxing."
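
A rough back-of-the-envelope estimate helps show why a model of this size can fit on a laptop. The sketch below counts weight storage only and ignores the KV cache, activations, and runtime overhead. The precision levels are illustrative assumptions, not formats the source confirms for VibeThinker-3B, and the helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# Nominal parameter count taken from the "3B" in the model name (an approximation).
PARAMS_3B = 3e9

# Illustrative precisions; actual released formats are not stated in the source.
for label, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label}: ~{weight_memory_gb(PARAMS_3B, bits):.1f} GB")
```

Under these assumptions the weights need roughly 6 GB at 16-bit precision and about 1.5 GB at 4-bit, which is within reach of ordinary laptop memory.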

Key takeaway

Machine learning engineers evaluating models for specific, verifiable reasoning tasks such as competitive programming or advanced mathematics should consider smaller, specialized models. VibeThinker-3B shows that top-tier performance in these domains is achievable at 3 billion parameters, with significant advantages in deployment cost and local execution over larger, general-purpose alternatives. The focus shifts from raw parameter count to task-specific optimization.

Key insights

A small, cost-effective 3B model achieved top-tier performance on verifiable reasoning tasks, challenging assumptions about the link between model size and capability.

Principles

Specific reasoning benchmarks can reveal capabilities not evident in general knowledge tests.
Smaller models can achieve competitive performance on specialized, verifiable tasks.
Cost-effective post-training can yield high-performing, deployable models.

In practice

Run 3B models locally for competitive math or coding tasks.
Evaluate specialized models against specific, verifiable reasoning benchmarks.
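
The second point can be sketched as a small exact-match scorer. AIME answers are integers from 0 to 999, so a response can be checked mechanically against the reference answer. This is a minimal sketch, not the benchmark's official harness: `generate` stands for whatever local inference call you use, and `extract_final_integer`, `exact_match_accuracy`, and the toy problems are hypothetical names and data introduced for illustration.

```python
import re
from typing import Callable, Iterable, Optional, Tuple


def extract_final_integer(text: str) -> Optional[int]:
    """Return the last integer that appears in the model output, or None."""
    matches = re.findall(r"-?\d+", text)
    return int(matches[-1]) if matches else None


def exact_match_accuracy(
    generate: Callable[[str], str],
    problems: Iterable[Tuple[str, int]],
) -> float:
    """Fraction of problems whose extracted final integer equals the reference."""
    problems = list(problems)
    if not problems:
        return 0.0
    correct = sum(
        1 for question, answer in problems
        if extract_final_integer(generate(question)) == answer
    )
    return correct / len(problems)


# Stand-in for a real local model call (assumption: replace with your own inference code).
def stub_generate(prompt: str) -> str:
    canned = {
        "What is 2 + 2?": "Adding gives 4.",
        "What is 7 * 6?": "The answer is 41.",
    }
    return canned.get(prompt, "")


# Toy data for illustration only; not real benchmark problems.
toy_problems = [("What is 2 + 2?", 4), ("What is 7 * 6?", 42)]
print(exact_match_accuracy(stub_generate, toy_problems))  # prints 0.5
```

Taking the last integer in the output is a simplification. A stricter harness would require a fixed answer format and score on held-out problems, which is one way to probe the "benchmaxxing" concern.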

Topics

VibeThinker-3B
Small Language Models
Mathematical Reasoning
Competitive Programming
Benchmark Evaluation
Model Efficiency

Best for: AI Engineer, Research Scientist, Entrepreneur, AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.