TAI #208: Open Models Find Their Role as Agent Token Bills Rise

2026-06-09 · Source: Towards AI - Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Advanced, extended

Summary

This week brought a wave of cheaper and open AI model releases, driven by rapidly rising token consumption from long-running agent systems. Microsoft unveiled seven in-house MAI models, including MAI-Thinking-1, a Mixture-of-Experts model with 35 billion active parameters, trained on 30 trillion tokens, that achieves 52.8% on SWE-Bench Pro. Google introduced Gemma 4 12B, a local multimodal dense model with 11.95B parameters and a 256K context window, and MiniMax launched M3 via API, which scored 54.7 on the Intelligence Index. NVIDIA's Nemotron 3 Ultra, with 550 billion total parameters, also saw updates. These models are settling into high-volume, lower-cost tasks: Vercel data shows DeepSeek handling 17% of token volume for 1% of spend. OpenAI expanded Codex with role-specific plugins and updated GPT-Rosalind for life sciences, while Apple introduced Core AI for custom on-device models. OpenAI also rolled out Dreaming V3 for ChatGPT, improving factual recall to 82.8%.
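To make the Vercel figure concrete, the short sketch below works out what "17% of token volume for 1% of spend" implies about relative price per token. It assumes both shares are measured over the same traffic and that spend scales with tokens times an average price; the function name and that blended-average reading are illustrative assumptions, not something the source states.

```python
def relative_token_price(volume_share: float, spend_share: float) -> float:
    """Average price per token for one model, relative to all other traffic.

    Assumption: both shares are fractions of the same total, so
    price is proportional to spend_share / volume_share.
    """
    if not 0 < volume_share < 1:
        raise ValueError("volume_share must be strictly between 0 and 1")
    if not 0 <= spend_share < 1:
        raise ValueError("spend_share must be in [0, 1)")
    own_price = spend_share / volume_share
    rest_price = (1 - spend_share) / (1 - volume_share)
    return own_price / rest_price


# Shares quoted in the summary above: 17% of tokens, 1% of spend.
ratio = relative_token_price(volume_share=0.17, spend_share=0.01)
print(f"relative price per token: {ratio:.3f}")
```

Under those assumptions the ratio comes out near 0.05, meaning the cheaper traffic costs roughly one-twentieth as much per token as the blended average of everything else.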

Key takeaway

AI engineers optimizing agentic workflows should adopt a tiered model selection strategy. Route high-volume, routine tasks such as data extraction or summarization to cheaper open-weight models to contain rising token costs. Reserve frontier models for complex decisions, final synthesis, and tasks where quality and reliability are paramount. Test hosted endpoints with real traces, measuring retry and human-intervention rates to confirm that cost savings are not offset by increased supervision. Consider self-hosting when data residency or high utilization requires it.
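The tiering described above can be sketched as a small per-step router. This is a minimal illustration, not a production design: the `Step` fields, the task-kind labels, and the rule of defaulting to the frontier tier when a step is unrecognized are all assumptions introduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    OPEN_WEIGHT = "open_weight"  # cheaper, high-volume tier
    FRONTIER = "frontier"        # reserved for quality-critical steps


@dataclass(frozen=True)
class Step:
    name: str
    kind: str          # hypothetical label, e.g. "extraction"
    verifiable: bool   # can the output be checked automatically?


# Hypothetical groupings based on the examples named in the takeaway.
ROUTINE_KINDS = {"extraction", "summarization"}
CRITICAL_KINDS = {"complex_decision", "final_synthesis"}


def choose_tier(step: Step) -> Tier:
    """Pick a model tier for one workflow step."""
    if step.kind in CRITICAL_KINDS:
        return Tier.FRONTIER
    if step.kind in ROUTINE_KINDS and step.verifiable:
        return Tier.OPEN_WEIGHT
    # Assumption: fall back to the frontier tier when unsure.
    return Tier.FRONTIER


workflow = [
    Step("pull invoice fields", "extraction", verifiable=True),
    Step("condense call notes", "summarization", verifiable=True),
    Step("write final report", "final_synthesis", verifiable=False),
]
plan = {step.name: choose_tier(step) for step in workflow}
```

Routing on the step rather than on the product keeps the decision local: one workflow can mix both tiers, and a step can be moved between tiers without touching the rest.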

Key insights

Rising agent token costs necessitate a tiered model strategy, using cheaper open models for high-volume, verifiable tasks.

Principles

Assign models per workflow step, not per product.
Prioritize cost per verified result over raw token savings.
Base capabilities can be built from human-generated data.
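One way to read "cost per verified result" is as the expected spend to obtain one output that passes verification, retries and human review included. The sketch below assumes independent retries (so expected attempts are 1 / pass rate) and a fixed review cost charged on a fraction of attempts; the formula, parameter names, and all numbers are illustrative assumptions rather than figures from the article.

```python
def cost_per_verified_result(
    cost_per_attempt: float,
    pass_rate: float,
    intervention_rate: float = 0.0,
    cost_per_intervention: float = 0.0,
) -> float:
    """Expected cost to obtain one result that passes verification.

    Assumptions: attempts are independent with the same pass rate, and
    a human steps in on a fixed fraction of attempts.
    """
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    if not 0 <= intervention_rate <= 1:
        raise ValueError("intervention_rate must be in [0, 1]")
    expected_attempts = 1 / pass_rate
    per_attempt = cost_per_attempt + intervention_rate * cost_per_intervention
    return expected_attempts * per_attempt


# Made-up numbers: the cheap model is 10x cheaper per attempt but
# fails more often and needs a human more often.
cheap = cost_per_verified_result(0.002, pass_rate=0.80,
                                 intervention_rate=0.05,
                                 cost_per_intervention=0.50)
frontier = cost_per_verified_result(0.02, pass_rate=0.95,
                                    intervention_rate=0.01,
                                    cost_per_intervention=0.50)
```

With these invented inputs the cheaper model ends up more expensive per verified result, because supervision cost dominates token cost. That is the case the principle warns about; with a lower intervention rate the comparison flips.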

In practice

Route high-volume, narrow tasks to cheaper models.
Test hosted endpoints with real traces to assess performance.
Consider self-hosting for data residency or custom tuning needs.
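Testing an endpoint with real traces comes down to replaying recorded tasks and tallying how often they needed a retry or a human. The sketch below shows only the tallying step, under the assumption that each replayed trace has already been reduced to a small record; the `TraceResult` fields and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceResult:
    attempts: int        # model calls made before the step finished
    needed_human: bool   # a person had to step in
    verified: bool       # final output passed the automated check


def summarize(results: list[TraceResult]) -> dict[str, float]:
    """Aggregate replayed traces into the rates worth comparing across models."""
    total = len(results)
    if total == 0:
        raise ValueError("no trace results to summarize")
    return {
        "retry_rate": sum(r.attempts > 1 for r in results) / total,
        "intervention_rate": sum(r.needed_human for r in results) / total,
        "verified_rate": sum(r.verified for r in results) / total,
    }


# Hypothetical results from replaying four recorded traces.
sample = [
    TraceResult(attempts=1, needed_human=False, verified=True),
    TraceResult(attempts=2, needed_human=False, verified=True),
    TraceResult(attempts=3, needed_human=True, verified=True),
    TraceResult(attempts=1, needed_human=False, verified=False),
]
rates = summarize(sample)
```

Running the same traces against each candidate endpoint gives rates that can be compared directly, which is more telling than per-token price alone.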

Topics

Open-Weight Models
AI Agents
LLM Inference Optimization
Multimodal AI
On-Device AI
KV-Cache Quantization

Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.