TAI #208: Open Models Find Their Role as Agent Token Bills Rise
Summary
This week brought a wave of cheaper and open AI model releases, driven by rapidly rising token consumption from long-running agent systems. Microsoft unveiled seven in-house MAI models, including MAI-Thinking-1, a Mixture-of-Experts model with 35 billion active parameters that was trained on 30 trillion tokens and scores 52.8% on SWE-Bench Pro. Google introduced Gemma 4 12B, a local multimodal model with 11.95B dense parameters and a 256K context window, and MiniMax launched M3 via API, which scored 54.7 on the Intelligence Index. NVIDIA's Nemotron 3 Ultra, with 550 billion total parameters, also saw updates. These models are finding roles in high-volume, lower-cost tasks: Vercel data shows DeepSeek handling 17% of token volume for 1% of spend. Elsewhere, OpenAI expanded Codex with role-specific plugins, updated GPT-Rosalind for life sciences, and rolled out Dreaming V3 for ChatGPT, improving factual recall to 82.8%, while Apple introduced Core AI for on-device custom models.
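To make the Vercel figure concrete, the short calculation below converts the two shares into a relative price per token. It is a rough sketch that assumes both percentages describe the same traffic over the same period and that a single blended price per token is meaningful; the function name is illustrative and not taken from the article.

```python
def relative_price_per_token(token_share: float, spend_share: float) -> float:
    """Blended price per token for one model family, relative to the rest of the mix.

    Assumes both shares are fractions of the same totals (an assumption,
    not something the summary states explicitly).
    """
    if not (0 < token_share < 1 and 0 <= spend_share < 1):
        raise ValueError("shares must be fractions strictly inside (0, 1)")
    own_price = spend_share / token_share                # spend per unit of token share
    rest_price = (1 - spend_share) / (1 - token_share)   # same ratio for everything else
    return own_price / rest_price


# 17% of token volume for 1% of spend, as quoted in the summary.
ratio = relative_price_per_token(token_share=0.17, spend_share=0.01)
print(f"{ratio:.3f}")  # 0.049
```

Under those assumptions, the DeepSeek traffic costs about one twentieth as much per token as the rest of the mix.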
Key takeaway
AI engineers optimizing agentic workflows should adopt a tiered model selection strategy. Route high-volume, routine tasks like data extraction or summarization to cheaper, open-weight models to manage rising token costs. Reserve frontier models for complex decisions, final synthesis, and tasks where quality and reliability are paramount. Test hosted endpoints with real traces, measuring retry and human intervention rates to ensure cost savings aren't offset by increased supervision. Consider self-hosting when data residency or high utilization requires it.
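A minimal sketch of what such tiered routing could look like follows. The model names, the `Step` fields, and the rule of defaulting to the frontier tier when a step is not clearly routine are all assumptions made for illustration, not details from the article.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"
    FRONTIER = "frontier"


# Hypothetical endpoint names; substitute whatever models you actually deploy.
MODEL_FOR_TIER = {
    Tier.CHEAP: "open-weight-model",
    Tier.FRONTIER: "frontier-model",
}


@dataclass(frozen=True)
class Step:
    name: str
    high_volume: bool   # runs many times per workflow
    verifiable: bool    # output can be checked automatically (schema, tests)
    high_stakes: bool   # complex decision or final synthesis


def pick_tier(step: Step) -> Tier:
    """Assign a tier per workflow step rather than per product."""
    if step.high_stakes:
        return Tier.FRONTIER
    if step.high_volume and step.verifiable:
        return Tier.CHEAP
    # Assumption: when in doubt, prefer quality over savings.
    return Tier.FRONTIER


workflow = [
    Step("extract_fields", high_volume=True, verifiable=True, high_stakes=False),
    Step("summarize_logs", high_volume=True, verifiable=True, high_stakes=False),
    Step("final_synthesis", high_volume=False, verifiable=False, high_stakes=True),
]
routing = {step.name: MODEL_FOR_TIER[pick_tier(step)] for step in workflow}
print(routing)
```

Keeping the decision in one small function makes it easy to revisit the rules once real retry and supervision data is available.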
Key insights
Rising agent token costs necessitate a tiered model strategy, using cheaper open models for high-volume, verifiable tasks.
Principles
- Assign models per workflow step, not per product.
- Prioritize cost per verified result over raw token savings.
- Base capabilities can be built from human-generated data.
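The second principle can be made concrete with a small calculation. The sketch below is one possible definition of cost per verified result, counting every attempt plus any human review time; the formula and all numbers in the example are invented for illustration and do not come from the article.

```python
def cost_per_verified_result(
    tokens_per_attempt: int,
    price_per_million_tokens: float,
    attempts: int,
    verified_results: int,
    human_minutes: float = 0.0,
    human_cost_per_minute: float = 0.0,
) -> float:
    """Total spend (model plus human supervision) divided by results that passed checks."""
    if verified_results <= 0:
        raise ValueError("need at least one verified result")
    model_cost = attempts * tokens_per_attempt * price_per_million_tokens / 1_000_000
    human_cost = human_minutes * human_cost_per_minute
    return (model_cost + human_cost) / verified_results


# Made-up scenario: the cheap model retries more and needs more human review.
cheap = cost_per_verified_result(
    tokens_per_attempt=10_000, price_per_million_tokens=0.50,
    attempts=1_300, verified_results=1_000,
    human_minutes=200, human_cost_per_minute=1.0,
)
frontier = cost_per_verified_result(
    tokens_per_attempt=10_000, price_per_million_tokens=10.0,
    attempts=1_050, verified_results=1_000,
    human_minutes=20, human_cost_per_minute=1.0,
)
print(f"cheap: {cheap:.4f}  frontier: {frontier:.4f}")
```

In this invented scenario the model that is 20 times cheaper per token ends up costing more per verified result, which is the case the principle warns about.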
In practice
- Route high-volume, narrow tasks to cheaper models.
- Test hosted endpoints with real traces to assess performance.
- Consider self-hosting for data residency or custom tuning needs.
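When replaying real traces against a hosted endpoint, the retry and human intervention rates can be tallied with something like the sketch below. The `TraceResult` record and its fields are hypothetical; adapt them to whatever your tracing system actually logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceResult:
    task_id: str
    attempts: int        # model calls until the output passed checks (at least 1)
    needed_human: bool   # a person had to step in to finish the task


def summarize(results: list[TraceResult]) -> dict[str, float]:
    """Aggregate retry and human intervention rates over replayed traces."""
    if not results:
        raise ValueError("no trace results to summarize")
    n = len(results)
    return {
        "retry_rate": sum(1 for r in results if r.attempts > 1) / n,
        "human_intervention_rate": sum(1 for r in results if r.needed_human) / n,
        "mean_attempts": sum(r.attempts for r in results) / n,
    }


# Illustrative data only.
results = [
    TraceResult("t1", attempts=1, needed_human=False),
    TraceResult("t2", attempts=1, needed_human=False),
    TraceResult("t3", attempts=3, needed_human=True),
    TraceResult("t4", attempts=2, needed_human=False),
]
summary = summarize(results)
print(summary)
```

Running the same traces through each candidate model and comparing these summaries side by side shows whether lower token prices survive the extra retries and supervision.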
Topics
- Open-Weight Models
- AI Agents
- LLM Inference Optimization
- Multimodal AI
- On-Device AI
- KV-Cache Quantization
Code references
- louisfb01/start-ai-engineering
- MoonshotAI/kimi-code
- tinyfish-io/bigset
- opencv/opencv
- Crosstalk-Solutions/project-nomad
Best for: CTO, VP of Engineering/Data, AI Architect, AI Engineer, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards AI - Medium.