[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)

2026-05-27 · Source: Latent.Space - Www.latent.space · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Robotics & Autonomous Systems · Depth: Expert, long

Summary

Recent reports point to an "Inference Inflection" in AI infrastructure: Fireworks and Baseten are reportedly nearing decacorn valuations of $15B and $11B, respectively. OpenRouter secured a $113M Series B and grew weekly volume from 5T to 25T tokens in six months, underscoring the demand for multi-model inference routing. Concurrently, AI agent development is shifting towards a "model + harness + eval loop" paradigm, with DeepSeek building harness teams and new benchmarks like DeepSWE emerging for agentic coding. Research agents show that capabilities already latent in a model surface once it is given an appropriate harness, while "Language Models Need Sleep" proposes a context consolidation phase for long-horizon memory. Other advancements include the AMUSE optimizer, MiniMax M3 sparse attention, and new vision models. Infrastructure concerns are also rising, including datacenter power and a potential inference compute crunch. Local LLM reports, particularly for Qwen 3.6, show strong agentic workflows alongside VRAM optimization techniques.

Key takeaway

Machine Learning Engineers building production AI systems should prioritize robust inference infrastructure and agentic harness development rather than focusing solely on base model strength. Incorporate tools like OpenRouter for multi-model inference, and consider techniques like context consolidation for long-horizon agents. Evaluate new benchmarks like DeepSWE for agentic coding, and optimize local LLM deployments with ik_llama.cpp or VRAM-saving display configurations to maximize throughput and resource efficiency.

Key insights

The AI landscape is rapidly maturing, shifting focus from raw model power to robust inference infrastructure and sophisticated agentic harnesses.

Principles

Winning AI stacks integrate model, harness, and eval loops.
Latent model capabilities require appropriate harnesses.
Context consolidation improves long-horizon memory.
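
As a rough illustration of the first two principles, the sketch below runs the same stand-in model through two harnesses and scores each with a small eval loop. Everything in it is hypothetical and invented for illustration: `toy_model`, `bare_harness`, `scaffold_harness`, `run_eval`, and the two test cases do not come from the original article, and the "model" is a plain function, not an LLM.

```python
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Harness = Callable[[Model, str], str]


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM.

    It can add two integers, but only when the prompt contains the
    phrase "step by step". This mimics a capability that stays latent
    until the harness elicits it.
    """
    if "step by step" in prompt:
        expr = prompt.split(":")[-1].strip()  # e.g. "2+3"
        left, right = expr.split("+")
        return str(int(left) + int(right))
    return "unsure"


def bare_harness(model: Model, task: str) -> str:
    """Pass the task to the model unchanged."""
    return model(task)


def scaffold_harness(model: Model, task: str) -> str:
    """Wrap the task in a prompt that elicits the latent capability."""
    return model(f"Think step by step: {task}")


def run_eval(model: Model, harness: Harness, cases: List[Tuple[str, str]]) -> float:
    """Eval loop: fraction of cases where the harness output matches exactly."""
    passed = sum(1 for task, expected in cases if harness(model, task) == expected)
    return passed / len(cases)


CASES = [("2+3", "5"), ("10+7", "17")]

if __name__ == "__main__":
    print("bare:", run_eval(toy_model, bare_harness, CASES))
    print("scaffold:", run_eval(toy_model, scaffold_harness, CASES))
```

The model is identical in both runs; only the harness changes, and the eval loop is what makes the difference measurable.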

Method

Agentic workflows can convert repeatable procedures into "skills" for tasks like DevOps or code generation, managed by a process spawning fresh-context sub-agents.
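
One way to picture this method is a registry of named skills plus a dispatcher that creates a new sub-agent, with an empty context, for every task. The sketch below is a minimal illustration under that assumption. The names `SubAgent`, `Orchestrator`, `register_skill`, `dispatch`, and the `restart_service` skill are hypothetical, and the "procedure" is an ordinary Python function standing in for whatever the agent would actually execute.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Procedure = Callable[[str], str]


@dataclass
class SubAgent:
    """Hypothetical worker whose context starts empty for each task."""
    skill_name: str
    context: List[str] = field(default_factory=list)

    def run(self, procedure: Procedure, task: str) -> str:
        # Record only what belongs to this one task.
        self.context.append(f"task: {task}")
        result = procedure(task)
        self.context.append(f"result: {result}")
        return result


class Orchestrator:
    """Managing process: stores skills and spawns one sub-agent per task."""

    def __init__(self) -> None:
        self.skills: Dict[str, Procedure] = {}

    def register_skill(self, name: str, procedure: Procedure) -> None:
        """Turn a repeatable procedure into a named, reusable skill."""
        self.skills[name] = procedure

    def dispatch(self, name: str, task: str) -> SubAgent:
        """Run a skill in a brand-new sub-agent and return it for inspection."""
        agent = SubAgent(skill_name=name)  # fresh context, nothing inherited
        agent.run(self.skills[name], task)
        return agent


orch = Orchestrator()
orch.register_skill("restart_service", lambda svc: f"restarted {svc}")

first = orch.dispatch("restart_service", "api")
second = orch.dispatch("restart_service", "worker")
```

Because each call to `dispatch` constructs a new `SubAgent`, the second task sees none of the first task's history, which is the point of spawning fresh-context sub-agents.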

In practice

Use ik_llama.cpp for a reported 23% throughput gain on local LLMs.
Free up VRAM by forcing the desktop environment to render on the iGPU.
Employ prompt caching and code execution in local UIs.
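
To show why prompt caching helps, the sketch below counts how many tokens would need fresh processing when consecutive requests share a prefix such as a system prompt. It is a toy model of the idea only: `PrefixCache` and `tokens_to_process` are hypothetical names, whitespace splitting stands in for a real tokenizer, and an actual runtime would reuse cached attention state instead of comparing token lists.

```python
from typing import List


class PrefixCache:
    """Toy prompt cache: remembers the last prompt's tokens and reports
    how many tokens of a new prompt fall outside the shared prefix."""

    def __init__(self) -> None:
        self.cached_tokens: List[str] = []

    def tokens_to_process(self, tokens: List[str]) -> int:
        # Count the leading tokens that match the cached prompt.
        shared = 0
        for old, new in zip(self.cached_tokens, tokens):
            if old != new:
                break
            shared += 1
        # The new prompt becomes the cached one for the next request.
        self.cached_tokens = list(tokens)
        return len(tokens) - shared


# Assumed 7-token system prompt (whitespace "tokenizer").
system = "You are a helpful coding assistant .".split()

cache = PrefixCache()
cold = cache.tokens_to_process(system + ["hello"])           # nothing cached yet
warm = cache.tokens_to_process(system + ["hello", "again"])  # only the new turn
branch = cache.tokens_to_process(system + ["bye"])           # system prompt reused
```

In this toy run the first request processes all 8 tokens, while each later request processes only 1, because the shared prefix is already cached.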

Topics

AI Infrastructure
Inference Optimization
AI Agents
Large Language Models
Benchmarking
Datacenter Technology
Quantization

Best for: Investor, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space - Www.latent.space.