[AINews] New AI Infra decacorns: Fireworks, Baseten (with OpenRouter on the way)
Summary
Recent reports highlight a significant "Inference Inflection" in AI infrastructure, with Fireworks and Baseten reportedly nearing decacorn valuations of $15B and $11B, respectively. OpenRouter secured a $113M Series B and grew weekly volume from 5T to 25T tokens in six months, underscoring the demand for multi-model inference routing. Concurrently, AI agent development is shifting toward a "model + harness + eval loop" paradigm, with DeepSeek building harness teams and new benchmarks such as DeepSWE emerging for agentic coding. Research agents are demonstrating latent capabilities when given appropriate harnesses, while "Language Models Need Sleep" proposes a context consolidation phase for long-horizon memory. Other advancements include the AMUSE optimizer, MiniMax M3 sparse attention, and new vision models. Infrastructure concerns such as datacenter power and a potential inference compute crunch are also rising. On the local LLM side, Qwen 3.6 in particular is reported to handle agentic workflows well, and VRAM optimization techniques are getting attention.
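To put the OpenRouter figure in perspective, the jump from 5T to 25T weekly tokens can be converted into an implied compound growth rate. This is a back-of-the-envelope sketch only: it assumes steady compounding and treats six months as 26 weeks, neither of which the report states, and `implied_growth` is an illustrative helper name.

```python
def implied_growth(start: float, end: float, periods: int) -> float:
    """Constant per-period growth rate that turns `start` into `end`."""
    return (end / start) ** (1 / periods) - 1


# Reported weekly token volume: 5T at the start, 25T six months later.
START_TOKENS = 5e12
END_TOKENS = 25e12

# Assumption: six months is treated as 26 weeks (or 6 monthly periods).
weekly = implied_growth(START_TOKENS, END_TOKENS, 26)
monthly = implied_growth(START_TOKENS, END_TOKENS, 6)

if __name__ == "__main__":
    print(f"{weekly:.1%} per week, {monthly:.1%} per month")
```

Under these assumptions, a 5x increase works out to roughly 6% growth per week, or roughly 31% per month.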
Key takeaway
For Machine Learning Engineers building production AI systems, prioritize robust inference infrastructure and agentic harness development rather than focusing solely on base model strength. Your strategy should incorporate tools like OpenRouter for multi-model inference and consider techniques like context consolidation for long-horizon agents. Evaluate new benchmarks like DeepSWE for agentic coding, and tune local LLM deployments with options such as ik_llama.cpp or VRAM-saving display configurations to maximize throughput and resource efficiency.
Key insights
The AI landscape is rapidly maturing, shifting focus from raw model power to robust inference infrastructure and sophisticated agentic harnesses.
Principles
- Winning AI stacks integrate model, harness, and eval loops.
- Latent model capabilities require appropriate harnesses.
- Context consolidation improves long-horizon memory.
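The first two principles can be illustrated with a toy sketch. Everything here is hypothetical: `toy_model`, `bare_harness`, `tool_harness`, and `run_eval` are illustrative names, not part of any real library or of the systems mentioned in the article. The stand-in model can only emit a tool request, so it scores zero when its raw output is graded and scores fully once a harness executes that request.

```python
import operator
from typing import Callable, List, Tuple

Model = Callable[[str], str]
Harness = Callable[[Model, str], str]


def toy_model(task: str) -> str:
    """Hypothetical stand-in for a model call: it never answers directly
    and only emits a tool request for the arithmetic it is given."""
    return f"CALL calc({task})"


def bare_harness(model: Model, task: str) -> str:
    """Pass the raw model output straight through."""
    return model(task)


OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def tool_harness(model: Model, task: str) -> str:
    """Detect the tool request, run it, and return the tool result."""
    out = model(task)
    prefix, suffix = "CALL calc(", ")"
    if out.startswith(prefix) and out.endswith(suffix):
        left, op, right = out[len(prefix):-len(suffix)].split()
        return str(OPS[op](int(left), int(right)))
    return out


def run_eval(model: Model, harness: Harness,
             cases: List[Tuple[str, str]]) -> float:
    """Eval loop: fraction of cases whose harnessed output matches exactly."""
    hits = sum(harness(model, task) == expected for task, expected in cases)
    return hits / len(cases)


CASES = [("2 + 3", "5"), ("7 * 6", "42"), ("10 - 4", "6")]

if __name__ == "__main__":
    for name, harness in [("bare", bare_harness), ("tool", tool_harness)]:
        print(name, run_eval(toy_model, harness, CASES))
```

The model is identical in both runs; only the harness changes, which is the point of evaluating the stack as a whole instead of the model alone.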
Method
Agentic workflows can convert repeatable procedures into "skills" for tasks like DevOps or code generation, managed by a process spawning fresh-context sub-agents.
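One way to picture this pattern is a small orchestrator that stores procedures as named skills and starts a new sub-agent, with an empty context, for every task. This is a minimal sketch under stated assumptions: `Skill`, `SubAgent`, `Orchestrator`, and `echo_llm` are hypothetical names, and `echo_llm` is a stub standing in for a real model call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical LLM interface: takes the message history, returns a reply.
LLM = Callable[[List[str]], str]


@dataclass
class Skill:
    """A repeatable procedure captured as reusable instructions."""
    name: str
    instructions: str


@dataclass
class SubAgent:
    """Runs one task for one skill; its context starts empty."""
    skill: Skill
    context: List[str] = field(default_factory=list)

    def run(self, task: str, llm: LLM) -> str:
        self.context.append(f"SYSTEM: {self.skill.instructions}")
        self.context.append(f"USER: {task}")
        return llm(self.context)


class Orchestrator:
    """Managing process: keeps the skill registry and spawns sub-agents."""

    def __init__(self, llm: LLM) -> None:
        self.llm = llm
        self.skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self.skills[skill.name] = skill

    def dispatch(self, skill_name: str, task: str) -> str:
        # A new SubAgent per task means no history leaks between tasks.
        agent = SubAgent(skill=self.skills[skill_name])
        return agent.run(task, self.llm)


def echo_llm(messages: List[str]) -> str:
    """Stub model: reports how much context it was handed."""
    return f"seen {len(messages)} messages"


if __name__ == "__main__":
    orch = Orchestrator(echo_llm)
    orch.register(Skill("deploy", "Follow the deploy runbook step by step."))
    print(orch.dispatch("deploy", "Roll out the staging build"))
    print(orch.dispatch("deploy", "Roll out the production build"))
```

Because each dispatch builds its own `SubAgent`, the second task sees the same two messages as the first; the orchestrator holds the long-lived state and the workers stay disposable.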
In practice
- Use ik_llama.cpp for a reported 23% throughput gain on local LLMs.
- Free up VRAM on the discrete GPU by forcing the desktop environment to render on the iGPU.
- Employ prompt caching and code execution in local UIs.
Topics
- AI Infrastructure
- Inference Optimization
- AI Agents
- Large Language Models
- Benchmarking
- Datacenter Technology
- Quantization
Best for: Investor, CTO, VP of Engineering/Data, AI Scientist, Machine Learning Engineer, Director of AI/ML
Editorial summary, takeaway, and curation by AIssential. Original article published by Latent.Space (www.latent.space).