Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm
Summary
Brick is a multimodal router for the Mixture-of-Models (MoM) paradigm that dispatches queries across a heterogeneous pool of LLMs at inference time. Released in May 2026, it targets a specific problem: sending each query to the cheapest model that will still answer it correctly, rather than routing on surface features of the query. Brick scores each model on six capability dimensions, combines those scores with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. On Dataset A (5,504 queries), Brick at its max-quality setting reached 76.98% accuracy, surpassing kimi2.6 (75.02%) and external routers such as RouteLLM and FrugalGPT, while being 28% cheaper. A continuous preference knob $r$ lets operators move between the max-quality and max-saving profiles, cutting costs by up to 22.15x at min-cost. Median end-to-end latency also dropped from 51.2 s to 22.8 s.
Key takeaway
For AI architects designing cost-effective LLM inference systems, Brick's Mixture-of-Models approach is worth evaluating. Capability-aware routing dispatches each query to the cheapest suitable model, which can reduce both cloud spend and latency. The continuous preference knob lets you tune the quality-vs-spend balance, which matters most for agentic workloads, where single-step routing avoids compounding costs and delays.
Key insights
Spatial capability routing for Mixture-of-Models (MoM) optimizes LLM inference cost and quality by matching query needs to model skills.
Principles
- Capability is multi-dimensional, not globally ranked.
- Under-capacity and over-capacity are penalized asymmetrically.
- An additive cost penalty makes the quality-cost trade-off explicit.
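The three principles above can be illustrated with a minimal sketch. It is not Brick's actual scoring rule, which this summary does not spell out. The weighted Euclidean distance, the `under_weight` and `over_weight` values, the choice to penalize under-capacity more heavily, the convention that `r = 0` means max-quality, and the toy model table are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def asymmetric_distance(
    required: Sequence[float],
    capability: Sequence[float],
    under_weight: float = 4.0,  # assumed: a model that is too weak is penalized more
    over_weight: float = 1.0,   # assumed: a model that is too strong is penalized less
) -> float:
    """Weighted Euclidean distance between a query's needs and a model's skills."""
    total = 0.0
    for need, have in zip(required, capability):
        gap = need - have  # positive gap means the model falls short on this dimension
        weight = under_weight if gap > 0 else over_weight
        total += weight * gap * gap
    return math.sqrt(total)


def route(required: Sequence[float], models: List[dict], r: float) -> str:
    """Pick the model with the lowest distance plus additive cost penalty."""
    def score(model: dict) -> float:
        return asymmetric_distance(required, model["capability"]) + r * model["cost"]

    return min(models, key=score)["name"]


# Hypothetical two-model pool with six capability dimensions and normalized cost.
MODELS = [
    {"name": "small", "capability": [0.4] * 6, "cost": 0.1},
    {"name": "large", "capability": [0.9] * 6, "cost": 1.0},
]

if __name__ == "__main__":
    query_needs = [0.6] * 6  # a mid-difficulty query on every dimension
    for r in (0.0, 0.5, 1.0):
        print(r, route(query_needs, MODELS, r))
```

With these toy numbers, the mid-difficulty query goes to `large` when `r = 0` and to `small` when `r = 1`, because the additive cost term eventually outweighs the under-capacity penalty.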
Method
Brick uses a six-step pipeline: query truncation, keyword matching, ModernBERT capability classification, complexity estimation, per-model scoring via a cost-penalized geometric rule, and argmin selection.
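The six steps can be laid out as a skeleton to show how they might fit together. This is a sketch under stated assumptions, not Brick's implementation: the truncation budget, the keyword table, the way keyword hits and classifier outputs are merged, and the scoring weights are hypothetical, and `stub_classifier` and `stub_complexity` are placeholders standing in for the ModernBERT classifier and the complexity estimator.

```python
from typing import Callable, Dict, List

NUM_DIMS = 6                        # six capability dimensions (names not given in the summary)
MAX_CHARS = 2000                    # hypothetical truncation budget
KEYWORDS = {"prove": 0, "sql": 1}   # hypothetical keyword -> dimension index


def truncate(query: str) -> str:
    return query[:MAX_CHARS]


def keyword_hits(query: str) -> List[float]:
    hits = [0.0] * NUM_DIMS
    lowered = query.lower()
    for word, dim in KEYWORDS.items():
        if word in lowered:
            hits[dim] = 1.0
    return hits


def stub_classifier(query: str) -> List[float]:
    # Placeholder for the ModernBERT capability classifier.
    return [0.5] * NUM_DIMS


def stub_complexity(query: str) -> float:
    # Placeholder difficulty estimate in [0, 1] based on word count.
    return min(1.0, len(query.split()) / 50)


def score(required: List[float], capability: List[float], cost: float, r: float) -> float:
    # Assumed rule: under-capacity weighted 4x, plus an additive cost penalty.
    gaps = [need - have for need, have in zip(required, capability)]
    distance = sum((4.0 if g > 0 else 1.0) * g * g for g in gaps) ** 0.5
    return distance + r * cost


def route(
    query: str,
    models: Dict[str, dict],
    r: float = 0.0,
    classify: Callable[[str], List[float]] = stub_classifier,
    estimate: Callable[[str], float] = stub_complexity,
) -> str:
    text = truncate(query)                                # 1. query truncation
    hits = keyword_hits(text)                             # 2. keyword matching
    probs = classify(text)                                # 3. capability classification
    difficulty = estimate(text)                           # 4. complexity estimation
    required = [max(h, p) * difficulty for h, p in zip(hits, probs)]
    scores = {                                            # 5. per-model scoring
        name: score(required, spec["capability"], spec["cost"], r)
        for name, spec in models.items()
    }
    return min(scores, key=scores.get)                    # 6. argmin selection


# Hypothetical model pool.
MODELS = {
    "small": {"capability": [0.3] * NUM_DIMS, "cost": 0.1},
    "large": {"capability": [0.9] * NUM_DIMS, "cost": 1.0},
}
```

Passing the classifier and estimator as callables keeps the skeleton runnable without model weights, and a real classifier could be swapped in behind the same signature.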
In practice
- Deploy MoM to mix open-weight and commercial LLMs.
- Use a preference knob to tune cost-quality trade-offs.
- Prioritize single-step routing for agentic workloads.
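To see why routing overhead matters for agentic workloads, consider a back-of-the-envelope comparison. It assumes that "single-step routing" means choosing one model up front for each agent step, in contrast to a cascade that always tries a cheap model first and escalates on failure. The summary does not define the term, so this reading, along with all latencies, the escalation rate, and the router overhead below, is hypothetical.

```python
def cascade_total(steps: int, cheap: float, strong: float, escalate_p: float) -> float:
    """Expected total latency when every step tries the cheap model first."""
    per_step = cheap + escalate_p * strong
    return steps * per_step


def single_step_total(
    steps: int, cheap: float, strong: float, strong_p: float, router_overhead: float
) -> float:
    """Expected total latency when a router picks exactly one model per step."""
    per_step = (1 - strong_p) * cheap + strong_p * strong + router_overhead
    return steps * per_step


if __name__ == "__main__":
    # Hypothetical numbers: 20 agent steps, 2 s cheap model, 10 s strong model,
    # 40% of steps need the strong model, 0.05 s router overhead per step.
    print(cascade_total(20, 2.0, 10.0, 0.4))
    print(single_step_total(20, 2.0, 10.0, 0.4, 0.05))
```

Under these assumptions the per-step gap is `escalate_p * cheap - router_overhead`, and it is paid on every step, so it grows linearly with the length of the agent trajectory.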
Topics
- LLM Routing
- Mixture of Models
- Cost-aware Inference
- Capability Classification
- Open-weight Models
- Agentic Workloads
- Latency Optimization
Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.