Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm
Summary
Brick is a multimodal router for Mixture-of-Models (MoM) deployments. It targets two problems: query difficulty is hard to define, and frontier language models are expensive to run. Presented on 2026-06-11, Brick scores each model across six distinct capability dimensions, combines these scores with a per-query difficulty estimate, and dispatches requests using a cost-penalized geometric rule. A continuous preference knob lets operators move between maximum-quality and maximum-saving profiles at deployment. On a benchmark of 5,504 queries, Brick achieved 76.98% accuracy at its max-quality setting, surpassing the best single model's 75.02% and all other tested routers. At a neutral cost-quality profile, it delivered 74.11% accuracy with a 4.71x cost reduction compared to always using the strongest model. The router also reduced median latency from 51.2s to 22.8s.
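The reported figures can be turned into the deltas a reader usually wants. The short calculation below uses only numbers quoted in the summary; the variable names are illustrative, and the neutral-profile comparison assumes that "the strongest model" is the same model as the 75.02% best single model, which the summary does not state explicitly.

```python
# Figures quoted in the summary.
brick_max_quality_acc = 76.98   # % accuracy, max-quality setting
best_single_acc = 75.02         # % accuracy, best single model
brick_neutral_acc = 74.11       # % accuracy, neutral cost-quality profile
neutral_cost_reduction = 4.71   # x cheaper than always using the strongest model
latency_before_s = 51.2         # median latency before routing, seconds
latency_after_s = 22.8          # median latency with the router, seconds

# Accuracy gained over the best single model at the max-quality setting.
gain_at_max_quality = round(brick_max_quality_acc - best_single_acc, 2)

# Accuracy given up at the neutral profile (assumes the same baseline model).
drop_at_neutral = round(best_single_acc - brick_neutral_acc, 2)

# Fraction of the always-strongest cost that remains at the neutral profile.
neutral_cost_fraction = round(1 / neutral_cost_reduction, 3)

# Median latency speedup.
latency_speedup = round(latency_before_s / latency_after_s, 2)

print(gain_at_max_quality, drop_at_neutral, neutral_cost_fraction, latency_speedup)
```

Under that assumption, the neutral profile gives up a little under one accuracy point in exchange for running at roughly a fifth of the cost.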
Key takeaway
MLOps engineers deploying Mixture-of-Models systems should consider a capability-aware router such as Brick to reduce operational cost and latency. By assessing query difficulty and model capabilities for each request, a team can cut cost by up to 22.15x while keeping accuracy acceptable or, at the max-quality setting, gain 1.96 points of accuracy over the best single model. A continuous preference knob lets operators set the cost-quality balance at deployment.
Key insights
Brick optimizes Mixture-of-Models deployment by spatially routing queries based on model capabilities and query difficulty to balance cost and quality.
Principles
- LLM routing benefits from assessing within-domain query variance.
- Model dispatch can be optimized using cost-penalized geometric rules.
- Continuous preference controls enable dynamic cost-quality trade-offs.
Method
Brick scores models on six capability dimensions, estimates per-query difficulty, and dispatches requests using a cost-penalized geometric rule.
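The summary does not give Brick's actual scoring formula, so the following is only a minimal sketch of what a cost-penalized geometric rule could look like. Everything specific in it is an assumption made for illustration: the three models and their numbers, the unnamed six-element capability vectors, the "shortfall" distance, the linear blend controlled by `preference`, and the cost tie-break.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

NUM_DIMS = 6  # the source says six capability dimensions; it does not name them


@dataclass(frozen=True)
class Model:
    name: str
    capability: Tuple[float, ...]  # hypothetical scores in [0, 1], one per dimension
    cost: float                    # hypothetical relative cost per query


def shortfall(demand: Sequence[float], capability: Sequence[float]) -> float:
    """Euclidean length of the part of the demand the model cannot cover."""
    return math.sqrt(sum(max(0.0, d - c) ** 2 for d, c in zip(demand, capability)))


def route(models: Sequence[Model], profile: Sequence[float],
          difficulty: float, preference: float) -> Model:
    """Pick a model for one query (illustrative rule, not Brick's own).

    profile:    six weights in [0, 1] for the dimensions the query exercises.
    difficulty: per-query difficulty estimate in [0, 1].
    preference: knob in [0, 1]; 0 favors quality, 1 favors saving.
    """
    if len(profile) != NUM_DIMS or any(len(m.capability) != NUM_DIMS for m in models):
        raise ValueError("profile and capability vectors must have six entries")
    # Assumed demand point: the query profile scaled by its difficulty.
    demand = [difficulty * w for w in profile]
    max_cost = max(m.cost for m in models)

    def score(m: Model) -> float:
        quality_penalty = shortfall(demand, m.capability)
        cost_penalty = m.cost / max_cost  # normalized to (0, 1]
        return (1.0 - preference) * quality_penalty + preference * cost_penalty

    # Lowest blended penalty wins; ties go to the cheaper model.
    return min(models, key=lambda m: (score(m), m.cost))


# Hypothetical model pool, not taken from the source.
MODELS = [
    Model("small", (0.40,) * NUM_DIMS, cost=1.0),
    Model("medium", (0.70,) * NUM_DIMS, cost=5.0),
    Model("frontier", (0.95,) * NUM_DIMS, cost=20.0),
]

hard_profile = (1.0, 1.0, 0.0, 0.0, 0.0, 0.0)
for knob in (0.0, 0.5, 1.0):
    print(knob, route(MODELS, hard_profile, difficulty=0.9, preference=knob).name)
```

In this toy setup, a hard query is sent to the most capable model when the knob favors quality and moves to cheaper models as the knob shifts toward saving, while an easy query that every model can cover goes to the cheapest one even at the quality end.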
In practice
- Deploy a multimodal router to manage MoM costs.
- Configure the router's preference knob to move between the max-quality and max-saving profiles.
- Reduce inference latency by intelligently dispatching queries.
Topics
- Mixture-of-Models
- LLM Routing
- Cost Optimization
- Query Difficulty
- Multimodal AI
- Inference Latency
Best for: AI Architect, CTO, VP of Engineering/Data, MLOps Engineer, AI Engineer, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.