Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm

2026-04-24 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Cloud Computing & IT Infrastructure · Depth: Advanced, extended

Summary

Brick is a multimodal router designed for the Mixture-of-Models (MoM) paradigm, bridging heterogeneous LLM pools at inference time. Released in May 2026, it addresses the challenge of dispatching each query to the cheapest model that will answer it correctly, going beyond routers that rely on surface features of the query. Brick scores models on six capability dimensions, combines those scores with a per-query difficulty estimate, and dispatches via a cost-penalized geometric rule. On Dataset A (5,504 queries), Brick at the max-quality setting achieved 76.98% accuracy, surpassing kimi2.6 (75.02%) and external routers such as RouteLLM and FrugalGPT, while being 28% cheaper. Its continuous preference knob $r$ lets operators move between the max-quality and max-saving profiles; at the minimum-cost setting, costs fall by up to 22.15x. Median end-to-end latency also dropped from 51.2 s to 22.8 s.

Key takeaway

For AI architects designing cost-effective LLM inference systems, Brick shows what capability-aware routing can deliver in a Mixture-of-Models setup. Consider dispatching each query to the cheapest model that is able to answer it, which can reduce both cloud spend and latency. Use the continuous preference knob $r$ to set the quality-versus-spend balance, especially for agentic workloads, where single-step routing avoids compounding costs and delays.

Key insights

Spatial capability routing for Mixture-of-Models (MoM) optimizes LLM inference cost and quality by matching query needs to model skills.

Principles

Capability is multi-dimensional, not globally ranked.
Under-capacity and over-capacity carry asymmetric penalties.
Additive cost penalty enables explicit quality-cost trade-off.
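
One way to make the three principles above concrete is the scoring rule sketched below. It is an illustration under stated assumptions, not the formula from the paper: the symbols $d_k$, $a_{m,k}$, $\alpha$, $\beta$, $\lambda$, and $c_m$ are introduced here, and the choice $\alpha > \beta$ (falling short costs more than overshooting) is an assumption about the direction of the asymmetry.

```latex
% Illustrative sketch only; not the paper's rule.
% d_k(q)   : estimated demand of query q on capability dimension k (k = 1..6)
% a_{m,k}  : ability of model m on dimension k
% c_m      : normalized cost of model m
% \lambda(r) \ge 0 : cost weight controlled by the preference knob r
\[
g_{m,k}(q) =
\begin{cases}
\alpha \,\bigl(d_k(q) - a_{m,k}\bigr) & \text{if } a_{m,k} < d_k(q) \quad \text{(under-capacity)} \\
\beta \,\bigl(a_{m,k} - d_k(q)\bigr) & \text{otherwise} \quad \text{(over-capacity)}
\end{cases}
\qquad \alpha > \beta \ge 0
\]
\[
S_m(q) = \sqrt{\sum_{k=1}^{6} g_{m,k}(q)^2} \;+\; \lambda(r)\, c_m ,
\qquad
m^\ast(q) = \arg\min_m S_m(q)
\]
```

Because the gap is computed per dimension, no single global ranking of models is needed; because the cost term is added rather than multiplied, raising $\lambda(r)$ shifts the choice toward cheaper models without changing the capability geometry.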

Method

Brick uses a six-step pipeline: query truncation, keyword matching, ModernBERT capability classification, complexity estimation, per-model scoring via a cost-penalized geometric rule, and argmin selection.
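
The six steps above can be sketched as a small Python pipeline. This is a toy under stated assumptions rather than Brick's implementation: the dimension names, keyword lists, penalty weights, the length-based difficulty estimate, and every function name are hypothetical, and the ModernBERT classifier is replaced by a rule-based stand-in so the example runs without model weights.

```python
import math
from dataclasses import dataclass

# Placeholder names: the summary says there are six dimensions but does not list them.
DIMS = ("reasoning", "coding", "math", "knowledge", "language", "tools")

# Hypothetical keyword hints per dimension (step 2).
KEYWORDS = {
    "coding": ("python", "bug", "compile"),
    "math": ("prove", "integral", "equation"),
}


@dataclass(frozen=True)
class Model:
    name: str
    ability: tuple  # one score in [0, 1] per entry of DIMS
    cost: float     # normalized cost in [0, 1]


def truncate(query: str, max_chars: int = 2000) -> str:
    """Step 1: bound the text the router itself has to read."""
    return query[:max_chars]


def keyword_hits(query: str) -> set:
    """Step 2: cheap lexical hints about which dimensions matter."""
    text = query.lower()
    return {dim for dim, words in KEYWORDS.items() if any(w in text for w in words)}


def classify_capabilities(query: str, hints: set) -> list:
    """Step 3: stand-in for the ModernBERT classifier; one weight per dimension."""
    return [1.0 if dim in hints else 0.2 for dim in DIMS]


def estimate_complexity(query: str) -> float:
    """Step 4: toy difficulty estimate in [0, 1], based on word count only."""
    return min(1.0, len(query.split()) / 50)


def score(model: Model, weights: list, difficulty: float, r: float) -> float:
    """Step 5: asymmetric distance to the required point plus an additive cost term."""
    total = 0.0
    for w, ability in zip(weights, model.ability):
        gap = w * difficulty - ability
        # Assumed asymmetry: shortfall (gap > 0) is weighted 4.0 in the squared sum,
        # overshoot only 0.1.
        total += (4.0 if gap > 0 else 0.1) * gap * gap
    return math.sqrt(total) + r * model.cost


def route(query: str, pool: list, r: float) -> str:
    """Step 6: argmin of the score over the model pool."""
    q = truncate(query)
    weights = classify_capabilities(q, keyword_hits(q))
    difficulty = estimate_complexity(q)
    return min(pool, key=lambda m: score(m, weights, difficulty, r)).name


# Hypothetical two-model pool.
pool = [Model("small", (0.3,) * 6, 0.05), Model("large", (0.9,) * 6, 1.0)]
hard = "Fix the bug in this Python module " + "and explain each step " * 12

print(route(hard, pool, r=0.0))        # quality-leaning setting
print(route(hard, pool, r=1.0))        # saving-leaning setting
print(route("hi there", pool, r=0.0))  # short, easy query
```

With these made-up numbers, the hard coding query goes to the large model when `r` is 0 and to the small one when `r` is 1, while the short greeting goes to the small model even at `r = 0` because overshooting is also penalized.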

In practice

Deploy MoM to mix open-weight and commercial LLMs.
Use a preference knob to tune cost-quality trade-offs.
Prioritize single-step routing for agentic workloads.
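
To see why single-step routing matters for agentic workloads, the sketch below compares it with a try-cheap-then-escalate cascade over a multi-step agent run. All numbers and function names are hypothetical, and the model is deliberately simple: it assumes the router sends a step to the strong model exactly as often as the cascade would have escalated, and it ignores accuracy differences.

```python
def cascade_totals(steps, p_escalate, cheap, strong):
    """Expected (cost, latency) when every step calls the cheap model first
    and then escalates to the strong model with probability p_escalate."""
    cost = steps * (cheap[0] + p_escalate * strong[0])
    latency = steps * (cheap[1] + p_escalate * strong[1])
    return cost, latency


def single_step_totals(steps, p_strong, cheap, strong, router_latency):
    """Expected (cost, latency) when a router picks exactly one model per step."""
    cost = steps * ((1 - p_strong) * cheap[0] + p_strong * strong[0])
    latency = steps * (
        router_latency + (1 - p_strong) * cheap[1] + p_strong * strong[1]
    )
    return cost, latency


# Hypothetical per-call (cost, latency in seconds) for each model.
CHEAP = (1.0, 2.0)
STRONG = (10.0, 8.0)

cascade = cascade_totals(steps=10, p_escalate=0.3, cheap=CHEAP, strong=STRONG)
single = single_step_totals(
    steps=10, p_strong=0.3, cheap=CHEAP, strong=STRONG, router_latency=0.1
)
print("cascade     cost, latency:", cascade)
print("single-step cost, latency:", single)
```

Under these assumptions the cascade pays for a wasted cheap call on every escalated step, and that overhead accumulates with the number of agent steps, whereas the single-step router only adds its own small per-step latency.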

Topics

LLM Routing
Mixture of Models
Cost-aware Inference
Capability Classification
Open-weight Models
Agentic Workloads
Latency Optimization

Code references

regolo-ai/brick-SR1

Best for: CTO, VP of Engineering/Data, Director of AI/ML, MLOps Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.