When Model Merging Breaks Routing: Training-Free Calibration for MoE

2026-06-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Model merging is a cost-effective way to consolidate LLM capabilities, but it faces a critical challenge when applied to Mixture-of-Experts (MoE) architectures. Existing merging techniques, whether based on linear parameter arithmetic or on optimization, can cause "routing breakdown," where the merged router fails to dispatch tokens to suitable experts. The issue arises because MoE routing combines a non-linear softmax with a discrete Top-k selection, both of which are sensitive to parameter perturbations, and load-balancing constraints applied during pretraining make this worse. Because experts have distinct specializations, even minor misrouting significantly degrades performance. To counter this, the researchers propose Hessian-Aware Router Calibration (HARC), a training-free framework. HARC uses second-order curvature information to realign the merged router and admits a closed-form solution that is solved efficiently with a matrix-free conjugate gradient method. Experiments on mathematical reasoning and code generation tasks show that HARC mitigates routing breakdown and yields substantial performance improvements across various MoE merging baselines.

Key takeaway

Machine learning engineers merging Mixture-of-Experts (MoE) models should anticipate and address routing breakdown: existing merging techniques often degrade performance by misrouting tokens. Hessian-Aware Router Calibration (HARC) realigns the merged router with a training-free, curvature-aware procedure, so no costly retraining is needed. The source reports substantial improvements on mathematical reasoning and code generation across several MoE merging baselines.

Key insights

MoE model merging often fails due to routing breakdown; HARC offers a training-free, curvature-aware calibration to fix it.

Principles

MoE routing is highly sensitive to parameter perturbations.
Load-balancing constraints amplify merging challenges.
Second-order curvature data can realign routers.
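To make the first principle concrete, here is a minimal NumPy sketch of how a small change to router weights can alter Top-k expert selection. It is a toy illustration, not the paper's experimental setup: the dimensions, the noise scale, the simple weight average used as the "merge," and the helper names `top_k_experts` and `misrouting_rate` are all assumptions made for this example.

```python
import numpy as np

def top_k_experts(router_weights, tokens, k):
    """Return the sorted indices of the k highest-scoring experts for each token."""
    logits = tokens @ router_weights.T  # shape: (n_tokens, n_experts)
    # Softmax is monotonic, so Top-k over logits matches Top-k over softmax probabilities.
    top = np.argpartition(-logits, k - 1, axis=1)[:, :k]
    return np.sort(top, axis=1)

def misrouting_rate(reference_router, merged_router, tokens, k=2):
    """Fraction of tokens whose Top-k expert set differs between the two routers."""
    ref = top_k_experts(reference_router, tokens, k)
    new = top_k_experts(merged_router, tokens, k)
    return float(np.mean(np.any(ref != new, axis=1)))

# Toy data (assumed): random tokens and two routers that differ by moderate noise.
rng = np.random.default_rng(0)
d_model, n_experts, n_tokens = 64, 8, 2000
tokens = rng.standard_normal((n_tokens, d_model))
router_a = rng.standard_normal((n_experts, d_model))
router_b = router_a + 0.3 * rng.standard_normal((n_experts, d_model))

# Plain weight averaging stands in for "linear parameter arithmetic".
merged = 0.5 * (router_a + router_b)

rate = misrouting_rate(router_a, merged, tokens)
print(f"tokens routed differently after merging: {rate:.1%}")
```

Because Top-k is a discrete choice, a token sitting near the boundary between two experts flips as soon as the perturbation exceeds the logit gap, which is why a metric like this can serve as a rough diagnostic before and after any calibration step.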

Method

Hessian-Aware Router Calibration (HARC) uses second-order curvature information to realign merged MoE routers. It provides a closed-form solution, efficiently solvable with a matrix-free conjugate gradient method.
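The "matrix-free conjugate gradient" part of the method can be sketched generically. The summary does not state HARC's objective or the exact linear system it solves, so the curvature operator below (a damped Gram matrix built from hypothetical router inputs `X`) and the right-hand side `b` are stand-in assumptions; only the conjugate gradient routine itself is the standard algorithm. The point of the sketch is that the solver needs only a function computing curvature-vector products and never stores the curvature matrix.

```python
import numpy as np

def conjugate_gradient(matvec, b, tol=1e-10, max_iter=500):
    """Solve A x = b for symmetric positive-definite A, given only v -> A v."""
    x = np.zeros_like(b)
    r = b - matvec(x)   # initial residual
    p = r.copy()        # initial search direction
    rs_old = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs_old) < tol:
            break
        Ap = matvec(p)
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Assumed stand-in for curvature: H = X^T X / n + damping * I, never formed explicitly.
rng = np.random.default_rng(0)
n_samples, dim, damping = 256, 32, 1e-2
X = rng.standard_normal((n_samples, dim))  # hypothetical router inputs
b = rng.standard_normal(dim)               # hypothetical right-hand side

def curvature_vector_product(v):
    """Compute H v using two matrix-vector products instead of a dim x dim matrix."""
    return X.T @ (X @ v) / n_samples + damping * v

delta = conjugate_gradient(curvature_vector_product, b)
print("residual norm:", np.linalg.norm(curvature_vector_product(delta) - b))
```

For router-sized problems, the appeal of this design is memory: each iteration costs a couple of matrix-vector products, whereas forming and factorizing the full curvature matrix scales quadratically in storage with the number of parameters.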

In practice

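Apply HARC after merging MoE checkpoints to realign the merged router without retraining.
Expect gains on mathematical reasoning, one of the two task families evaluated.
Expect gains on code generation, the other evaluated task family; the summary reports no results for other tasks.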

Topics

Model Merging
Mixture-of-Experts
Routing Breakdown
Hessian-Aware Router Calibration
LLM Optimization
Code Generation
Mathematical Reasoning

Code references

huangcb01/HARC

Best for: Research Scientist, AI Engineer, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.