E-PMQ: Expert-Guided Post-Merge Quantization with Merged-Weight Anchoring

Source: cs.CL updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Advanced, extended

Summary

E-PMQ is an expert-guided Post-Merge Quantization (PMQ) framework for deploying several task- or domain-specialized experts as a single low-bit model. Applying post-training quantization (PTQ) directly to a merged model is unreliable because two deviations are coupled: the quantization error relative to the merged weights, and the merged model's deviation from each source expert. E-PMQ mitigates this by using the source expert weights to produce expert-guided output targets during layer-wise calibration, and by anchoring the quantized weights to the merged weights so that calibration stays stable and the merged model's integrated behavior is preserved. In the reported experiments, E-PMQ raises 4-bit GPTQ accuracy on CLIP-ViT-B/32 from 65.0% to 73.6% under Task Arithmetic and from 69.1% to 74.8% under TIES-Merging. It also reports gains on 20-task CLIP-ViT-L/14 (34.8% to 76.7%), on FLAN-T5-base GLUE (78.26% to 83.34%), and on Llama-3.1 models, indicating that the approach carries across models, modalities, and task scales.
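The coupling described above can be made concrete with a small numerical sketch. For any quantized weight $Q$, merged weight $W_m$, and expert weight $W_i$, the output gap to the expert splits exactly into a quantization term and a merging term: $QX - W_iX = (Q - W_m)X + (W_m - W_i)X$. The code below checks that identity on random matrices. Everything in it is an assumption made for illustration: the toy shapes, the two synthetic "experts", the Task Arithmetic scaling `alpha`, and the helper `quantize_rtn`, which uses plain round-to-nearest as a stand-in for GPTQ.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_rtn(w, bits=4):
    """Symmetric per-row round-to-nearest quantization (a simple stand-in for GPTQ)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

d_out, d_in, n = 8, 16, 32
w_base = rng.normal(size=(d_out, d_in))
# Two hypothetical experts fine-tuned from the same base weights.
experts = [w_base + 0.1 * rng.normal(size=(d_out, d_in)) for _ in range(2)]

# Task Arithmetic: add the scaled sum of task vectors to the base weights.
alpha = 0.5
w_merged = w_base + alpha * sum(w - w_base for w in experts)
w_quant = quantize_rtn(w_merged, bits=4)

x = rng.normal(size=(d_in, n))            # toy calibration inputs for expert 0
total_dev = w_quant @ x - experts[0] @ x  # gap between low-bit output and expert output
quant_dev = (w_quant - w_merged) @ x      # part caused by quantization
merge_dev = (w_merged - experts[0]) @ x   # part caused by merging
decomposition_holds = np.allclose(total_dev, quant_dev + merge_dev)
```

Standard PTQ calibrates only against the merged model's outputs, so it can shrink `quant_dev` but never sees `merge_dev`; calibrating against expert outputs exposes both terms to the solver.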

Key takeaway

For AI engineers and research scientists deploying merged models under tight resource constraints, E-PMQ offers a way to improve low-bit model quality. Consider integrating it into post-merge quantization pipelines, especially at aggressive low-bit settings, to limit the degradation caused by coupled merging and quantization deviations. The expert-guided calibration needs the source expert weights, so keep them accessible through the pre-deployment quantization stage.

Key insights

E-PMQ improves post-merge model quantization by guiding calibration with source expert weights and stabilizing it with merged-weight anchoring.

Method

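E-PMQ quantizes the model layer by layer. For each layer $\ell$, the source expert weights $W_{i}^{\ell}$ applied to the layer inputs $X_{i}^{\ell}$ give the expert-guided output targets $Y_{i}^{\ell}=W_{i}^{\ell}X_{i}^{\ell}$, and a merged-weight anchor $\lambda^{\ell}\|Q^{\ell}-W_{m}^{\ell}\|_{F}^{2}$ keeps the quantized weights $Q^{\ell}$ close to the merged weights $W_{m}^{\ell}$. The resulting objective is solved with a GPTQ-style sequential rounding solver.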
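To see how the two terms interact, here is a minimal sketch for a single layer. It assumes the terms combine as $\sum_i \|Q X_i - Y_i\|_F^2 + \lambda \|Q - W_m\|_F^2$; the summary gives the two ingredients but not the full objective, so that form is an assumption. The sketch also omits the rounding step entirely and solves only the continuous relaxation, whose minimizer satisfies $Q\,(\sum_i X_i X_i^{\top} + \lambda I) = \sum_i W_i X_i X_i^{\top} + \lambda W_m$. The function names `epmq_objective` and `relaxed_minimizer`, the toy shapes, and the simple-average merge are all hypothetical.

```python
import numpy as np

def epmq_objective(q, experts, inputs, w_merged, lam):
    """Assumed per-layer loss: expert-guided output fit plus merged-weight anchor."""
    fit = sum(np.linalg.norm(q @ x - w @ x) ** 2 for w, x in zip(experts, inputs))
    anchor = lam * np.linalg.norm(q - w_merged) ** 2
    return fit + anchor

def relaxed_minimizer(experts, inputs, w_merged, lam):
    """Closed-form minimizer of the loss above when q is NOT restricted to a grid."""
    d_in = w_merged.shape[1]
    h = lam * np.eye(d_in)      # accumulates sum_i X_i X_i^T + lam * I
    b = lam * w_merged          # accumulates sum_i W_i X_i X_i^T + lam * W_m
    for w, x in zip(experts, inputs):
        gram = x @ x.T
        h = h + gram
        b = b + w @ gram
    # Solve q @ h = b; h is symmetric, so this equals solving h @ q.T = b.T.
    return np.linalg.solve(h, b.T).T

rng = np.random.default_rng(0)
d_out, d_in, n = 4, 6, 20
experts = [rng.normal(size=(d_out, d_in)) for _ in range(2)]
inputs = [rng.normal(size=(d_in, n)) for _ in range(2)]
w_merged = sum(experts) / len(experts)  # simple average as a stand-in merge

q_star = relaxed_minimizer(experts, inputs, w_merged, lam=1.0)
loss_star = epmq_objective(q_star, experts, inputs, w_merged, lam=1.0)
loss_merged = epmq_objective(w_merged, experts, inputs, w_merged, lam=1.0)

# A very large anchor weight pulls the solution back onto the merged weights.
q_anchored = relaxed_minimizer(experts, inputs, w_merged, lam=1e8)
```

In this sketch the matrix `h` has the same shape and role as the damped input Gram matrix that GPTQ uses as its layer Hessian, which suggests why a GPTQ-style sequential rounding solver is a natural fit: the anchor behaves like extra diagonal damping, and the expert targets change the point the rounding procedure tries to stay close to.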

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.CL updates on arXiv.org.