Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation
Summary
CKA-QAD is a new method that addresses the degradation of internal representations during NVFP4 distillation of large language models (LLMs). Standard Quantization-Aware Distillation (QAD) recovers output accuracy, but it often reduces layerwise representational similarity between the quantized student and its teacher, particularly in RL-post-trained models, which creates bottlenecks in reasoning and coding tasks. CKA-QAD augments the QAD objective with a Centered Kernel Alignment (CKA) regularizer that explicitly preserves the geometric structure of intermediate activations. In experiments on Nemotron 3 Nano and Qwen3-4B-Thinking-2507, CKA-QAD raises average CKA from 0.958 to 0.994 and from 0.98 to 0.99, respectively. It also improves downstream reasoning and coding accuracy on benchmarks such as AIME25, GPQA-D, and LiveCodeBench-v5, at a cost of only 0.5% in step time and 7.0% in peak VRAM.
Key takeaway
For AI engineers deploying NVFP4 LLMs, relying solely on output-matching QAD risks degrading internal representations, which hurts reasoning and coding tasks. Integrating CKA-guided representational alignment into the distillation pipeline preserves internal geometry and improves accuracy on complex benchmarks such as AIME25 and LiveCodeBench-v5, with minimal training overhead. Consider CKA-QAD when robust low-bit LLM performance matters.
Key insights
Output-matching QAD can degrade internal LLM representations; CKA-guided alignment preserves geometry for better low-bit accuracy.
Principles
- Output-level alignment does not guarantee internal fidelity.
- Preserving internal geometry is vital for robust low-bit generalization.
- CKA offers invariance to common activation transformations.
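The invariance principle can be checked numerically. The sketch below is not taken from the paper: it uses the standard linear form of CKA and assumes NumPy, and the names `linear_cka`, `acts`, and `rotated_scaled` are illustrative. Linear CKA is known to be invariant to orthogonal transformations and isotropic scaling of the activations, so a rotated and rescaled copy should score close to 1 while unrelated activations score lower.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between activation matrices of shape (n_samples, n_features)."""
    # Center each feature over the sample dimension.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
acts = rng.normal(size=(64, 16))            # stand-in for one layer's activations

# Random orthogonal matrix from a QR decomposition, plus an isotropic scale.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
rotated_scaled = 3.0 * acts @ q

unrelated = rng.normal(size=(64, 16))       # independent activations

cka_same = linear_cka(acts, rotated_scaled)
cka_unrelated = linear_cka(acts, unrelated)
print(f"rotated and scaled copy: {cka_same:.6f}")
print(f"unrelated activations:   {cka_unrelated:.6f}")
```

Note that this invariance does not extend to arbitrary invertible linear maps, so CKA still distinguishes representations whose geometry has actually changed.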
Method
CKA-QAD augments standard QAD by adding a layerwise CKA regularizer to align intermediate activations' Gram matrices. It uses top-k logit distillation and dynamically balances the CKA term with the KL loss.
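A minimal sketch of how such an objective might be assembled, assuming NumPy and a Gram-matrix form of linear CKA. The summary does not give the paper's exact formulation, so every detail here is an assumption: the function names (`gram_cka`, `cka_loss`, `topk_kl`), the `1 - CKA` loss averaged over layers, the renormalized softmax over the teacher's top-k indices, and the fixed weight `lam`.

```python
import numpy as np

def gram_cka(x, y):
    """CKA computed from centered Gram matrices of (n_tokens, n_features) inputs."""
    n = x.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    k = h @ (x @ x.T) @ h                    # centered teacher Gram matrix
    l = h @ (y @ y.T) @ h                    # centered student Gram matrix
    # tr(K L) equals the elementwise sum because both matrices are symmetric.
    return float(np.sum(k * l) / (np.linalg.norm(k) * np.linalg.norm(l)))

def cka_loss(teacher_layers, student_layers):
    """Assumed regularizer: mean of (1 - CKA) over aligned layers."""
    return float(np.mean([1.0 - gram_cka(t, s)
                          for t, s in zip(teacher_layers, student_layers)]))

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def topk_kl(teacher_logits, student_logits, k):
    """Assumed top-k distillation: KL(teacher || student) over the teacher's top-k tokens."""
    idx = np.argsort(teacher_logits, axis=-1)[:, -k:]
    t = log_softmax(np.take_along_axis(teacher_logits, idx, axis=-1))
    s = log_softmax(np.take_along_axis(student_logits, idx, axis=-1))
    return float(np.mean(np.sum(np.exp(t) * (t - s), axis=-1)))

rng = np.random.default_rng(0)
# Synthetic stand-ins: three layers of activations and a batch of logits.
teacher_layers = [rng.normal(size=(32, 8)) for _ in range(3)]
student_layers = [t + 0.3 * rng.normal(size=t.shape) for t in teacher_layers]
teacher_logits = rng.normal(size=(32, 100))
student_logits = teacher_logits + 0.5 * rng.normal(size=teacher_logits.shape)

lam = 1.0  # hypothetical fixed weight; the method balances this term dynamically
kl_term = topk_kl(teacher_logits, student_logits, k=10)
cka_term = cka_loss(teacher_layers, student_layers)
total = kl_term + lam * cka_term
print(f"KL={kl_term:.4f}  CKA loss={cka_term:.4f}  total={total:.4f}")
```

In an actual training loop these terms would be computed with an autodiff framework on the student's quantized forward pass; NumPy is used here only to keep the arithmetic visible.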
In practice
- Integrate CKA regularization into QAD pipelines.
- Target decoder block outputs for CKA alignment.
- Implement dynamic loss balancing for the CKA term.
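One way the dynamic balancing step could look is sketched below. The summary does not describe the paper's balancing rule, so this is a hypothetical scheme: the class `DynamicBalancer`, its `target_ratio` and `momentum` parameters, and the idea of keeping the weighted CKA term at a fixed fraction of the KL loss via exponential moving averages are all assumptions made for illustration.

```python
class DynamicBalancer:
    """Hypothetical balancer: keeps lam * CKA loss near target_ratio * KL loss."""

    def __init__(self, target_ratio=0.1, momentum=0.9, eps=1e-8):
        self.target_ratio = target_ratio
        self.momentum = momentum
        self.eps = eps
        self.kl_ema = None
        self.cka_ema = None

    def _update(self, ema, value):
        # Initialize on the first call, then apply an exponential moving average.
        if ema is None:
            return value
        return self.momentum * ema + (1.0 - self.momentum) * value

    def weight(self, kl_value, cka_value):
        """Return the weight for the CKA term given the current loss magnitudes."""
        self.kl_ema = self._update(self.kl_ema, kl_value)
        self.cka_ema = self._update(self.cka_ema, cka_value)
        return self.target_ratio * self.kl_ema / (self.cka_ema + self.eps)


balancer = DynamicBalancer(target_ratio=0.1)
history = []
# Synthetic (KL, CKA loss) pairs standing in for three training steps.
for kl, cka in [(2.0, 0.04), (1.5, 0.03), (1.0, 0.02)]:
    lam = balancer.weight(kl, cka)
    # In a real training loop lam would be treated as a constant (no gradient).
    history.append((lam, kl + lam * cka))

for step, (lam, total) in enumerate(history):
    print(f"step {step}: lam={lam:.3f} total={total:.3f}")
```

The design choice here is to derive the weight from smoothed loss magnitudes rather than from a hand-tuned constant, so the regularizer neither vanishes nor dominates as the KL loss shrinks during training.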
Topics
- NVFP4 Quantization
- Knowledge Distillation
- Centered Kernel Alignment
- Large Language Models
- Model Compression
- Representational Similarity
Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.