Beyond Output Matching: Preserving Internal Geometry in NVFP4 LLM Distillation

2026-06-06 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new method, CKA-QAD, addresses the degradation of internal representations in NVFP4 large language model (LLM) distillation. Standard Quantization-Aware Distillation (QAD) recovers output accuracy, but it often reduces layerwise representational similarity between the quantized student and its teacher, particularly in RL-post-trained models, which creates bottlenecks in reasoning and coding tasks. CKA-QAD augments the QAD objective with a Centered Kernel Alignment (CKA) regularizer that explicitly preserves the geometric structure of intermediate activations. Experiments on Nemotron 3 Nano and Qwen3-4B-Thinking-2507 show that CKA-QAD improves representational alignment, raising average CKA from 0.958 to 0.994 on Nemotron 3 Nano and from 0.98 to 0.99 on Qwen3-4B-Thinking-2507. The method also improves downstream reasoning and coding accuracy on benchmarks such as AIME25, GPQA-D, and LiveCodeBench-v5, while adding only 0.5% to step time and 7.0% to peak VRAM.

Key takeaway

For AI engineers deploying NVFP4 LLMs, relying solely on output-matching QAD risks degrading internal representations, which hurts reasoning and coding tasks. Adding CKA-guided representational alignment to the distillation pipeline preserves internal geometry and improves accuracy on complex benchmarks such as AIME25 and LiveCodeBench-v5, with minimal training overhead.

Key insights

Output-matching QAD can degrade internal LLM representations; CKA-guided alignment preserves geometry for better low-bit accuracy.

Principles

Output-level alignment does not guarantee internal fidelity.
Preserving internal geometry is vital for robust low-bit generalization.
CKA is invariant to common activation transformations, such as orthogonal transformations and isotropic scaling.
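
The invariance principle can be checked numerically. The sketch below is a minimal NumPy implementation of linear CKA, written in its feature-space form, which equals the Gram-matrix form for a linear kernel. The source does not say which kernel CKA-QAD uses, so the linear kernel is an assumption, as are the function name `linear_cka` and the synthetic activations.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    # Center each feature column; this is equivalent to centering the Gram matrices.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # For linear kernels, ||Y^T X||_F^2 equals tr(K_x K_y), the alignment numerator.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(64, 16))              # synthetic stand-in for layer activations
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
rotated = 3.0 * teacher @ q + 5.0                # rotate, rescale, and shift
noisy = teacher + rng.normal(size=teacher.shape) # perturb the geometry itself

cka_rotated = linear_cka(teacher, rotated)
cka_noisy = linear_cka(teacher, noisy)
print(cka_rotated, cka_noisy)
```

The rotated, rescaled, and shifted copy scores 1 up to floating-point error, while the noisy copy scores lower, because only the noise changes the pairwise geometry between samples.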

Method

CKA-QAD augments standard QAD by adding a layerwise CKA regularizer to align intermediate activations' Gram matrices. It uses top-k logit distillation and dynamically balances the CKA term with the KL loss.
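
As a rough illustration of the top-k logit distillation term, the sketch below computes a KL divergence restricted to the teacher's k highest-scoring tokens at each position. The source does not specify how the top-k set is handled, so renormalizing both distributions over those k tokens is an assumption, and the names `topk_kl` and `log_softmax` are hypothetical helpers rather than the authors' code.

```python
import numpy as np

def log_softmax(z, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def topk_kl(teacher_logits, student_logits, k):
    """Mean KL(teacher || student) over the teacher's top-k tokens per position."""
    # Indices of the k largest teacher logits (their order within the k is irrelevant).
    idx = np.argpartition(teacher_logits, -k, axis=-1)[..., -k:]
    t = np.take_along_axis(teacher_logits, idx, axis=-1)
    s = np.take_along_axis(student_logits, idx, axis=-1)
    # Assumption: both distributions are renormalized over the selected k tokens.
    log_p = log_softmax(t)
    log_q = log_softmax(s)
    kl = (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)
    return float(kl.mean())

rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(4, 32))   # 4 positions, toy vocabulary of 32
student_logits = teacher_logits + 0.5 * rng.normal(size=(4, 32))

kl_same = topk_kl(teacher_logits, teacher_logits, k=8)
kl_diff = topk_kl(teacher_logits, student_logits, k=8)
print(kl_same, kl_diff)
```

Restricting the loss to k tokens keeps the per-position cost independent of vocabulary size, which is one plausible motivation for using it in a large-vocabulary distillation setup.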

In practice

Integrate CKA regularization into QAD pipelines.
Target decoder block outputs for CKA alignment.
Implement dynamic loss balancing for the CKA term.
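
The three practices above can be sketched together as follows. This is an illustration under stated assumptions, not the authors' implementation: the regularizer is taken to be the mean of 1 - CKA over the selected decoder block outputs, linear CKA is assumed, and the balancing rule (holding the CKA term at a fixed fraction of the KL term using exponential moving averages) is hypothetical, since the source only says the balance is dynamic. The class name `BalancedCkaLoss` and its parameters are likewise invented for this example.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA for activations flattened to shape (tokens, hidden)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, ord="fro")
                          * np.linalg.norm(y.T @ y, ord="fro")))

class BalancedCkaLoss:
    """Hypothetical rule: keep the weighted CKA term near target_ratio * KL."""

    def __init__(self, target_ratio=0.1, momentum=0.9, eps=1e-8):
        self.target_ratio = target_ratio
        self.momentum = momentum
        self.eps = eps
        self.ema_kl = None
        self.ema_cka = None

    def __call__(self, kl_loss, teacher_acts, student_acts):
        # Assumed regularizer: mean (1 - CKA) across the chosen decoder block outputs.
        cka_loss = float(np.mean([1.0 - linear_cka(t, s)
                                  for t, s in zip(teacher_acts, student_acts)]))
        # Track running magnitudes of both terms.
        if self.ema_kl is None:
            self.ema_kl, self.ema_cka = kl_loss, cka_loss
        else:
            m = self.momentum
            self.ema_kl = m * self.ema_kl + (1.0 - m) * kl_loss
            self.ema_cka = m * self.ema_cka + (1.0 - m) * cka_loss
        # In an autograd framework this weight would be treated as a constant.
        weight = self.target_ratio * self.ema_kl / (self.ema_cka + self.eps)
        return kl_loss + weight * cka_loss, weight, cka_loss

rng = np.random.default_rng(0)
# Three decoder blocks, each with 32 tokens and hidden size 8 (toy sizes).
teacher_acts = [rng.normal(size=(32, 8)) for _ in range(3)]
student_acts = [t + 0.1 * rng.normal(size=t.shape) for t in teacher_acts]

loss_fn = BalancedCkaLoss(target_ratio=0.1)
total, weight, cka_loss = loss_fn(0.5, teacher_acts, student_acts)  # 0.5 is a placeholder KL value
print(total, weight, cka_loss)
```

In a real training loop the same computation would run on framework tensors so that gradients flow through the student activations; NumPy is used here only to keep the example self-contained.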

Topics

NVFP4 Quantization
Knowledge Distillation
Centered Kernel Alignment
Large Language Models
Model Compression
Representational Similarity

Best for: Research Scientist, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.