CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

2026-03-18 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Emerging Technologies & Innovation · Depth: Expert, quick

Summary

CARE, a Covariance-Aware, Rank-Enhanced decomposition pipeline, converts pretrained attention modules like grouped-query attention (GQA) into multi-head latent attention (MLA) to enhance expressivity without increasing KV-cache costs. Existing conversion methods often use weight-only low-rank approximations and uniform rank allocation, leading to activation drift and degraded attention fidelity. CARE addresses these by introducing activation-preserving factorization, which aligns approximations with input activations; adjusted-rank allocation, which distributes a fixed KV budget based on layer needs; and KV-parity mapping, reparameterizing K and V to fit MLA while maintaining KV-cache size. This method significantly outperforms uniform-rank SVD baselines on models like Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. A brief post-SVD fine-tune fully recovers original model accuracy.

Key takeaway

For NLP engineers optimizing large language models for efficient inference, CARE converts GQA to MLA with markedly better perplexity and accuracy than uniform-rank SVD baselines at the same KV-cache budget. Consider integrating it into your model conversion pipeline, especially for models such as Qwen3 and Llama-3.1, and follow the conversion with a brief fine-tune to recover the original model's accuracy.

Key insights

CARE improves conversion to multi-head latent attention by accounting for activation covariance and by adjusting the rank allocated to each layer under a fixed KV budget.

Principles

Align approximations with input activations.
Allocate KV budget based on layer needs.
Maintain KV-cache size during reparameterization.
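
The first principle can be illustrated with a minimal sketch. It contrasts a weight-only truncated SVD with a factorization that weights the error by the input covariance, a standard whitening construction that is not necessarily CARE's exact procedure. The function names, the synthetic activations X, and the regularizer eps are all assumptions made for illustration.

```python
import numpy as np

def plain_svd_factor(W, rank):
    """Weight-only baseline: best rank-r approximation of W in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :rank] * s[:rank], Vt[:rank]

def covariance_aware_factor(W, X, rank, eps=1e-6):
    """Rank-r factors A @ B that (approximately) minimize ||X W^T - X (A B)^T||_F.

    Hypothetical helper: uses C = X^T X / n = L L^T, truncates the SVD of W @ L,
    then maps the right factor back through inv(L).
    """
    d_in = W.shape[1]
    C = X.T @ X / X.shape[0]
    # Small ridge term (an assumption) keeps the Cholesky factorization stable.
    C = C + eps * np.trace(C) / d_in * np.eye(d_in)
    L = np.linalg.cholesky(C)                      # C = L @ L.T
    U, s, Vt = np.linalg.svd(W @ L, full_matrices=False)
    A = U[:, :rank] * s[:rank]                     # shape (d_out, rank)
    B = np.linalg.solve(L.T, Vt[:rank].T).T        # Vt_r @ inv(L), shape (rank, d_in)
    return A, B

# Synthetic, strongly anisotropic activations (illustrative data only).
rng = np.random.default_rng(0)
d_in, d_out, n, rank = 32, 16, 2000, 4
scales = np.logspace(0, -2, d_in)
mix = rng.standard_normal((d_in, d_in))
X = (rng.standard_normal((n, d_in)) * scales) @ mix
W = rng.standard_normal((d_out, d_in))

def output_error(A, B):
    """Error measured on layer outputs rather than on the weights themselves."""
    return np.linalg.norm(X @ W.T - X @ (A @ B).T)

err_plain = output_error(*plain_svd_factor(W, rank))
err_aware = output_error(*covariance_aware_factor(W, X, rank))
print(f"plain SVD output error:        {err_plain:.4f}")
print(f"covariance-aware output error: {err_aware:.4f}")
```

Both factorizations use the same rank, so they cost the same to store; the difference is only in which directions of the input space the approximation spends that rank on.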

Method

CARE uses activation-preserving factorization, adjusted-rank allocation, and KV-parity mapping to convert GQA to MLA, optimizing for activation fidelity and KV budget.
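
One way to picture rank allocation under a fixed budget is a greedy rule: hand out ranks one at a time to whichever layer gains the most retained energy from its next singular value. This is an illustrative heuristic under stated assumptions, not the allocation rule from the paper. The function allocate_ranks, the min_rank floor, and the toy spectra are hypothetical.

```python
import heapq
import numpy as np

def allocate_ranks(singular_values, total_rank, min_rank=1):
    """Split a total rank budget across layers by greedy energy gain.

    singular_values: one descending 1-D array per layer (for example, the
    spectrum of that layer's weight or activation-weighted weight matrix).
    """
    ranks = [min(min_rank, len(s)) for s in singular_values]
    remaining = total_rank - sum(ranks)
    if remaining < 0:
        raise ValueError("total_rank is smaller than the per-layer minimum")

    # Max-heap (via negated keys) on the squared next singular value per layer.
    heap = [
        (-(float(s[r]) ** 2), i)
        for i, (s, r) in enumerate(zip(singular_values, ranks))
        if r < len(s)
    ]
    heapq.heapify(heap)

    while remaining > 0 and heap:
        _, i = heapq.heappop(heap)
        ranks[i] += 1
        remaining -= 1
        if ranks[i] < len(singular_values[i]):
            nxt = float(singular_values[i][ranks[i]])
            heapq.heappush(heap, (-(nxt ** 2), i))
    return ranks

# Toy spectra: a slowly decaying layer, a nearly rank-1 layer, and one in between.
spectra = [
    np.array([10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0]),
    np.array([10.0, 1.0, 0.5, 0.1, 0.05, 0.01, 0.01, 0.01]),
    np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1]),
]
budget = 12  # same total as a uniform rank of 4 per layer
ranks = allocate_ranks(spectra, budget)
print("uniform:", [budget // len(spectra)] * len(spectra))
print("greedy: ", ranks)
```

Because each layer's singular values are sorted in descending order, the greedy choice minimizes the total discarded squared singular values for the given budget; layers with a flat spectrum receive more rank than layers that are already close to low-rank.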

In practice

Apply CARE to convert GQA to MLA.
Use post-SVD fine-tuning to recover accuracy.
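
When converting at a matched KV budget, it can help to check the cache arithmetic first. The sketch below counts cached elements per token for a GQA model and for an MLA model, and confirms that a non-uniform set of latent ranks stays at parity. It assumes a single cached latent vector per token per layer and ignores any separate positional-encoding dimensions; the summary does not state how CARE lays out its cache. The 32-layer, 8-KV-head, 128-dimension configuration and the example rank split are illustrative values.

```python
def gqa_kv_elems_per_token(n_layers, n_kv_heads, head_dim):
    """GQA caches one key and one value vector per KV head, per layer."""
    return n_layers * 2 * n_kv_heads * head_dim

def mla_kv_elems_per_token(latent_ranks, extra_dim=0):
    """MLA (as assumed here) caches one latent of size r per layer.

    extra_dim is a hypothetical knob for any additional cached dimensions
    per layer; it defaults to zero in this sketch.
    """
    return sum(r + extra_dim for r in latent_ranks)

# Illustrative GQA configuration (assumed values).
n_layers, n_kv_heads, head_dim = 32, 8, 128
budget = gqa_kv_elems_per_token(n_layers, n_kv_heads, head_dim)

# Uniform allocation at parity: every layer gets the same latent rank.
uniform_ranks = [budget // n_layers] * n_layers

# A made-up non-uniform allocation: move rank from the first half of the
# layers to the second half while keeping the total unchanged.
shift = 512
half = n_layers // 2
adjusted_ranks = [r - shift for r in uniform_ranks[:half]] + [
    r + shift for r in uniform_ranks[half:]
]

print("GQA elements per token:     ", budget)
print("MLA uniform, per token:     ", mla_kv_elems_per_token(uniform_ranks))
print("MLA non-uniform, per token: ", mla_kv_elems_per_token(adjusted_ranks))
```

Multiplying any of these counts by the bytes per element and the sequence length gives the cache footprint, so equal element counts imply equal KV-cache memory at the same precision.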

Topics

Multi-Head Latent Attention
Grouped-Query Attention
Low-Rank Approximation
KV-Cache Optimization
Attention Mechanisms

Best for: NLP Engineer, AI Scientist, Research Scientist, AI Researcher, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.