Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving

· Source: Machine Learning ML & Generative AI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

Together AI has open-sourced OSCAR, a 2-bit KV cache quantization system for serving long-context large language models. Most existing INT2 methods suffer significant accuracy loss because they rely on data-oblivious Hadamard rotations; OSCAR instead uses attention-aware rotations. It rotates keys using the query covariance (Q⊤Q) and values using the score-weighted value covariance (V⊤S⊤SV), which directs quantization noise away from attention-sensitive directions. Reported accuracy stays close to the BF16 baseline: Qwen3-32B changes by -0.02 points, and GLM-4.7-FP8 (358B) gains 0.27 points. The system achieves roughly 8x KV memory reduction, a 3.08x decode speedup at 100K context, and 7.83x job-level throughput, scaling to 256 concurrent requests on a single H100 (80GB). Pre-computed rotation matrices are available via RotationZoo on ModelScope and are integrated into SGLang.
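The "roughly 8x" memory figure follows from storing each cached element in 2 bits instead of 16. A minimal arithmetic sketch of that ratio is given here; the model dimensions, the group size, and the per-group metadata width are hypothetical values chosen for illustration, not OSCAR's actual storage layout or any specific model's configuration.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits,
                   group_size=None, meta_bits=0):
    """Estimate KV cache size in bytes for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    group_size and meta_bits model optional per-group quantization
    metadata (for example a scale and a zero point); both are assumptions.
    """
    elems = 2 * layers * kv_heads * head_dim * tokens
    payload_bits = elems * bits
    meta_total_bits = 0 if group_size is None else (elems // group_size) * meta_bits
    return (payload_bits + meta_total_bits) / 8


# Hypothetical model shape and context length (illustrative only).
shape = dict(tokens=100_000, layers=64, kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(bits=16, **shape)
int2_ideal = kv_cache_bytes(bits=2, **shape)
# Assumed overhead: one 16-bit scale plus one 16-bit zero point per 128 elements.
int2_meta = kv_cache_bytes(bits=2, group_size=128, meta_bits=32, **shape)

print(f"BF16 cache:           {bf16 / 1e9:.2f} GB")
print(f"INT2 cache (ideal):   {int2_ideal / 1e9:.2f} GB, ratio {bf16 / int2_ideal:.2f}x")
print(f"INT2 cache (w/ meta): {int2_meta / 1e9:.2f} GB, ratio {bf16 / int2_meta:.2f}x")
```

Under these assumptions the payload alone gives exactly 16 / 2 = 8x, and any per-group metadata pulls the effective ratio somewhat below 8x, which is one plausible reason the figure is quoted as approximate.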

Key takeaway

Machine learning engineers optimizing long-context LLM serving should evaluate Together AI's OSCAR system. Its attention-aware 2-bit KV cache quantization cuts KV memory by roughly 8x and speeds up decoding by about 3x at 100K context, scaling to 256 concurrent requests on a single H100 (80GB) while keeping accuracy close to BF16. That combination makes it a strong candidate for cost-effective, high-throughput inference. The pre-computed rotation matrices in RotationZoo are a practical starting point.

Key insights

OSCAR uses attention statistics to guide 2-bit KV cache quantization, preserving LLM accuracy and performance.

Method

OSCAR employs two distinct rotations: keys are rotated using the query covariance Q⊤Q, and values are rotated using the score-weighted value covariance V⊤S⊤SV.
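The two rotations can be illustrated with a small NumPy sketch. The summary does not say how OSCAR derives a rotation from each covariance matrix, so this example assumes the simplest option, the orthogonal eigenbasis of the covariance; the helper names and the per-row 4-level quantizer are likewise hypothetical stand-ins, not OSCAR's kernels or the SGLang API. What the sketch does show reliably is that an orthogonal rotation applied to both queries and keys leaves attention logits unchanged, so the rotation only changes where quantization noise lands.

```python
import numpy as np


def covariance_eigenbasis(cov):
    """Orthogonal matrix whose columns are eigenvectors of a symmetric matrix.

    Assumption: the eigenbasis stands in for OSCAR's actual rotation.
    """
    _, vecs = np.linalg.eigh(cov)
    return vecs


def quantize_2bit(x):
    """Per-row asymmetric uniform quantization to 4 levels, then dequantize."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = np.where(hi > lo, (hi - lo) / 3.0, 1.0)  # 4 levels -> 3 steps
    codes = np.clip(np.round((x - lo) / scale), 0, 3)
    return codes * scale + lo


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


# Synthetic calibration data for a single attention head (illustrative only).
rng = np.random.default_rng(0)
n, d = 64, 16
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
S = softmax(Q @ K.T / np.sqrt(d))  # attention scores, shape (n, n)

# Key rotation from the query covariance Q^T Q (d x d).
R_k = covariance_eigenbasis(Q.T @ Q)
# Value rotation from the score-weighted value covariance V^T S^T S V (d x d).
R_v = covariance_eigenbasis(V.T @ S.T @ S @ V)

# Rotating queries and keys by the same orthogonal matrix preserves logits.
logits_ref = Q @ K.T
logits_rot = (Q @ R_k) @ (K @ R_k).T

# Quantize in the rotated basis, then rotate back to the original basis.
K_rot_q = quantize_2bit(K @ R_k)
K_hat = K_rot_q @ R_k.T
V_hat = quantize_2bit(V @ R_v) @ R_v.T

print("max logit difference after rotation:", np.abs(logits_ref - logits_rot).max())
print("key reconstruction error:", np.linalg.norm(K - K_hat))
```

In a real serving stack the rotations would be computed offline from calibration data, which matches the role the pre-computed RotationZoo matrices play; the synthetic Gaussian data here says nothing about the accuracy OSCAR reports.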

Best for: MLOps Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.