Together AI Open-Sources OSCAR: An Attention-Aware 2-Bit KV Cache Quantization System for Long-Context LLM Serving
Summary
Together AI has open-sourced OSCAR, a 2-bit KV cache quantization system for serving long-context large language models. Most existing INT2 methods rely on data-oblivious Hadamard rotations and lose significant accuracy. OSCAR instead uses attention-aware rotations: it rotates keys using the query covariance (Q⊤Q) and values using the score-weighted value covariance (V⊤S⊤SV), directing quantization noise away from attention-sensitive directions. Reported accuracy stays close to BF16: Qwen3-32B drops by only 0.02 points, and GLM-4.7-FP8 (358B) gains 0.27 points. The system achieves roughly 8x KV memory reduction, a 3.08x decode speedup at 100K context, and 7.83x job-level throughput, scaling to 256 concurrent requests on a single H100 (80GB). Pre-computed rotation matrices are available via RotationZoo on ModelScope and integrated into SGLang.
Key takeaway
Machine learning engineers optimizing long-context LLM serving should evaluate Together AI's OSCAR system. Its attention-aware 2-bit KV cache quantization cuts KV memory by approximately 8x and speeds up decoding by 3.08x at 100K context. On a single H100 (80GB) it scales to 256 concurrent requests while staying close to BF16 accuracy, which makes it a strong option for cost-effective, high-throughput inference. Consider starting with the pre-computed RotationZoo matrices.
Key insights
OSCAR uses attention statistics to guide 2-bit KV cache quantization, preserving LLM accuracy and performance.
Principles
- Generic Hadamard rotations are data-oblivious.
- Attention statistics can guide quantization noise placement.
- Quantization noise should be pushed into least sensitive directions.
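The first principle can be checked numerically. The sketch below is a NumPy illustration with random stand-in tensors and hypothetical shapes, not OSCAR's code. It builds an orthonormal Hadamard matrix, which depends only on the head dimension and never on the data, and shows that rotating queries and keys by the same orthogonal matrix leaves the attention scores unchanged. Rotations therefore differ only in how quantization error is distributed once the rotated tensors are quantized.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

rng = np.random.default_rng(0)
d = 8                          # hypothetical head dimension
Q = rng.normal(size=(16, d))   # stand-in queries
K = rng.normal(size=(32, d))   # stand-in keys

R = hadamard(d)                # built from d alone: data-oblivious

scores = Q @ K.T
rotated_scores = (Q @ R) @ (K @ R).T   # R @ R.T = I, so the scores are preserved
max_diff = float(np.abs(scores - rotated_scores).max())
print(f"max score difference after rotation: {max_diff:.2e}")
```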
Method
OSCAR applies two distinct rotations: keys are rotated using the query covariance Q⊤Q, and values are rotated using the score-weighted value covariance V⊤S⊤SV.
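A minimal sketch of how the two statistics could be formed and turned into rotations, under assumptions the summary does not confirm: S is taken to be the softmax attention score matrix, the rotation is taken to be the eigenvector basis of each covariance, and the 2-bit quantizer is a simple per-row min/max scheme. The function names, shapes, and random data are all hypothetical; OSCAR's actual construction may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rotation_from_covariance(C):
    """ASSUMPTION: use the orthonormal eigenvectors of symmetric C as the rotation."""
    _, eigvecs = np.linalg.eigh(C)
    return eigvecs

def quantize_2bit(x):
    """Illustrative per-row asymmetric quantizer with 4 levels (codes 0..3)."""
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    scale = (hi - lo) / 3.0
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant rows
    codes = np.clip(np.round((x - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

rng = np.random.default_rng(0)
n_q, n_k, d = 64, 128, 16                  # hypothetical sizes
Q = rng.normal(size=(n_q, d))
K = rng.normal(size=(n_k, d))
V = rng.normal(size=(n_k, d))
S = softmax(Q @ K.T / np.sqrt(d))          # assumed meaning of S: attention scores

C_k = Q.T @ Q                              # query covariance, d x d
C_v = V.T @ S.T @ S @ V                    # score-weighted value covariance, d x d
R_k = rotation_from_covariance(C_k)
R_v = rotation_from_covariance(C_v)

# Rotate keys, quantize to 2 bits, then map back to the original basis.
K_rot = K @ R_k
codes, scale, lo = quantize_2bit(K_rot)
K_hat = dequantize(codes, scale, lo) @ R_k.T
key_error = float(np.linalg.norm(K - K_hat) / np.linalg.norm(K))
print(f"relative key reconstruction error: {key_error:.3f}")
```

Because both rotations are orthogonal, applying R_k to queries as well as keys, or undoing R_v after the weighted sum over values, reproduces the unquantized attention output exactly. The choice of rotation only changes where the 2-bit rounding error ends up.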
In practice
- Achieve ~8x KV memory reduction for LLMs.
- Gain 3.08x decode speedup at 100K context.
- Utilize RotationZoo for pre-computed rotation matrices.
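The ~8x figure matches simple bit arithmetic: BF16 stores 16 bits per cached value and a 2-bit code stores 2. The sketch below works this through for a hypothetical model configuration (the layer count, KV head count, head dimension, group size, and metadata width are illustrative assumptions, not OSCAR's or any named model's settings). It also shows that per-group scale and zero-point metadata, if present, pulls the ratio slightly below 8x, which is one possible reason the reduction is quoted as approximate.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value, group_size=None, meta_bits_per_group=0):
    """Bytes needed to cache keys and values for one sequence."""
    values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    total_bits = values * bits_per_value
    if group_size:
        # Optional quantization metadata (e.g. scale and zero point) per group.
        total_bits += (values // group_size) * meta_bits_per_group
    return total_bits / 8

# Hypothetical configuration, for illustration only.
cfg = dict(num_layers=64, num_kv_heads=8, head_dim=128, seq_len=100_000)

bf16 = kv_cache_bytes(**cfg, bits_per_value=16)
int2_ideal = kv_cache_bytes(**cfg, bits_per_value=2)
int2_meta = kv_cache_bytes(**cfg, bits_per_value=2,
                           group_size=128, meta_bits_per_group=32)

ideal_ratio = bf16 / int2_ideal
meta_ratio = bf16 / int2_meta
print(f"BF16 cache:          {bf16 / 1e9:.2f} GB")
print(f"2-bit, no metadata:  {int2_ideal / 1e9:.2f} GB ({ideal_ratio:.2f}x smaller)")
print(f"2-bit with metadata: {int2_meta / 1e9:.2f} GB ({meta_ratio:.2f}x smaller)")
```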
Topics
- KV Cache Quantization
- Long-Context LLMs
- Attention Mechanisms
- LLM Inference Optimization
- 2-bit Quantization
- SGLang
Best for: MLOps Engineer, NLP Engineer, AI Architect, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning ML & Generative AI News.