Together AI’s OSCAR: 2-Bit KV Cache Quantization for Long Context

Does 2-bit KV cache quantization actually work without turning your model into a rambling mess? Yes, but only if you stop pretending that data-oblivious transforms are enough to save your VRAM.

For anyone running local weights, the KV cache is the invisible wall. You can quantize the model weights to 4-bit or even 2-bit to fit the model on the card, but the moment you start pushing the context window toward 32k or 128k tokens, the KV cache balloons and eats every remaining byte of memory. Together AI is trying to fix this with OSCAR, which pushes the KV cache down to INT2.

It is a bit like trying to fit a king-sized mattress into a Mini Cooper—you can do it, but you’re going to have to fold things in ways that feel fundamentally wrong. The trick here is “Attention-Aware” rotation. Instead of using a generic Hadamard transform to smooth out the outliers before quantization, OSCAR looks at the actual spectral covariance of the attention heads. It derives specific rotations for keys and values that actually respect the data.

Will this let me run a 128k context on a 3090?

In theory, yes. If you are running a mid-sized model (think Llama 3.3 or a Qwen 2.5 variant), the KV cache usually becomes the primary bottleneck long before the weights do. Going from 16 bits per value to 2 shrinks the cache to roughly one-eighth of its FP16 size, so the same VRAM budget holds up to about eight times as many tokens, before you account for the scales and zero points the quantizer has to store alongside the 2-bit codes. For a 3090 or 4090 with 24GB of VRAM, this is the difference between hitting an Out-Of-Memory (OOM) error at 16k tokens and actually processing a full technical manual.
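To put rough numbers on the memory math, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128) is an assumption picked for illustration, not a figure from Together AI, and the calculation ignores the scales and zero points a real 2-bit cache also has to store.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bits_per_value):
    """Bytes needed for keys and values, ignoring quantization metadata."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2 = keys + values
    return n_tokens * values_per_token * bits_per_value // 8


# Assumed, illustrative model shape (not taken from the OSCAR announcement).
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for ctx in (16_384, 131_072):
    fp16 = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, 16)
    int2 = kv_cache_bytes(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, 2)
    print(f"{ctx:>7} tokens: FP16 {fp16 / 2**30:.2f} GiB, INT2 {int2 / 2**30:.2f} GiB")
# prints:
#   16384 tokens: FP16 2.00 GiB, INT2 0.25 GiB
#  131072 tokens: FP16 16.00 GiB, INT2 2.00 GiB
```

With this assumed shape, a 128k-token cache drops from 16 GiB to 2 GiB, which is the kind of gap that decides whether the context fits next to the weights on a 24GB card.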

However, the real-world friction comes from the pre-computation. Since OSCAR is “offline,” you need to calculate those rotation matrices before you start serving. That is a fair trade for the memory gains, but it adds a step to the pipeline. If you’re using a Mac with an M3 Ultra and massive unified memory, you might not care as much, but for those of us fighting for every megabyte on NVIDIA hardware, this looks like the most promising way forward.

Why is spectral covariance better than Hadamard rotations?

Most “rotation” methods for quantization are basically just mathematical shuffles. They try to spread the high-variance values across the vector so that a 2-bit or 4-bit quantization doesn’t lose too much information. The problem is that these methods are data-oblivious; they treat every head and every model the same. It’s a blunt instrument.
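To see what a data-oblivious rotation buys you, here is a toy sketch in plain Python: one vector with a single large outlier, quantized to four levels with and without a normalized Hadamard rotation. The vector, the min-max quantizer, and the function names are all made up for illustration; real KV cache quantizers work per group or per channel and are more careful than this.

```python
import math


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (length must be a power of two).

    The normalized transform is orthogonal and symmetric, so it is its own inverse.
    """
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_2bit(vec):
    """Asymmetric min-max quantization to 4 levels, returned already dequantized."""
    lo, hi = min(vec), max(vec)
    step = (hi - lo) / 3 or 1.0  # 3 intervals between 4 levels; guard constant input
    return [lo + round((v - lo) / step) * step for v in vec]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Toy vector: one outlier channel dominates the range.
x = [8.0, 0.1, -0.2, 0.3, -0.1, 0.2, -0.3, 0.1]

plain = quantize_2bit(x)
rotated = fwht(quantize_2bit(fwht(x)))  # rotate, quantize, rotate back

mse_plain = mse(x, plain)
mse_rotated = mse(x, rotated)
print(f"2-bit MSE without rotation: {mse_plain:.4f}")
print(f"2-bit MSE with Hadamard rotation: {mse_rotated:.4f}")
```

On this hand-picked vector the error drops from roughly 0.12 to roughly 0.004, because the rotation spreads the outlier across every coordinate and the four levels no longer have to span a huge range. The rotation is the same for every head and every model, which is exactly the "data-oblivious" part.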

OSCAR takes a more surgical approach by analyzing the actual covariance statistics of each attention head. By tailoring the rotation to the specific spectral properties of the keys and values, it aims to preserve the “signal” that the model actually needs to maintain coherence over long distances. Or maybe it doesn’t preserve it perfectly. We’ve seen “lossless” quantization claims before that fall apart the moment you ask the model to retrieve a specific fact from the middle of a 100k token prompt. Still, it is a significantly smarter approach than the data-oblivious rotations most current methods rely on.
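For a feel of what "data-dependent" means in practice, here is a minimal NumPy sketch of an offline calibration step. It is not OSCAR's algorithm: it uses the eigenvectors of a covariance matrix estimated from synthetic "calibration keys" as a stand-in rotation, and every name and number in it is an assumption. What it does show is the property any such scheme relies on: an orthogonal rotation applied to both queries and keys leaves the attention logits unchanged, so the rotation can be chosen purely to make quantization easier.

```python
import numpy as np

rng = np.random.default_rng(0)
head_dim, n_calib = 16, 4096

# Synthetic stand-in for one head's calibration keys: two directions carry
# most of the variance, then get mixed across channels.
channel_scale = np.ones(head_dim)
channel_scale[:2] = 10.0
mixing, _ = np.linalg.qr(rng.standard_normal((head_dim, head_dim)))
calib_keys = (rng.standard_normal((n_calib, head_dim)) * channel_scale) @ mixing

# Offline step: estimate the covariance and take its eigenvectors as a rotation.
cov = np.cov(calib_keys, rowvar=False)
eigvals, rotation = np.linalg.eigh(cov)  # columns are orthonormal eigenvectors

# The rotation is orthogonal, so rotating q and k together preserves q . k.
q = rng.standard_normal(head_dim)
k = rng.standard_normal(head_dim)
logit_plain = q @ k
logit_rotated = (q @ rotation) @ (k @ rotation)

# In the rotated basis the calibration covariance is diagonal.
rotated_cov = np.cov(calib_keys @ rotation, rowvar=False)

print("orthogonal:", np.allclose(rotation.T @ rotation, np.eye(head_dim)))
print("logit preserved:", np.isclose(logit_plain, logit_rotated))
print("covariance diagonalized:", np.allclose(rotated_cov, np.diag(eigvals), atol=1e-6))
```

The rotation here is computed once from calibration data and then reused at serving time, which mirrors the "offline" cost described earlier. How OSCAR actually derives separate rotations for keys and values from the attention statistics is not something this sketch attempts to reproduce.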

Which inference engines will actually implement this?

This is where the rubber meets the road. Right now, most of us are living in the world of llama.cpp, Ollama, or vLLM. For OSCAR to matter, it needs to move beyond a research paper and into a kernel that doesn’t tank your tokens-per-second. If this gets integrated into vLLM or sglang, it becomes a massive win for anyone serving long-context models on a budget.
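Part of why the kernel matters: a 2-bit cache stores four codes per byte, so every attention step has to unpack (and dequantize) values before it can use them. A plain-Python sketch of that packing, with a byte layout chosen arbitrarily for illustration rather than taken from any real engine:

```python
def pack_2bit(codes):
    """Pack 2-bit codes (ints in 0..3) four to a byte, lowest bits first."""
    if len(codes) % 4:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        c0, c1, c2, c3 = codes[i:i + 4]
        out.append(c0 | (c1 << 2) | (c2 << 4) | (c3 << 6))
    return bytes(out)


def unpack_2bit(packed):
    """Inverse of pack_2bit: recover the list of 2-bit codes."""
    codes = []
    for byte in packed:
        codes.extend((byte >> shift) & 0b11 for shift in (0, 2, 4, 6))
    return codes


codes = [0, 1, 2, 3, 3, 2, 1, 0]
packed = pack_2bit(codes)
print(len(codes), "codes ->", len(packed), "bytes")  # prints: 8 codes -> 2 bytes
```

A production kernel does this unpacking on the GPU, fused with the attention math, and that fused path is what has to be fast for the memory savings to be worth it.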

We are already seeing a shift toward more aggressive quantization in the open-weights pecking order. While Llama 3.3 is a beast, its memory requirements for long contexts are punishing. If OSCAR becomes the standard for INT2 cache, it effectively lowers the hardware floor for “long-context” local AI. I expect to see a functional implementation of OSCAR in a major inference engine by Q4.

A necessary evil for the local-host crowd.
