Subquadratic's Claim of Breaking the LLM…

Imagine a developer at 3am, the only light in the room coming from a dual-monitor setup and a dying desk lamp. They are staring at a terminal window that has just spat out a CUDA out-of-memory error for the tenth time tonight. They’ve tried every quantization trick in the book and stripped the prompt to its bare essentials, but the quadratic cost of attention is a wall they cannot climb. It is the tax we all pay for long-context windows, and it is the single biggest friction point in scaling LLMs (the kind of thing that makes you want to throw your monitor out the window). The memory wall isn’t just a technical hurdle; it’s a financial one that keeps the most interesting experiments locked behind a paywall of H100 clusters.
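To put a rough number on that wall, here is a back-of-the-envelope sketch. It assumes a naive implementation that materializes the full score matrix for a single head and a single sequence in fp16 (2 bytes per element); the function name is mine, purely for illustration. Fused kernels such as FlashAttention avoid storing this matrix, so treat it as the worst case for memory. The compute still grows with the square of the sequence length either way.

```python
def attention_scores_bytes(seq_len: int, bytes_per_element: int = 2) -> int:
    """Bytes needed to hold one seq_len x seq_len attention score matrix.

    Assumption: one head, one sequence, fp16 by default (2 bytes per element),
    and a naive implementation that stores the whole matrix at once.
    """
    return seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        gib = attention_scores_bytes(n) / 2**30
        print(f"{n:>7} tokens -> {gib:8.3f} GiB per head")
```

Doubling the context quadruples the number. That is the whole problem in one line of arithmetic.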

Enter Subquadratic. The startup recently stepped out of stealth claiming it has finally broken through this bottleneck. According to MIT Technology Review, the company believes it has solved the fundamental efficiency problem that makes scaling context so prohibitively expensive. On a slide deck, it looks like a miracle. To a venture capitalist looking for the next big architectural shift, it looks like a gold mine. But for those of us who actually spend our days fighting with tensors and VRAM, the excitement is tempered by a healthy dose of skepticism. We’ve seen the “quadratic to linear” promise before, and usually, the cost is a massive drop in retrieval quality.

Who actually believes the “out of stealth” press release? We have seen this movie before. Every few months, a new lab claims to have replaced the attention mechanism with something linear or recurrent that doesn’t sacrifice quality. It is like a chef claiming they have found a way to bake a cake in thirty seconds without using a microwave—sure, maybe you’ve found a weird chemical shortcut, but does the cake actually taste like cake? Most of these claims end up being slight optimizations of existing linear attention variants or a fancy way to prune the KV cache that falls apart the moment you move from a toy dataset to a real-world workload. Or maybe I’m being too cynical—but then again, look at the graveyard of “efficient” transformers from two years ago.

The real friction here isn’t just the theory; it’s the hardware reality. Even if Subquadratic has a more efficient way to handle tokens, we are still tethered to H100s that are priced like luxury condos. Efficiency gains are only useful if they actually lower the VRAM floor or allow for meaningful throughput increases on existing clusters without requiring a proprietary hardware shim. If this “breakthrough” requires a specific kernel that only works on a handful of GPUs or a closed-source runtime that forces you into a specific cloud provider, it isn’t a solution—it’s just a new dependency. The goal isn’t just “faster,” it’s “accessible.”

The industry has a bad habit of treating “stealth” as a substitute for a technical paper. If the math isn’t public, the claim is just a story. I suspect they are hiding the details because the “solution” is either a marginal gain dressed up as a leap or a trick that only works in very specific, narrow contexts. (I’ve been burned by “stealth” promises before.) We don’t need another vague promise of efficiency; we need a verifiable benchmark that shows a 100k-token context window running on consumer hardware without the perplexity spiking into the stratosphere. Until we see the weights or the architecture, this is just marketing.
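For a sense of what “100k on consumer hardware” is up against, here is a rough sketch of the KV cache footprint alone. Every model dimension below is an assumption I picked for illustration (32 layers, 8 key/value heads, head dimension 128, fp16), not a description of Subquadratic’s architecture or of any specific model.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int = 32,         # assumed, illustrative
    n_kv_heads: int = 8,        # assumed, illustrative
    head_dim: int = 128,        # assumed, illustrative
    bytes_per_element: int = 2, # fp16
) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values
    at every layer for every cached token.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element


if __name__ == "__main__":
    size = kv_cache_bytes(100_000)
    print(f"KV cache at 100k tokens: {size / 2**30:.1f} GiB")
```

Under those assumptions the cache alone comes to roughly 12 GiB before a single weight is loaded. It grows linearly rather than quadratically, but it is still the kind of number a credible public benchmark would have to account for.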

We will know the truth soon enough. Either they release a technical whitepaper that survives peer review from people who actually understand the hardware, or they quietly pivot to a “managed service” where you never see the internals and just pay a premium for the “magic.” I’ll bet that by Q4, either the hype will be backed by a public benchmark that objectively beats FlashAttention-3, or the company will disappear back into the stealth void from which it emerged. The gap between a slide deck and a working implementation is a wide one, and very few companies actually jump it.

Most “stealth” breakthroughs are just better PR.
