Do we actually need our text-to-speech running locally? Yes, but only if it doesn’t turn our laptops into space heaters while we’re trying to use them. For too long, the industry has treated “on-device” as a marketing buzzword rather than a technical requirement. We’ve been fed a diet of massive cloud models that sound great in a curated demo but feel like a chore in production because of the round-trip latency. The moment you introduce a network hop into a voice interaction, you’ve already lost the battle for natural conversation. Most “local” solutions we’ve seen lately either wrap a cloud API in a pretty shell or require a GPU that costs more than the average developer’s first car.
That is why the release of supertonic is interesting. It isn’t trying to out-parameterize the giants; instead, it focuses on the boring, difficult work of making multilingual TTS run natively via ONNX. For those of us who have spent time in the trenches of deployment, ONNX is the only sane way to handle the fragmented mess of local hardware. It lets one exported model move across different hardware backends without requiring the developer to rewrite the entire inference engine every time a new chip drops (or so the documentation suggests). By targeting a format that is genuinely portable, they are attacking the biggest friction point in local AI: the installation process.
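To make the portability point concrete, here is a minimal sketch of how backend selection tends to look with ONNX Runtime's Python package. It is not taken from supertonic. The preference order and the `open_session` helper are our own assumptions, and the model path is whatever file you have exported.

```python
from typing import List, Sequence

# Assumed preference order for illustration only; tune it per product.
PREFERRED = (
    "CUDAExecutionProvider",    # NVIDIA GPUs
    "CoreMLExecutionProvider",  # Apple hardware
    "DmlExecutionProvider",     # DirectML on Windows
    "CPUExecutionProvider",     # always-available fallback
)


def pick_providers(
    available: Sequence[str], preferred: Sequence[str] = PREFERRED
) -> List[str]:
    """Keep the preferred providers that this machine reports, in order.

    The CPU provider is always appended last so session creation can
    fall back to it if an accelerator is missing.
    """
    chosen = [p for p in preferred if p in available]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


def open_session(model_path: str):
    """Hypothetical helper: open a model on the best available backend.

    onnxruntime is imported lazily so the selection logic above can be
    exercised on machines that do not have it installed.
    """
    import onnxruntime as ort

    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

The same model file goes through the same call on every machine; only the provider list changes. Whether a given provider supports every operator in a particular TTS graph is a separate question that only testing on the target hardware answers.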
Of course, “lightning-fast” is a relative term. The real-world friction always comes down to the runtime environment and the specific version of ONNX Runtime you’re fighting with on a given OS. There is a specific kind of hell reserved for developers trying to align C++ runtimes across Windows and Linux while keeping memory usage (RAM or VRAM) under a ceiling that doesn’t crash the rest of the system. Even with an optimized model, you’re still at the mercy of the user’s hardware. If the weights are too heavy for a mid-range laptop or if the memory bandwidth is throttled, the speed gains are purely academic. Then again, if the quantization is tight enough, it might actually be usable.
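The "will it fit" question is mostly arithmetic, so a rough back-of-the-envelope sketch helps. The parameter count you feed in, the 20% overhead factor, and the memory budget are all assumptions for illustration; none of them are published figures for supertonic or any other model.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

MIB = 1024 ** 2


def weight_footprint_mib(num_params: int, dtype: str) -> float:
    """Size of the raw weights in MiB at the given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / MIB


def fits(num_params: int, dtype: str, budget_mib: float,
         overhead: float = 1.2) -> bool:
    """Rough check: do the weights fit inside a memory budget?

    `overhead` is an assumed fudge factor for activations and runtime
    buffers; the real number depends on the model and the runtime.
    """
    return weight_footprint_mib(num_params, dtype) * overhead <= budget_mib
```

For a hypothetical 100-million-parameter model, calling `fits(100_000_000, "fp32", 400)` fails while `fits(100_000_000, "int8", 128)` passes, which is the whole argument for quantization in one line. It says nothing about memory bandwidth, which can still throttle you after the weights are loaded.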
Still, the strategic move here is the pivot away from the cloud. The obsession with massive, centralized models has created a bottleneck that makes real-time agents feel clunky and robotic. Using a cloud-based TTS for a local agent is like trying to have a conversation through a bad long-distance phone call from 1994—you spend half the time waiting for the other person to realize you’ve finished your sentence. Who actually enjoys waiting three seconds for a voice response in a real-time loop? It kills the UX and makes the entire “AI assistant” concept feel like a toy rather than a tool.
The industry needs to stop chasing the highest possible MOS (Mean Opinion Score) in a vacuum and start prioritizing the “last mile” of execution. A model that sounds 5% less human but responds in 50 milliseconds is far more useful in a live conversation than a perfect voice that lags. We have reached a point of diminishing returns on audio fidelity; the real frontier is now the latency between input and output. If you can’t interrupt the AI because the server is still processing the previous token, you don’t have a conversation. You have a series of monologues.
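The latency argument can be made with two numbers: time to first audio and real-time factor. Below is a minimal sketch of both. The `Pipeline` fields and every value you plug into them are hypothetical; measure your own stack before drawing conclusions.

```python
from dataclasses import dataclass


@dataclass
class Pipeline:
    """Hypothetical latency components of one TTS request, in milliseconds."""
    network_rtt_ms: float        # 0 for a purely local model
    first_chunk_synth_ms: float  # time to synthesize the first audio chunk
    playback_buffer_ms: float    # audio buffered before playback starts

    def time_to_first_audio_ms(self) -> float:
        """Delay between sending text and hearing the first sound."""
        return (self.network_rtt_ms
                + self.first_chunk_synth_ms
                + self.playback_buffer_ms)


def real_time_factor(synth_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 keeps up with playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synth_seconds / audio_seconds
```

With identical synthesis speed, `Pipeline(0, 40, 10)` reaches first audio in 50 ms and `Pipeline(150, 40, 10)` takes 200 ms. The only difference is the network hop, and no amount of extra fidelity buys it back.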
By Q4, we will see a wave of local-first AI agents that ditch cloud TTS entirely to avoid this latency tax, moving toward a standard where the voice is an integrated part of the local binary rather than an API call. This shift will force a reckoning for the API providers who have been charging by the character for something that can now be done on a decent MacBook. The value is shifting from the model weights to the runtime optimization. Whoever makes the voice feel the most “instant” wins, regardless of whether they have the most parameters.
It’s a win for the edge, provided the weights actually fit.