Remember when the government thought “the cloud” just meant someone else’s computer, and then spent a decade trying to build “GovClouds” that were essentially just AWS with a more expensive sticker?
The air-gap paradox
There is a fundamental, almost comedic tension in the report from The Decoder regarding the use of closed-source models on top-secret networks. For the uninitiated, a top-secret network is, by definition, air-gapped or strictly controlled to prevent data exfiltration. Yet the push is toward models from OpenAI and Google. (I assume the Pentagon can actually find their passwords first.) You cannot simply “plug” a GPT-4 API into a network that isn’t allowed to talk to the public internet. This means they are either building incredibly expensive private instances of these models or, more likely, struggling with the logistical nightmare of making proprietary black boxes play nice with hardware that isn’t allowed to call home.
Why on earth would you trust a closed API with top-secret intel? It is like hiring a contractor to fix your vault but letting them keep a copy of the keys and a live feed of the interior. If the goal is truly “top secret,” the only rational move is to go full open-weights. If I’m running a cyber-offensive task force, I don’t want a model that might be updated by a third party on a Tuesday, suddenly refusing to “find vulnerabilities” because of a new safety alignment layer. I want a weight file I can freeze, fine-tune on my own proprietary exploit datasets, and run on hardware I physically own.
The VRAM tax
This is where the rubber meets the road for the devs. If Cyber Command actually pivots to a local strategy, which they must if they want any real security, they aren’t looking at a 4090 or even a maxed-out Mac Studio. To get the reasoning capabilities required for actual vulnerability research, they need the heavy hitters. We are talking Llama 3.1 405B or the largest Qwen 2.5 variants. To run a 405B model at a usable tokens-per-second rate, you aren’t just buying a GPU; you’re buying a cluster of H100s. Even with aggressive quantization (think GGUF Q4_K or EXL2), the VRAM floor is staggering: at roughly 4.5 bits per weight, the weights alone come to about 230 GB, which is three 80 GB H100s before you have cached a single token. At 8 bits they are about 405 GB, which in practice means a full eight-GPU node, and at 16 bits about 810 GB, more than an eight-GPU node holds. And all of that is before you even start thinking about the context window needed to ingest a massive codebase for auditing.
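For anyone who wants to check the VRAM math, here is a minimal back-of-the-envelope sketch. The function names are made up for illustration, the 80 GB figure assumes H100 80GB cards, and the KV-cache shape (126 layers, 8 KV heads, head dimension 128) is my reading of the published Llama 3.1 405B config. It ignores activations, framework overhead, and the fact that tensor parallelism usually wants a power-of-two GPU count, so treat the results as a floor, not a deployment plan.

```python
import math


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """Approximate KV cache size for one sequence, in GB.

    The leading 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes an FP16/BF16 cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens / 1e9


def gpus_needed(total_gb: float, gpu_gb: float = 80.0) -> int:
    """Minimum GPU count by raw capacity only (assumes 80 GB cards)."""
    return math.ceil(total_gb / gpu_gb)


if __name__ == "__main__":
    # Assumed Llama 3.1 405B shape: 126 layers, 8 KV heads, head_dim 128.
    kv = kv_cache_gb(tokens=131072, layers=126, kv_heads=8, head_dim=128)
    for label, bits in [("FP16", 16), ("FP8", 8), ("~Q4", 4.5)]:
        w = weight_gb(405, bits)
        print(f"{label}: weights {w:.0f} GB, "
              f"+128K-token KV cache {w + kv:.0f} GB, "
              f"GPUs by capacity: {gpus_needed(w + kv)}")
```

The takeaway from the sketch is that the long context is not free: under these assumptions a single 128K-token sequence adds tens of gigabytes on top of the weights, which is why the GPU count climbs even for a 4-bit model.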
If they want something more nimble for the “edge” of their secret networks, something like Cohere’s Command R or Mistral’s larger weights make more sense. Command R, in particular, is built for RAG and tool use, which is exactly what a cyber-analyst needs when querying a massive database of known CVEs. But the license is the catch. Most of these “open” weights come with non-commercial or otherwise restrictive licenses (Command R’s public weights, for instance, are released for non-commercial use) or “acceptable use” policies that the military might find stifling. The only true freedom is in Apache 2.0 territory, where you can modify the model to be as aggressive as the mission requires without asking for permission from a corporate ethics board in San Francisco.
The vulnerability race
The claim that AI finds bugs faster than humans is close to a given, provided the prompt is right. But the real win isn’t the speed; it’s the scale. A human hacker is a sniper; an LLM is a carpet bomb. By running local instances of Llama 3.3 or Qwen via vLLM or SGLang, the NSA could theoretically scan every single line of government code in a weekend. But here is the gamble: the tools they are deploying to find holes in the enemy’s wall are the same tools the enemy is using to find holes in theirs.
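To put a rough shape on “a weekend,” here is a small sketch of the throughput arithmetic. Every input is a placeholder I made up for illustration (the size of the codebase, the tokens per line, and the per-replica throughput are not measured or sourced figures), and the function names are hypothetical. It only counts reading each line once as prompt tokens, ignoring generated output, overlapping chunks, and retries.

```python
import math


def required_tokens_per_second(lines: int, tokens_per_line: float,
                               hours: float) -> float:
    """Sustained prompt throughput needed to read every line once."""
    return lines * tokens_per_line / (hours * 3600)


def replicas_needed(required_tps: float, tps_per_replica: float) -> int:
    """Number of model replicas needed at a given per-replica throughput."""
    return math.ceil(required_tps / tps_per_replica)


if __name__ == "__main__":
    # All three inputs below are illustrative assumptions, not real data.
    tps = required_tokens_per_second(
        lines=1_000_000_000,   # hypothetical codebase size
        tokens_per_line=10,    # hypothetical average
        hours=48,              # one weekend
    )
    print(f"needed: {tps:,.0f} tokens/s")
    print(f"replicas at 2,000 tokens/s each: {replicas_needed(tps, 2000)}")
```

Swap in your own numbers; the point is only that “every line in a weekend” turns into a concrete replica count once you commit to a throughput figure, and each replica of a 405B-class model is itself a multi-GPU node.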
By Q3 2025, we will see a massive pivot where the Pentagon abandons the “closed API” dream and instead releases a heavily fine-tuned, government-specific version of an open-weights model. They will realize that the only way to truly secure a top-secret network is to own the weights, the quantization method, and the silicon they sit on.
Using a closed-source API for top-secret intelligence is a security disaster waiting to happen.