Running DeepSeek-V4-Flash on AMD MI300X:…

It is 3:14 AM. A developer is staring at a ROCm installation log that looks like a wall of red text, wondering why they didn’t just spend the money on a H100 and call it a day. The goal was simple: get the new DeepSeek-V4-Flash running on an AMD MI300X without spending the next three weekends rewriting kernels. This is the ritual of the local-weights enthusiast—the willingness to endure absolute misery for the sake of avoiding a proprietary API.

The recent report on bringing DeepSeek-V4-Flash to the MI300X confirms what we’ve suspected about the current state of the hardware war. The MI300X is a monster on paper, and when the software actually cooperates, it is. We are seeing a shift where the raw VRAM advantage of AMD (192GB on the MI300X) starts to outweigh the “it just works” convenience of NVIDIA’s ecosystem. For a model like V4-Flash, which uses a Mixture-of-Experts (MoE) architecture to keep inference lean, having that much headroom is the difference between running a massive context window and constantly hitting OOM errors.

The reality is that the software friction is the real tax we pay for moving away from CUDA. Even with vLLM doing the heavy lifting, getting these models to behave on non-NVIDIA hardware feels like trying to put a professional espresso machine in a tiny apartment kitchen—it technically fits, but you have to move the fridge and the trash can just to make a cup of coffee. Why do we keep doing this to ourselves? Probably because the alternative is paying per token to a company that might change the system prompt tomorrow and break every one of our prompts.

When you compare DeepSeek-V4-Flash to the rest of the open-weights pack, the value proposition is clear. While Llama 3.3 70B is the reliable workhorse and Qwen3 is pushing the boundaries of multilingual logic, the DeepSeek “Flash” series is about throughput. It is designed for the high-velocity pipeline. On a rig like the MI300X, the tokens-per-second numbers are impressive, but the real win is the ability to host multiple instances or massive KV caches without sweating.

The license remains a point of interest. DeepSeek generally stays permissive, which is the only reason the hobbyist community bothers. If they suddenly pivoted to a restrictive “commercial-only” license (and likely for a long time, they won’t), the momentum would vanish overnight. Devs don’t want to worry about legal departments when they are just trying to build a local coding assistant that doesn’t leak their data to a cloud provider.

This is exactly where it falls apart for the home user. The MI300X is an enterprise beast, and most of us are still clinging to 3090s and 4090s. For the 24GB VRAM crowd, the “Flash” designation is a tease unless you are running heavily quantized versions. We are talking GGUF Q4_K_M or EXL2 if you want to see anything resembling acceptable speed. If you’re running this via Ollama or llama.cpp on a Mac M3 Ultra, you’re in the goldilocks zone, but the Windows/Linux GPU crowd is still fighting for every megabyte.

The gap between “it runs” and “it is usable” is wide. To get V4-Flash to feel snappy on consumer gear, we need the quantization community to do their thing. By Q4, we’ll see an optimized GGUF that brings the tokens-per-second on a 4090 into a range that actually makes sense for real-time agents. Until then, the MI300X results are basically a glimpse into a future where we aren’t held hostage by a single chip vendor.

AMD is finally in the room, but they’re still tripping over the rug.

Related coverage

Google’s Coral Board: Local Gemma 3 Execution and the Hardware Gap

LetinAR and the Hardware Bottleneck of AI Glasses

Nvidia RTX Spark: Breaking the VRAM Wall for Local AI Agents

Figure AI’s Humanoid Robots: Marketing Performance vs. Technical Reality