Lab · Local LLM Arena

What local model wins at what?

Same prompts, same machine, local judge. No cloud APIs. Every model in Ollama runs through 178 tests across 16 categories.

17 models · 178 tests · AMD Strix Halo · 96 GB VRAM · Ollama 0.23

#1 9.5 gemma4:31b · 31.3B · 262K ctx
#2 9.1 qwen3.6:latest · 36.0B · 262K ctx
#3 8.7 gemma4:e4b · 8.0B · 131K ctx

Model | Global, agentic, audio, code, frontend, instruction, long-context, math, multilingual | Tests | tok/s | Time
gemma4:31b · 31.3B · Q4_K_M · 18.5 GB · Vision, Tools, Thinking | 9.5 9.9 9.1 10.0 9.7 10.0 10.0 9.5 | 170/178 | 9.3 | 2 h 5 min
qwen3.6:latest · 36.0B · Q4_K_M · 22.3 GB · Vision, Tools, Thinking | 9.1 9.9 8.9 10.0 8.7 10.0 10.0 9.1 | 170/178 | 44.2 | 54 min 47 s
gemma4:e4b · 8.0B · Q4_K_M · 8.9 GB · Vision, Tools, Thinking | 8.7 9.5 8.2 8.2 8.7 9.1 9.2 9.0 8.8 | 178/178 | 44.4 | 54 min 27 s
mistral-small3.2:latest · 24.0B · Q4_K_M · 14.1 GB · Vision, Tools | 8.5 9.8 8.6 8.8 8.8 9.1 8.0 8.5 | 152/160 | 14.7 | 1 h 1 min
gemma3:12b · 12.2B · Q4_K_M · 7.6 GB · Vision | 8.4 9.4 7.9 7.0 9.4 9.4 7.5 8.2 | 152/160 | 24.2 | 43 min 54 s
jobautomation/OpenEuroLLM-Spanish:latest · 12.2B · Q4_K_M · 7.6 GB · Vision | 8.3 9.3 7.8 7.4 8.6 9.4 7.9 8.1 | 152/160 | 9.6 | 1 h 48 min
milkey/Seed-OSS-36B-Instruct:q4_K_M · 36.2B · Q4_K_M · 20.3 GB · Tools, Thinking | 8.3 9.7 8.8 8.4 9.1 8.0 10.0 8.0 | 144/160 | 9.6 | 2 h 9 min
qwen3.6:latest · Lemonade · GGUF · 20.1 GB · Vision, Tools | 8.3 | 18/18 | 42.6 | 6 min 52 s
qwen3-coder-next:latest · 79.7B · Q4_K_M · 48.2 GB · Tools | 8.2 9.1 8.9 9.1 8.7 7.3 9.0 7.4 | 144/160 | 35.0 | 56 min 48 s
deepseek-r1:32b · 32.8B · Q4_K_M · 18.5 GB · Thinking | 8.1 9.4 8.4 7.8 8.7 9.9 8.0 7.9 | 144/160 | 10.9 | 1 h 13 min
gemma4:31b · Lemonade · GGUF · 17.0 GB · Vision, Tools | 8.1 | 18/18 | 8.1 | 49 min 20 s
gpt-oss:20b · 20.9B · MXFP4 · 12.8 GB · Tools, Thinking | 7.6 9.3 9.0 7.2 8.0 9.9 8.2 6.4 | 144/160 | 48.2 | 55 min 13 s
qwen2.5:7b · 7.6B · Q4_K_M · 4.4 GB · Tools | 7.4 9.0 7.5 6.9 8.6 8.4 7.2 8.0 | 144/160 | 44.0 | 16 min 5 s
mistral-nemo:12b · 12.2B · Q4_0 · 6.6 GB · Tools | 7.2 9.2 7.1 7.0 8.4 9.0 4.6 7.7 | 144/160 | 27.7 | 26 min 31 s
qwen2.5vl:7b · 8.3B · Q4_K_M · 5.6 GB · Vision | 7.1 8.9 7.0 6.3 8.7 9.3 9.8 7.5 | 152/160 | 37.8 | 34 min 21 s
aya-expanse:8b · 8.0B · Q4_K_M · 4.7 GB · Tools | 7.0 8.9 6.5 6.2 8.8 7.8 4.3 7.5 | 144/160 | 38.1 | 19 min 32 s
gemma4:e4b · Lemonade · GGUF · 5.2 GB · Vision, Tools | 7.0 | 18/18 | 11.4 | 2 min 20 s

FAQ — Local LLM benchmarks

What is the NCN Local LLM Arena?
A reproducible benchmark: the same prompts run on every Ollama model on fixed hardware (AMD Strix Halo, 96 GB), scored by a local judge across 16 categories.
Which local LLM is best overall?
Rankings change as we add models. See the table above for the current global leader and per-category winners; it is updated whenever new models are benchmarked.
How are scores calculated?
Each model runs the same automated tests, and its responses are scored by automatic checks plus a local judge model. Scores are 0–10 per category, combined into a weighted global average. Full transcripts live on Murray's Lab.
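As a rough illustration of a weighted global average over 0–10 category scores, here is a minimal Python sketch. The `global_score` helper, the category weights, and the example values are all hypothetical; the arena's actual weights are not stated here. Skipping unscored categories and renormalizing the remaining weights is likewise an assumption, not the confirmed method.

```python
from typing import Mapping, Optional


def global_score(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> float:
    """Weighted mean of 0-10 category scores, skipping unscored categories."""
    total = 0.0
    weight_sum = 0.0
    for category, score in scores.items():
        if score is None:
            # Assumption: a category the model could not run is left out
            # and the remaining weights are renormalized.
            continue
        weight = weights.get(category, 1.0)  # hypothetical default weight
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no scored categories")
    return round(total / weight_sum, 1)


# Illustrative values only, not taken from the leaderboard.
example_scores = {"code": 9.0, "math": 10.0, "audio": None}
example_weights = {"code": 2.0, "math": 1.0, "audio": 1.0}
result = global_score(example_scores, example_weights)
```

With these made-up inputs, `code` counts twice as much as `math` and the unscored `audio` category is ignored, so `result` is 9.3.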
Can I reproduce these benchmarks?
Yes. Same Ollama models, same test suite, local-only inference. We publish methodology and link to raw runs on murrayslab.com/lab/llms/.
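For a sense of what a single local run can look like, the sketch below sends one prompt to Ollama's local REST endpoint and derives a tokens-per-second figure from the `eval_count` and `eval_duration` fields of the response. The helper names `run_prompt` and `tokens_per_second` are hypothetical, and whether the arena measures tok/s exactly this way is an assumption; the published test suite and judge are not reproduced here.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def run_prompt(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming prompt to a local Ollama server and time it."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return {
        "response": body["response"],
        "tok_per_s": tokens_per_second(body["eval_count"], body["eval_duration"]),
    }
```

Calling `run_prompt("gemma4:e4b", "Write a haiku about benchmarks.")` requires a running local Ollama server with that model already pulled; nothing in this sketch contacts a cloud API.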

Benchmark rig

  • CPU: AMD Ryzen AI Max+ 395 · 16C/32T · Zen 5
  • GPU: Radeon 8060S · 40 CUs · 96 GB unified VRAM
  • Stack: Ollama 0.23 · Ubuntu 24.04 · ROCm
  • Method: Auto-checks + local judge model · full prompt/response logs