Lab · Local LLM Arena
What local model wins at what?
Same prompts, same machine, local judge, no cloud APIs. Each Ollama model runs through a suite of up to 178 tests across 16 categories.
| Model | Capabilities | Global | agentic | audio | code | frontend | instruction | long-context | math | multilingual | Tests | tok/s | Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gemma4:31b | Vision, Tools, Thinking | 9.5 | 9.9 | — | 9.1 | 10.0 | 9.7 | 10.0 | 10.0 | 9.5 | 170/178 | 9.3 | 2 h 5 min |
| qwen3.6:latest | Vision, Tools, Thinking | 9.1 | 9.9 | — | 8.9 | 10.0 | 8.7 | 10.0 | 10.0 | 9.1 | 170/178 | 44.2 | 54 min 47 s |
| gemma4:e4b | Vision, Tools, Thinking | 8.7 | 9.5 | 8.2 | 8.2 | 8.7 | 9.1 | 9.2 | 9.0 | 8.8 | 178/178 | 44.4 | 54 min 27 s |
| mistral-small3.2:latest | Vision, Tools | 8.5 | 9.8 | — | 8.6 | 8.8 | 8.8 | 9.1 | 8.0 | 8.5 | 152/160 | 14.7 | 1 h 1 min |
| gemma3:12b | Vision | 8.4 | 9.4 | — | 7.9 | 7.0 | 9.4 | 9.4 | 7.5 | 8.2 | 152/160 | 24.2 | 43 min 54 s |
| jobautomation/OpenEuroLLM-Spanish:latest | Vision | 8.3 | 9.3 | — | 7.8 | 7.4 | 8.6 | 9.4 | 7.9 | 8.1 | 152/160 | 9.6 | 1 h 48 min |
| milkey/Seed-OSS-36B-Instruct:q4_K_M | Tools, Thinking | 8.3 | 9.7 | — | 8.8 | 8.4 | 9.1 | 8.0 | 10.0 | 8.0 | 144/160 | 9.6 | 2 h 9 min |
| qwen3.6:latest · Lemonade | Vision, Tools | 8.3 | — | — | — | — | — | — | — | — | 18/18 | 42.6 | 6 min 52 s |
| qwen3-coder-next:latest | Tools | 8.2 | 9.1 | — | 8.9 | 9.1 | 8.7 | 7.3 | 9.0 | 7.4 | 144/160 | 35.0 | 56 min 48 s |
| deepseek-r1:32b | Thinking | 8.1 | 9.4 | — | 8.4 | 7.8 | 8.7 | 9.9 | 8.0 | 7.9 | 144/160 | 10.9 | 1 h 13 min |
| gemma4:31b · Lemonade | Vision, Tools | 8.1 | — | — | — | — | — | — | — | — | 18/18 | 8.1 | 49 min 20 s |
| gpt-oss:20b | Tools, Thinking | 7.6 | 9.3 | — | 9.0 | 7.2 | 8.0 | 9.9 | 8.2 | 6.4 | 144/160 | 48.2 | 55 min 13 s |
| qwen2.5:7b | Tools | 7.4 | 9.0 | — | 7.5 | 6.9 | 8.6 | 8.4 | 7.2 | 8.0 | 144/160 | 44.0 | 16 min 5 s |
| mistral-nemo:12b | Tools | 7.2 | 9.2 | — | 7.1 | 7.0 | 8.4 | 9.0 | 4.6 | 7.7 | 144/160 | 27.7 | 26 min 31 s |
| qwen2.5vl:7b | Vision | 7.1 | 8.9 | — | 7.0 | 6.3 | 8.7 | 9.3 | 9.8 | 7.5 | 152/160 | 37.8 | 34 min 21 s |
| aya-expanse:8b | Tools | 7.0 | 8.9 | — | 6.5 | 6.2 | 8.8 | 7.8 | 4.3 | 7.5 | 144/160 | 38.1 | 19 min 32 s |
| gemma4:e4b · Lemonade | Vision, Tools | 7.0 | — | — | — | — | — | — | — | — | 18/18 | 11.4 | 2 min 20 s |
FAQ — Local LLM benchmarks
- What is the NCN Local LLM Arena?
  A reproducible benchmark: the same prompts run on every Ollama model on fixed hardware (AMD Strix Halo, 96 GB), scored by a local judge across 16 categories.
- Which local LLM is best overall?
  Rankings change as we add models. The table above shows the current global leader and per-category winners; it is updated whenever new models are benchmarked.
- How are scores calculated?
  Each model runs the same automated tests, and its responses are scored by auto-checks plus a local judge model. Scores are 0–10 per category, combined into a weighted global average. Full transcripts live on Murray's Lab.
- Can I reproduce these benchmarks?
  Yes. Use the same Ollama models, the same test suite, and local-only inference. We publish the methodology and link to raw runs on murrayslab.com/lab/llms/.
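The FAQ describes the global score as a weighted average of 0–10 category scores, but the weights themselves are not published here. The sketch below shows one way such an average could be computed. The function name `weighted_global`, the example weights, the example scores, and the choice to skip categories with no score are all assumptions for illustration, not the arena's actual method.

```python
from typing import Mapping, Optional


def weighted_global(
    scores: Mapping[str, Optional[float]],
    weights: Mapping[str, float],
) -> float:
    """Weighted mean of 0-10 category scores, rounded to one decimal.

    Assumption: a category with no score (shown as a dash in the table,
    e.g. audio for models without audio support) is skipped rather than
    counted as zero. Categories missing from `weights` default to 1.0.
    """
    total = 0.0
    weight_sum = 0.0
    for category, score in scores.items():
        if score is None:
            continue
        weight = weights.get(category, 1.0)
        total += weight * score
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no scored categories")
    return round(total / weight_sum, 1)


# Hypothetical scores and weights, not taken from the arena.
example_scores = {"code": 9.0, "math": 10.0, "audio": None}
example_weights = {"code": 2.0, "math": 1.0}
print(weighted_global(example_scores, example_weights))
```

If unscored categories were instead counted as zero, models without audio support would be penalized in the global column, so the skip-or-zero choice matters when comparing against the published numbers.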
Benchmark rig
- CPU: AMD Ryzen AI Max+ 395 · 16C/32T · Zen 5
- GPU: Radeon 8060S · 40 CUs · 96 GB unified VRAM
- Stack: Ollama 0.23 · Ubuntu 24.04 · ROCm
- Method: Auto-checks + local judge model · full prompt/response logs
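For readers reproducing a run on a similar stack, here is a minimal sketch of sending one prompt to a local Ollama server and deriving a tok/s figure. It assumes Ollama's default endpoint (`http://localhost:11434/api/generate`) and uses the `eval_count` and `eval_duration` (nanoseconds) fields that Ollama returns for non-streaming requests. Whether the arena computes its tok/s column this way is an assumption, and the helper names `tokens_per_second` and `run_prompt` are hypothetical.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"


def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generated tokens divided by generation time in seconds."""
    if eval_duration_ns <= 0:
        raise ValueError("eval_duration_ns must be positive")
    return eval_count / (eval_duration_ns / 1e9)


def run_prompt(model: str, prompt: str) -> dict:
    """Send one non-streaming prompt to a local Ollama server.

    Returns the response text plus a tok/s figure. Requires a running
    Ollama instance with `model` already pulled.
    """
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        body = json.load(reply)
    return {
        "text": body["response"],
        "tok_per_s": tokens_per_second(body["eval_count"], body["eval_duration"]),
    }
```

Calling `run_prompt("qwen2.5:7b", "Say hello.")` against a running server would return the model's text and its generation speed for that single prompt. A benchmark figure would presumably aggregate many such runs.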
Full test catalog & raw data also on Murray's Lab