What is the NCN Local LLM Arena?

A reproducible benchmark that runs the same prompts across local Ollama models on fixed hardware, scored by a local judge across 16 categories.

Which local LLM has the best overall score?

Rankings change as new models are tested. See the live leaderboard at neuralcorenews.com/labs/llms/ for current global and per-category winners.

How are local LLM benchmarks scored?

Each model runs identical tests with automated checks plus a local judge model. Scores are 0–10 per category with a weighted global score.
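The weighted global score can be sketched as a weighted mean of the per-category scores. This is only an illustration: the weights below are hypothetical, since the actual weights are not listed here. Skipping categories a model was not tested on is also an assumption, not a confirmed Arena rule.

```python
from typing import Mapping, Optional


def weighted_global(scores: Mapping[str, Optional[float]],
                    weights: Mapping[str, float]) -> float:
    """Weighted mean of 0-10 category scores.

    Assumption: categories with no score (None) are skipped and the
    remaining weights are renormalised.
    """
    total = 0.0
    weight_sum = 0.0
    for category, score in scores.items():
        if score is None:
            continue  # untested category, e.g. audio for a text-only model
        w = weights.get(category, 1.0)  # hypothetical default weight of 1.0
        total += w * score
        weight_sum += w
    if weight_sum == 0:
        raise ValueError("no scored categories")
    return total / weight_sum


# Hypothetical weights and scores, for illustration only.
example_weights = {"code": 2.0, "math": 1.0, "audio": 1.0}
example_scores = {"code": 9.0, "math": 10.0, "audio": None}
print(round(weighted_global(example_scores, example_weights), 2))  # 9.33
```

With these made-up numbers the audio category is ignored, so the result is (2.0 × 9.0 + 1.0 × 10.0) / 3.0.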


Which local model wins at what?

Name: NCN Local LLM Arena — benchmark results
Creator: NeuralCoreNews

Same prompts, same machine, local judge. No cloud APIs. Each Ollama model runs up to 178 tests across 16 categories.

17 models · 178 tests · AMD Strix Halo · 96 GB VRAM · Ollama 0.23

#1 · 9.5 · gemma4:31b · 31.3B · 262K ctx
#2 · 9.1 · qwen3.6:latest · 36.0B · 262K ctx
#3 · 8.7 · gemma4:e4b · 8.0B · 131K ctx


Model	Global	Agentic	Audio	Code	Frontend	Instruction	Long-context	Math	Multilingual	Tests	tok/s	Time
gemma4:31b 31.3B · Q4_K_M · 18.5 GB · Vision, Tools, Thinking	9.5	9.9	n/a	9.1	10.0	9.7	10.0	10.0	9.5	170/178	9.3	2 h 5 min
qwen3.6:latest 36.0B · Q4_K_M · 22.3 GB · Vision, Tools, Thinking	9.1	9.9	n/a	8.9	10.0	8.7	10.0	10.0	9.1	170/178	44.2	54 min 47 s
gemma4:e4b 8.0B · Q4_K_M · 8.9 GB · Vision, Tools, Thinking	8.7	9.5	8.2	8.2	8.7	9.1	9.2	9.0	8.8	178/178	44.4	54 min 27 s
mistral-small3.2:latest 24.0B · Q4_K_M · 14.1 GB · Vision, Tools	8.5	9.8	n/a	8.6	8.8	8.8	9.1	8.0	8.5	152/160	14.7	1 h 1 min
gemma3:12b 12.2B · Q4_K_M · 7.6 GB · Vision	8.4	9.4	n/a	7.9	7.0	9.4	9.4	7.5	8.2	152/160	24.2	43 min 54 s
jobautomation/OpenEuroLLM-Spanish:latest 12.2B · Q4_K_M · 7.6 GB · Vision	8.3	9.3	n/a	7.8	7.4	8.6	9.4	7.9	8.1	152/160	9.6	1 h 48 min
milkey/Seed-OSS-36B-Instruct:q4_K_M 36.2B · Q4_K_M · 20.3 GB · Tools, Thinking	8.3	9.7	n/a	8.8	8.4	9.1	8.0	10.0	8.0	144/160	9.6	2 h 9 min
qwen3.6:latest · Lemonade · GGUF · 20.1 GB · Vision, Tools	8.3	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a	18/18	42.6	6 min 52 s
qwen3-coder-next:latest 79.7B · Q4_K_M · 48.2 GB · Tools	8.2	9.1	n/a	8.9	9.1	8.7	7.3	9.0	7.4	144/160	35.0	56 min 48 s
deepseek-r1:32b 32.8B · Q4_K_M · 18.5 GB · Thinking	8.1	9.4	n/a	8.4	7.8	8.7	9.9	8.0	7.9	144/160	10.9	1 h 13 min
gemma4:31b · Lemonade · GGUF · 17.0 GB · Vision, Tools	8.1	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a	18/18	8.1	49 min 20 s
gpt-oss:20b 20.9B · MXFP4 · 12.8 GB · Tools, Thinking	7.6	9.3	n/a	9.0	7.2	8.0	9.9	8.2	6.4	144/160	48.2	55 min 13 s
qwen2.5:7b 7.6B · Q4_K_M · 4.4 GB · Tools	7.4	9.0	n/a	7.5	6.9	8.6	8.4	7.2	8.0	144/160	44.0	16 min 5 s
mistral-nemo:12b 12.2B · Q4_0 · 6.6 GB · Tools	7.2	9.2	n/a	7.1	7.0	8.4	9.0	4.6	7.7	144/160	27.7	26 min 31 s
qwen2.5vl:7b 8.3B · Q4_K_M · 5.6 GB · Vision	7.1	8.9	n/a	7.0	6.3	8.7	9.3	9.8	7.5	152/160	37.8	34 min 21 s
aya-expanse:8b 8.0B · Q4_K_M · 4.7 GB · Tools	7.0	8.9	n/a	6.5	6.2	8.8	7.8	4.3	7.5	144/160	38.1	19 min 32 s
gemma4:e4b · Lemonade · GGUF · 5.2 GB · Vision, Tools	7.0	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a	18/18	11.4	2 min 20 s
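As a rough illustration of reading per-category winners off the table, the sketch below takes three rows and three columns copied from it and returns every model tied for the top score in a category. The `rows` structure and the `category_winners` helper are hypothetical names chosen for this example, not part of the Arena's tooling.

```python
from typing import Dict, List, Optional

Scores = Dict[str, Optional[float]]

# Three rows copied from the table; None marks a cell with no score.
rows: Dict[str, Scores] = {
    "gemma4:31b": {"code": 9.1, "math": 10.0, "audio": None},
    "qwen3.6:latest": {"code": 8.9, "math": 10.0, "audio": None},
    "gemma4:e4b": {"code": 8.2, "math": 9.0, "audio": 8.2},
}


def category_winners(table: Dict[str, Scores], category: str) -> List[str]:
    """Return every model tied for the top score in one category."""
    scored = {model: s[category] for model, s in table.items()
              if s.get(category) is not None}
    if not scored:
        return []  # nobody was tested in this category
    best = max(scored.values())
    return sorted(model for model, value in scored.items() if value == best)


print(category_winners(rows, "code"))   # ['gemma4:31b']
print(category_winners(rows, "math"))   # ['gemma4:31b', 'qwen3.6:latest']
print(category_winners(rows, "audio"))  # ['gemma4:e4b']
```

Returning a list rather than a single name keeps ties visible, which matters here because several models score 10.0 in the same category.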

FAQ — Local LLM benchmarks

What is the NCN Local LLM Arena?
A reproducible benchmark: the same prompts run on every Ollama model on fixed hardware (AMD Strix Halo, 96 GB), scored by a local judge across 16 categories.

Which local LLM is best overall?
Rankings change as we add models. Use the table above for the current global leader and per-category winners; it is updated whenever new models are benchmarked.

How are scores calculated?
Each model runs the same tests, graded by automated checks plus a local judge model. Scores are 0–10 per category, combined into a weighted global average. Full transcripts are on Murray's Lab.

Can I reproduce these benchmarks?
Yes. Use the same Ollama models and the same test suite with local-only inference. We publish the methodology and link to raw runs at murrayslab.com/lab/llms/.
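For anyone reproducing a run locally, here is a minimal sketch of sending one prompt to a local Ollama server and deriving a tok/s figure from the `eval_count` and `eval_duration` fields of the `/api/generate` response. It assumes Ollama is listening on its default port 11434. Whether the Arena computes its tok/s column this way is an assumption, and the prompt and the `main` helper are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def generate(model: str, prompt: str) -> dict:
    """Send one non-streaming prompt to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    req = urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def tokens_per_second(result: dict) -> float:
    """Generation speed: eval_count tokens over eval_duration nanoseconds."""
    seconds = result["eval_duration"] / 1e9
    return result["eval_count"] / seconds


def main() -> None:
    # Requires a running Ollama server with the model already pulled.
    out = generate("gemma4:e4b", "Reply with the single word: ok")
    print(out["response"])
    print(f"{tokens_per_second(out):.1f} tok/s")
```

Calling `main()` performs one real request. The speed helper can be checked offline with a hand-built dictionary such as `{"eval_count": 442, "eval_duration": 10_000_000_000}`.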

Benchmark rig

CPU: AMD Ryzen AI Max+ 395 · 16C/32T · Zen 5
GPU: Radeon 8060S · 40 CUs · 96 GB unified VRAM
Stack: Ollama 0.23 · Ubuntu 24.04 · ROCm
Method: Auto-checks + local judge model · full prompt/response logs

The full test catalog and raw data are also on Murray's Lab (murrayslab.com/lab/llms/).