NCN Labs

Where we stress-test the hype

Reproducible benchmarks, local-only inference, zero cloud APIs. The same rigor we apply to news — applied to models.

Flagship · Local LLM Arena

Which local model wins at what?

17 models · 178 tests · 16 categories · Ollama on Strix Halo.

Same prompts, same machine, local judge. Full rankings across coding, reasoning, tools and vision.

In production

Drafter vs Critic — the NCN writing loop

Two local agents per article: one drafts, one critiques. No cloud APIs — the same debate loop that publishes every headline on this site.
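For readers who want to try the pattern, here is a minimal sketch of a draft-and-critique loop. It is not NCN's production code: the `Agent` type, the prompt wording, the `APPROVE` convention and the round cap are all assumptions made for illustration. Each agent is just a function from prompt text to reply text, so any local model wrapper can be plugged in.

```python
from typing import Callable

# Hypothetical type: an agent is any function that maps a prompt to a reply,
# for example a thin wrapper around a local model call.
Agent = Callable[[str], str]


def debate_loop(brief: str, drafter: Agent, critic: Agent,
                max_rounds: int = 3) -> tuple[str, int]:
    """Alternate draft and critique until the critic approves or rounds run out.

    Returns the final draft and the number of critique rounds used.
    """
    draft = drafter(f"Write an article from this brief:\n{brief}")
    for round_no in range(1, max_rounds + 1):
        critique = critic(
            f"Critique this draft. Reply APPROVE if it is ready.\n{draft}"
        )
        # Assumed convention: the critic signals acceptance with a leading APPROVE.
        if critique.strip().upper().startswith("APPROVE"):
            return draft, round_no
        draft = drafter(
            "Revise the draft to address the critique.\n"
            f"Draft:\n{draft}\nCritique:\n{critique}"
        )
    # Round cap reached: return the latest revision without approval.
    return draft, max_rounds
```

Capping the rounds keeps a stubborn critic from looping forever; what to do with an unapproved draft (publish, hold, escalate) is left to the caller.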

Experiment

Blind agent duels on hot takes

Pair two local models on a controversial headline: same system prompt, opposite stances, and a blind human pick. The results show which model argues better before we trust it in the pipeline.
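The blind part of a duel can be sketched in a few lines. This is an illustrative sketch only: the function names, the `model_a` / `model_b` labels and the win-rate tally are assumptions, not the experiment's actual harness. The idea is to shuffle the two replies before showing them, and keep the mapping back to model names hidden until after the vote.

```python
import random
from collections import Counter


def blind_pair(reply_a: str, reply_b: str,
               rng: random.Random) -> tuple[list[str], list[str]]:
    """Shuffle two replies so the judge cannot tell which model wrote which.

    Returns (shown, key): `shown` is the text in display order, `key` holds the
    matching model labels and stays hidden until the judge has picked.
    """
    entries = [("model_a", reply_a), ("model_b", reply_b)]
    rng.shuffle(entries)
    shown = [text for _, text in entries]
    key = [name for name, _ in entries]
    return shown, key


def win_rates(winners: list[str]) -> dict[str, float]:
    """Turn a list of winning model labels into a win rate per model."""
    counts = Counter(winners)
    total = len(winners)
    return {name: count / total for name, count in counts.items()}
```

After the judge picks position `i` in `shown`, the winner is `key[i]`. Passing in a `random.Random` instance keeps the shuffle reproducible when a seed is fixed.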

Playbook

Local LLM ops — what actually works

Quant picks (Q4 vs Q8 vs FP16), context windows, keep_alive, batching, and when a 7B beats a 32B on Strix Halo — lessons from running NCN 24/7 on Ollama.
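As a concrete starting point, the sketch below shows where two of these knobs live in a request to Ollama's `/api/generate` endpoint: `keep_alive` is a top-level field and the context window is set through `options.num_ctx`. The helper names and the default values (`8192`, `"30m"`) are assumptions chosen for illustration, not recommendations from the playbook, and the URL assumes Ollama is listening on its default local port.

```python
import json
import urllib.request


def build_request(model: str, prompt: str,
                  num_ctx: int = 8192, keep_alive: str = "30m") -> dict:
    """Build a non-streaming /api/generate payload (default values are illustrative)."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        # How long the model stays loaded in memory after this request.
        "keep_alive": keep_alive,
        # Context window size in tokens.
        "options": {"num_ctx": num_ctx},
    }


def generate(payload: dict,
             url: str = "http://localhost:11434/api/generate") -> dict:
    """POST the payload to a local Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the payload builder separate from the network call makes it easy to log or diff the exact settings used for each run.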

In production

Spanish edition lab — translate before publish

Every new post runs through qwen3.6 for full ES translation, native slugs and WebP heroes before deploy. We log failures instead of shipping English by accident.

Benchmark

Quant shootout — same prompt, three weights

Identical prompts across quantization levels on the same GPU. Score output quality vs tokens/sec to find the sweet spot for daily inference.
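One way to turn "quality vs tokens/sec" into a decision is sketched below. Throughput comes from the `eval_count` and `eval_duration` fields (the latter in nanoseconds) that Ollama returns with a completed response. The selection rule is an assumption for illustration, not the benchmark's actual method: take the fastest quant whose quality score is within a tolerance of the best one. The `Run` class, the `sweet_spot` name and the 5% default are hypothetical, and quality scores are assumed to be positive.

```python
from dataclasses import dataclass


@dataclass
class Run:
    quant: str             # label such as "Q4", "Q8" or "FP16"
    quality: float         # judge score on any consistent positive scale
    eval_count: int        # tokens generated
    eval_duration_ns: int  # generation time in nanoseconds

    @property
    def tokens_per_sec(self) -> float:
        return self.eval_count / self.eval_duration_ns * 1e9


def sweet_spot(runs: list[Run], tolerance: float = 0.05) -> Run:
    """Return the fastest run whose quality is within `tolerance` of the best score."""
    best = max(run.quality for run in runs)
    good_enough = [run for run in runs if run.quality >= best * (1 - tolerance)]
    return max(good_enough, key=lambda run: run.tokens_per_sec)
```

Setting `tolerance=0` reduces the rule to "best quality wins"; raising it trades quality for speed.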

Coming soon

ImageGen bench — Flux vs SDXL vs LoRAs

Hero images for NCN articles: same brief, blind human + VLM judge. Which local stack produces usable editorial art without Midjourney?

Coming soon

Pipeline telemetry — live autopublisher stats

Debate rounds, token spend, gen_image latency, translate time and deploy duration — a dashboard for the full NCN cron run.

Coming soon

RAG trust tests — when retrieval lies

Partial evidence, stale chunks, wrong citations. Synthetic corpora to measure how often local models hallucinate despite having the “right” context.