Solving Long-Form Coherence in Small Open-Weight LLMs

It is 3:14 AM, and a developer is staring at a terminal window in a state of pure frustration. They’ve spent the last four hours trying to get a local 8B model to write a coherent 3,000-word short story. Everything is going great for the first six hundred tokens, but then the model hits a wall. Suddenly, the prose descends into a repetitive loop, or worse, it decides to wrap up the entire plot in a single, abrupt paragraph that reads like a Wikipedia summary. The hardware is humming, the VRAM is holding, but the intelligence has simply evaporated.

The long-form collapse

Introduction of a guidance mechanism to prevent quality degradation in long-form output.
A method to force small open-weight models to adhere to requested lengths without filling space with fluff.
Reduction of the “drift” effect where models lose the plot as token counts increase.
A framework for bridging the gap between 7B-class models and frontier-scale giants in creative writing.

Why are we still pretending that 8B models can handle a novel? If you’ve spent any time running Llama 3 or Mistral Nemo locally, you know the drill. These models are fantastic for chat, coding, and short summaries, but the moment you ask for a long-form narrative, they behave like a toddler trying to build a skyscraper: the first three floors look great, and then the whole thing tips over. The POLARIS paper identifies exactly this failure mode: small models either choke on the length or sacrifice all coherence to hit a word count.

The solution presented here isn’t about just adding more data or increasing the context window (which is a hardware tax we can’t always afford). Instead, it’s about guiding the model to maintain a structural trajectory. In my view, this is the only logical path for the local-inference community. We cannot simply wait for someone to shrink a 400B model into a 7B footprint without losing the “soul” of the writing. If we want a model that can actually draft a chapter of a book without hallucinating its own ending, we need this kind of architectural steering rather than just hoping the next fine-tune is “smarter.”

From a deployment perspective, this is where things get interesting for those of us with a 3090 or 4090. Most of the current creative writing “fixes” involve massive ensembles or prompt-chaining that kills your tokens-per-second. If Polaris can be baked into a weight set or implemented via a lightweight adapter, we can keep our inference speeds high in Ollama or llama.cpp without needing a Mac M3 Ultra just to maintain a plot point. (And probably not the 16GB versions of those cards, either.) The real test will be seeing if this survives quantization. If the guidance breaks the moment you move to a GGUF Q4_K_M or an EXL2 4.0bpw, then it’s a laboratory curiosity, not a tool for the people.
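For anyone who wants to run that quantization test at home, here is a minimal sketch of one crude way to put a number on the "repetitive loop" failure. It is not the paper's metric, and the function names (`repeated_ngram_ratio`, `windowed_repetition`), the 200-word window, and the choice of trigrams are all my own assumptions. The idea is to score each consecutive chunk of a generated story by how many of its word trigrams are repeats, so a late-output collapse shows up as a spike.

```python
from collections import Counter


def repeated_ngram_ratio(tokens, n=3):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def windowed_repetition(text, window=200, n=3):
    """Score consecutive word windows so loops late in the output stand out.

    The window size and n are arbitrary illustrative defaults.
    """
    words = text.lower().split()
    return [
        repeated_ngram_ratio(words[start:start + window], n)
        for start in range(0, len(words), window)
    ]
```

Generate the same prompt once with the full-precision weights and once with the Q4_K_M or EXL2 build, run both outputs through `windowed_repetition`, and compare where (or whether) the scores climb. It only catches looping, not the "abrupt Wikipedia ending" problem, so treat it as a smoke test rather than a coherence benchmark.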

Compared to the current open-weights pecking order, the gap in long-form writing is the one place where the “frontier” models still hold a massive lead. Llama 3.1 is a beast, but it still suffers from the same structural decay as its peers when pushed past a few pages of prose. My prediction: by Q4, we will see a specialized “Writer’s” fine-tune of a Llama 3 variant using Polaris-style guidance that consistently outperforms the base models in long-form coherence benchmarks. It is a necessary evolution; otherwise, local models remain glorified chatbots that can’t tell a story to save their lives.

The hardware is ready, the weights are available, but the coherence is still missing.

Small models finally have a map to stop them from wandering off a cliff.