Morndur.com — From Weekend Idea to Three-Week Rabbit Hole
I Spent More Time on Benchmarks Than Coding. And I'm Not Even a Coder.
A few weeks ago I had what I was sure was a small idea. The setup was simple. A web page tells you the opening of a story. You write a paragraph or two about what your character does next. A few hours pass. An AI reads what you and everyone else wrote, weaves it into the next chapter of the story, and publishes it. Rinse, repeat. A 1980s play-by-mail RPG, but the GM is a 27B-parameter language model running on my own server.
I gave myself a weekend.
I've been running local AI on a Mac mini for a while - about a dozen agents over a FastAPI proxy in front of Ollama, a ChromaDB instance on another box for the RAG-heavy stuff, the whole homelab pipeline that I keep telling my wife "pays for itself in electricity bills, eventually". I know how to wire prompts. I know how to embed text. I know how to ask a 27B model to write 600 words of prose and it will, most of the time, write 600 words of prose.
So how hard could "read N paragraphs, write 1 paragraph back" actually be?
Spoiler: I just spent three weeks on it. The site is live at morndur.com. I have written maybe 200 lines of code with my own hands. The other ~12,000 lines were written by AI, and most of my actual work was deciding what to test and reading the results. By a wide margin, the single biggest chunk of my time went into benchmarks for problems whose names I didn't know existed before I started.
This is the post about those problems.
The Naive Plan
Here's what I sketched on a notepad on day one:
- A Postgres table called submissions. Each row is one player writing one move during a 24-hour cycle.
- At cycle close, dump every submission into a synthesis prompt, ask a 27B model to write the next chapter.
- Save the chapter. Open the next cycle. Done.
Add some scaffolding — pre-filter to block prompt injections and out-of-genre nonsense, a feedback email to opted-in players with their contribution highlighted, basic anti-griefing. A weekend of plumbing.
Cycle 1 with 1 submission? Trivial. The model sees one paragraph and writes a chapter around it.
Cycle 1 with 10 submissions? Still fine. Concatenate them, label them, ship them.
Cycle 1 with 100 submissions? OK, the prompt is getting fat but a 16k context window swallows it.
Cycle 1 with 1000 submissions? That's where the whole project tipped over a cliff I didn't know was there.
The First Wall
Local 27B and 35B-a3b models handle roughly 3000-6000 tokens of useful prompt before quality starts noticeably degrading from what attention researchers politely call "lost in the middle", a name that comes from a 2023 Stanford paper by Liu et al. that I read three times before properly believing. At ~50 tokens per submission paragraph, you fit maybe 100 voices into a single prompt before you start truncating, and even before truncation the model starts treating the middle of the prompt as background noise.
OK, so I need to compress. First instinct: just summarise. Take all 500 submissions, ask an 8B model to summarise them into a paragraph, feed that paragraph to the 27B as a "voice of the players".
That's when I realised I had wandered into a research field.
The technical name for "given N documents, write one good summary that respects all of them" is Multi-Document Summarisation. People have been publishing on this since the late 90s. The basic pattern - Map-Reduce summarisation - is well-trodden: split into clusters, summarise each cluster, then summarise the summaries. Google had a blog post about doing it with Gemini at production scale. LangChain has templates for it.
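The Map-Reduce pattern described above can be sketched in a few lines. This is a minimal illustration, not the code behind the site: `map_reduce_summary`, the `summarise` callable and the `fan_in` parameter are all hypothetical names, and the summariser is injected so that any model call (or a stub, for testing) can stand in for it.

```python
from typing import Callable, Sequence

# Hypothetical type: anything that turns several texts into one short text.
Summariser = Callable[[Sequence[str]], str]


def map_reduce_summary(
    clusters: Sequence[Sequence[str]],
    summarise: Summariser,
    fan_in: int = 8,
) -> str:
    """Summarise each cluster (map), then summarise the summaries (reduce)."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 or the reduce step never shrinks")

    # Map: one short summary per non-empty cluster.
    level = [summarise(cluster) for cluster in clusters if cluster]
    if not level:
        return ""

    # Reduce: merge summaries fan_in at a time until a single one remains.
    while len(level) > 1:
        level = [
            summarise(level[i:i + fan_in])
            for i in range(0, len(level), fan_in)
        ]
    return level[0]


if __name__ == "__main__":
    # Stub summariser so the sketch runs without any model: joins its inputs.
    stub = lambda texts: " | ".join(texts)
    print(map_reduce_summary([["open the door"], ["attack"], ["retreat"]], stub, fan_in=2))
```

Keeping `fan_in` small bounds the size of every individual prompt, which is the whole point of the pattern when the model degrades on long inputs.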
So far so manageable. I went and built the obvious thing: embed every submission, cluster by cosine similarity, write a short summary per cluster with a cheap 8B model, then feed the summaries to the big 27B for the actual chapter prose. Map-Reduce, 240 lines, done.
I generated 100 synthetic submissions to test it. The clustering output was a one-line horror show:
N=100, threshold=0.72 → clusters: [97, 2, 1]
97 of 100 submissions ended up in the same cluster. Two were in a cluster of 2. One was alone. So when I ran the rest of the pipeline, the "summary of the cycle" was effectively "one cluster says X, two odd people say Y" — for input that was deliberately diverse: investigate the door, attack the stranger, follow the noise, look in the chest, talk to the old man, retreat to the campfire, and dozens more.
The clustering was lying to me. Spectacularly.
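For reference, a greedy threshold clustering of the kind described above might look like the sketch below. It is an assumption about the general shape, not the post's actual code: `greedy_cluster` and `cosine` are hypothetical helpers, and comparing each vector against the first member ("seed") of every cluster is one simple choice among several. It also shows how collapse happens: if every vector shares one dominant direction, everything clears the threshold against the first seed.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def greedy_cluster(vectors: Sequence[Sequence[float]], threshold: float = 0.72) -> List[List[int]]:
    """Put each vector into the first cluster whose seed is similar enough,
    otherwise start a new cluster with that vector as its seed."""
    seeds: List[Sequence[float]] = []
    clusters: List[List[int]] = []
    for idx, vec in enumerate(vectors):
        for c, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                clusters[c].append(idx)
                break
        else:
            # No existing seed was close enough.
            seeds.append(vec)
            clusters.append([idx])
    return clusters


if __name__ == "__main__":
    # Toy vectors that share a dominant first component: they all collapse together.
    collapsed = [(1.0, 0.1, 0.0), (1.0, 0.0, 0.1), (1.0, 0.1, 0.1)]
    print([len(c) for c in greedy_cluster(collapsed)])
```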
Cluster Collapse, And A Surprisingly Cheap Fix
I assumed the embedding model was weak. The model is qwen3-embedding:4b, a 2560-dimensional model (which itself was a surprise - the documentation said 2048, but if you actually ask it, you get 2560 floats back; I logged that for posterity in DECISIONS.md). I was about to try alternatives - bge-m3, mxbai-embed-large, nomic-embed-text - when I remembered seeing an Anthropic blog post a while back about something called Contextual Retrieval.
Anthropic's observation, from September 2024: when you embed a paragraph in isolation, the embedding sees the semantic skeleton of the sentence - "subject does action to object" - and skeletons of game-writing submissions are very similar to each other even when the intent is completely different. "I push the iron door open with my shoulder" and "I follow the hooded stranger through the back gate" look like the same shape to an embedder that doesn't know what either sentence is about. Anthropic's fix is to prepend a cheap one-line context to each chunk before embedding it:
[subject: Vorlax · intent: push-the-iron-door-open] I push the iron door open with my shoulder, watching for the second guard.
The pre-filter step in my pipeline was already extracting subject and intent per submission (so the rest of the engine knows who's acting and what they want). So prepending those as a context string before embedding cost me literally zero extra LLM calls. I added five lines to embedding.py, re-ran the same N=100 test:
N=100, threshold=0.72 → clusters: 8, max cluster size 56
N=100, threshold=0.80 → clusters: 19, max cluster size 16
Anthropic reports up to a 49% reduction in retrieval failures in its own tests (that figure is for contextual embeddings combined with contextual BM25); in my tiny corner, it was the difference between unusable and usable. Threshold 0.80 with contextual prefix became the lock-in. Both decisions live in DECISIONS.md with a date and a rationale; future-me will thank present-me.
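The contextual prefix itself is just string assembly. The sketch below is a guess at what those few lines could look like, not the actual contents of embedding.py: `with_context` is a hypothetical name, and turning the intent into a hyphenated slug is an assumption based on the example line shown earlier.

```python
def with_context(subject: str, intent: str, body: str) -> str:
    """Build the text that gets embedded: a one-line context tag, then the raw submission."""
    # Assumption: the intent is rendered as a lowercase, hyphen-joined slug.
    slug = "-".join(intent.lower().split())
    return f"[subject: {subject} · intent: {slug}] {body}"


if __name__ == "__main__":
    print(with_context(
        "Vorlax",
        "push the iron door open",
        "I push the iron door open with my shoulder, watching for the second guard.",
    ))
```

The returned string, rather than the bare submission, is what would be handed to the embedding model, so two submissions with the same sentence shape but different subjects or intents no longer start from the same skeleton.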
This was the first moment I understood that the project was going to be mostly reading, not mostly coding.
The Real Problem: Fairness Under Disagreement
OK, clusters are sensible now. But there's a worse problem hiding behind clustering, and it's the one I most want to write about, because I had no idea it had a name until I tripped over it.
Picture this scenario. Cycle 17. 800 players write some variation of "the party should kill Vorlax, he's been a problem since cycle 12". 100 players write "no, Sela steps in front of him — she said 'uncle' in cycle 14, remember, that meant something". Twenty more write completely orthogonal things.
The naive synthesis answer is: write the majority outcome. Vorlax dies. The 100 voices get a polite email saying "your idea was outvoted in red". The story moves on.
The naive answer is terrible. It silences a thread that the narrator should be staging as drama. If you let the majority always win, the game becomes a popularity contest and the actually interesting dynamic, minority conviction colliding with majority intent, vanishes. Worse, a coordinated brigade of 50 friends could pile on any character they want gone, because there's no mechanism to amplify the dissenting voice.
I went looking for prior art and found a 2026 paper — Faithful Summarisation under Disagreement via Belief-Level Aggregation — that names this problem and gives the rough shape of a solution: merge submissions at the proposition layer, not the string layer. Each submission isn't "amount of X" but "a belief about what should happen". When beliefs conflict, you don't average them — you preserve both and let the narrator render the conflict as a scene.
In plain English: instead of letting 800 voices write the killing blow because they outnumber 100 voices, you write the moment Sela steps between Vorlax and the eight hundred and the chapter is that confrontation. Both groups get represented. The story is more interesting than what either group asked for in isolation.
This is now task #25 in my backlog. I haven't built it yet — the basic engine ships first, the polish ships once I see how real submissions actually fight with each other. But I can already feel the shape of the problem the paper is solving, because I built a synthetic 200-voice test case to see what my current pipeline does without belief aggregation, and the chapter the 27B produces is exactly the bland majority outcome.
Maximal Marginal Relevance, Or: The 1998 Paper That Pays Rent
Adjacent problem: even within "the voices that survive clustering", you have one more selection step. You want to pick a handful of raw submissions to carry forward as spotlights — the solo creative leaps, the unusual angles, the weird thing the third player wrote that nobody else followed up but that's clearly the most interesting submission of the cycle.
If you pick spotlights by score (creativity × novelty), you get a clump of similar high-scoring submissions. They cancel each other out as "spotlights" because they're all making the same point. What you want is diverse picks that maximize informational coverage.
The relevant algorithm here is Maximal Marginal Relevance, Carbonell & Goldstein 1998 — the same MMR that probably ranks search results in whatever tool you used to find this post. It picks documents one at a time, scoring each candidate as a tradeoff between high relevance to the query and low similarity to documents already picked. After you pick the first one, the second one is biased toward "different from the first". The third is biased toward "different from the first two". Etc.
A 28-year-old algorithm, three pages of Python, and it was the cleanest answer to "preserve minority creative voices when most submissions are similar" that I tried.
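A minimal sketch of the MMR selection loop, under stated assumptions: `mmr_select` and `cosine` are hypothetical helpers, `relevance` is whatever per-submission score is already available, and `lam` is the usual trade-off weight (1.0 means pure relevance, 0.0 means pure diversity). It is an illustration of the algorithm, not the code running on the site.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    vectors: Sequence[Sequence[float]],
    relevance: Sequence[float],
    k: int,
    lam: float = 0.7,
) -> List[int]:
    """Pick up to k indices, one at a time, trading relevance against
    similarity to the items already picked."""
    selected: List[int] = []
    remaining = list(range(len(vectors)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            # Redundancy: similarity to the closest item picked so far.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    vecs = [(1.0, 0.0), (0.99, 0.14), (0.0, 1.0)]
    rel = [0.9, 0.85, 0.5]
    # Index 1 scores well but is a near-duplicate of index 0, so index 2 wins the second slot.
    print(mmr_select(vecs, rel, k=2, lam=0.5))
```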
The full pipeline now looks like this when the cycle has more than 40 submissions:
- Pre-filter every submission (8B model, 3-layer guardrails — universal rules, per-season physics, tone cosine).
- Embed every kept submission with contextual prefix.
- Sybil dedup (same cookie token + cosine ≥ 0.92 means one person spam-submitting — fold them to one voice).
- Greedy cluster at cosine 0.80.
- Summarise each cluster (8B model, one short paragraph per cluster, parallel batches).
- Score every submission: priority = (amp × 2) + creativity + (continuity × 0.5) + novelty_bonus.
- Pick top 10 by creativity × novelty as spotlights (raw submissions, not summarised - these are the loners).
- Use MMR to pick 5 more wild cards from submissions not in any cluster representative and not in spotlights.
- Assemble the synthesis prompt: cast + locations + items + echoes-of-canon + cluster summaries (capped at 25) + spotlights + wild cards + a "fairness banner" telling the model not to over-weight any one cluster beyond 30% of chapter prose.
- Send to 27B, get back JSON with chapter prose + state changes + award nominations.
- Run a fairness audit with the 8B model — count cluster representative mentions in the chapter prose, compare against each cluster's actual submission share. If any cluster is over-represented by more than 2×, regenerate the chapter once with an explicit "rebalance" instruction.
That's about 600 lines of orchestration. Each step is individually obvious. The whole thing took me three days of reading (the MDS literature, the NexusSum hierarchical paper, the Anthropic post, half a dozen LangChain examples) and one day of writing, the writing being roughly: "Claude, here are my notes on what I want, please implement step 6 and parallelise the cluster MAP calls".
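Three of the smaller steps in the list above reduce to a few lines each. The sketch below is illustrative only: `Scores`, `priority`, `sybil_dedup` and `over_represented` are hypothetical names, the priority formula and the 0.92 and 2× thresholds are taken from the list, and the fairness check assumes that mentions per cluster have already been counted (by the 8B model in the post's pipeline).

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class Scores:
    amp: float
    creativity: float
    continuity: float
    novelty_bonus: float


def priority(s: Scores) -> float:
    """priority = (amp × 2) + creativity + (continuity × 0.5) + novelty_bonus."""
    return (s.amp * 2) + s.creativity + (s.continuity * 0.5) + s.novelty_bonus


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def sybil_dedup(subs: Sequence[Tuple[str, Sequence[float]]], threshold: float = 0.92) -> List[int]:
    """Keep one voice per (cookie token, near-identical text) pair.
    subs holds (cookie_token, embedding) tuples; returns the indices kept."""
    kept: List[int] = []
    for i, (token, vec) in enumerate(subs):
        is_dup = any(
            subs[j][0] == token and cosine(vec, subs[j][1]) >= threshold
            for j in kept
        )
        if not is_dup:
            kept.append(i)
    return kept


def over_represented(
    mentions: Dict[str, int],
    submissions: Dict[str, int],
    max_ratio: float = 2.0,
) -> List[str]:
    """Clusters whose share of chapter mentions exceeds max_ratio times
    their share of submissions."""
    total_m, total_s = sum(mentions.values()), sum(submissions.values())
    if total_m == 0 or total_s == 0:
        return []
    flagged = []
    for cluster, n_sub in submissions.items():
        if n_sub == 0:
            continue
        mention_share = mentions.get(cluster, 0) / total_m
        if mention_share / (n_sub / total_s) > max_ratio:
            flagged.append(cluster)
    return flagged
```

A non-empty result from `over_represented` would be the trigger for the single regeneration with the "rebalance" instruction.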
Testing The Models, Which Models, Why
Let me back up to the part that took the most calendar time: deciding which model does which thing.
I had four candidates from the start: qwen3:8b (cheap and fast for one-off classifications), qwen3.6:27b (dense, expensive, my default for "write good prose"), qwen3.6:35b-a3b (mixture-of-experts with about 3B active parameters per token, MoE for short, very fast on consumer hardware), and qwen3-embedding:4b (the embedder).
I ran what I called "Phase 2" benchmarks — 30 mock cycles at small N (8 submissions each), three modes per model, six adversarial pre-filter jailbreaks per model, tone-cosine drift tests. The headline number: at N=8, the MoE 35b-a3b wrote chapters in 45-60 seconds where the dense 27B took 180-215 seconds. Identical jailbreak resistance (0 failures on 6 cases for both), identical publish-ready prose quality on a cross-model judge. I picked 35b-a3b as the synthesis primary. Locked it in. Wrote a DECISIONS.md entry. Moved on.
Then I did "Phase 2b" because I suspected Phase 2 wasn't representative of the load that mattered. I ran the same engine at N ∈ {1, 2, 5, 10, 25, 50, 100, 250, 500} with each model, plus a cross-model pipeline matrix (synthesis from one model fed into downstream tasks of another), plus a hierarchical end-to-end run on 1000 synthetic submissions.
The result un-flipped my decision. At N=500 with MapReduce compression, the dense 27B's "whole-picture attention" beat the MoE 35b-a3b's wallclock advantage on output quality, because at the upper end of the prompt budget (5000+ tokens), MoE models' narrower per-token computation starts to show. Cross-model judging confirmed it: 27B-synthesised chapters got rated 4/5 by a 35B judge; 35B-synthesised chapters got rated 3/5 by a 27B judge. The judges disagreed about the floor; they agreed about the ranking.
So I flipped my pick a second time. 27B dense became the synthesis primary, 35b-a3b became the fallback (used when 27B's queue is busy, and for the smaller downstream tasks like award judging and the post-publish curator pass). Another DECISIONS.md entry, explicit about overriding the previous entry.
The pattern repeats. Phase 2 said one thing. Phase 2b said something else. Tomorrow's Phase 3, when I actually have real human submissions instead of synthetic ones, might say a third thing. The DECISIONS log is now 386 lines and counting. Most of those lines are me arguing with myself in writing.
"But Are You Coding?"
I am not. Not in any meaningful sense.
I read carefully. I write notes. I design prompts for Codex/Gemini/Claude in extreme detail. I review every diff. I run the benchmarks myself, mostly because I want to see the numbers come out of the orchestrator at three in the morning, not because I distrust the agent. I keep DECISIONS.md, I keep the README honest about what's done and what's pending, I sketch architecture in plain English first.
Then AI writes the code.
Almost all 12,000 lines of Python (plus the prompts, plus the HTML templates, plus the Postgres ALTER TABLE statements I shipped on prod with sudo -u postgres psql) were generated by an agent following instructions like:
Add post-synthesis profile pass for narrator-introduced characters that state_delta inserted without backstory/role/secret. Per 09-character-lifecycle.md § "When the chapter introduces an extra". Query characters with introduced_cycle = current_cycle AND backstory_brief IS NULL. For each, call profile_service.generate_profile() with submission_body=chapter_md and cast_names=current_alive_cast. Set role/backstory/secret/location only if currently NULL (don't overwrite human-set fields).
That's the actual prompt I gave Claude two weeks ago. Claude returned a 30-line diff that did exactly that. I read it, verified it didn't shadow some other logic I cared about, hit commit.
The lesson I keep coming back to: the bottleneck is knowing what you want and being able to tell whether you got it. Not typing. Not memorising APIs. Not even remembering how to write an async SQLAlchemy 2.0 select. The agent does the typing. My job is the spec, the verification, and the sometimes-painful decision of what to ignore.
Of my last three weeks, conservatively:
- ~40% benchmarks and reading research papers (the Anthropic post, three MDS papers, NexusSum, the Belief-Aggregation paper, the original 1998 MMR paper, the "Lost in the Middle" Stanford paper, the embedding leaderboard, the Apocalypse World GM principles for the storytelling side)
- ~25% writing prompts for the coding agents in my IDE and reviewing diffs
- ~15% prompt engineering for the game itself (the synthesis prompt template, the pre-filter prompt, the character profile prompt, the recap prompt - these are the hot path; they live in sandbox/prompts/ and get tweaked between every benchmark run)
- ~10% writing DECISIONS.md and trying to be honest about which direction I just flipped on and why
- ~10% sysadmin (nginx, systemd, Cloudflare DNS, Postgres role management on the prod VM, swapping a self-signed cert in for CF Full mode at 2 AM because the HTTPS request was hitting a different vhost)
Coding-as-in-typing-code is somewhere inside that 25% of reviewing diffs, when I see something Gemini wrote that I want to nudge by a few characters and it's faster to just edit it myself than write a follow-up prompt. Maybe 2% of my hours.
If you'd told me a year ago that I'd build this much engine in three weeks without writing most of the code, I would have rolled my eyes. I rolled my eyes at people saying that a year ago. But here we are.
What I'd Have Stuck On Without The Reading
The honest answer to "what saved you from a much worse outcome": the literature. Specifically:
- Anthropic's Contextual Retrieval post — without it, my cluster collapse would have sent me on a multi-day embedding model goose chase. I'd have ended up either with a worse model or with the same model and worse results, blaming myself.
- NexusSum (2025) — confirmed that the hierarchical MapReduce pattern I was about to build was, in fact, the right pattern at this scale. Saved me from re-deriving it badly.
- Carbonell & Goldstein, MMR, 1998 — three pages. Solves "preserve minority voices in selection" cleanly. Twenty-eight years old, still pays rent.
- The Belief-Aggregation paper (2026) — articulated a problem I was already feeling but couldn't name, which is exactly the kind of paper that's worth its citations.
- "Lost in the Middle" (Liu et al., 2023) — the empirical floor under "why we compress instead of just buying a longer context window".
If I had to point at one habit that mattered: when I hit something that surprised me, I assumed somebody had already named it. Most of the time, somebody had. The names were just sitting in 2023-2025 arXiv papers waiting to be googled in the right order.
What Now
The site is live. The pipeline is wired end-to-end. The first cycle that has a real human submission will fire at 20:00 CET on whatever day that turns out to be. The narrator is sitting at the gate, waiting.
Things I still haven't built and might never need to: belief-aggregation pass (only matters when a cycle actually has a contested vote in the wild), hybrid BM25+embedding echo retrieval (only matters when canon gets long enough that pure-embedding recall starts missing exact names), and a curator pass that scores chapters for social-post-worthiness and proposes Mastodon and Bluesky highlights - that one I'll build when there's a chapter worth highlighting, which there isn't yet.
There is also a feature called the Night of Long Returning, which is the resurrection lottery: every dead character has, after a two-month mourning period, a 25% chance each month that the lottery opens, with a hard cap of one return per year per world. If the dead character has a bound email, that creator gets a one-click "bring them back" link. If they don't, the narrator decides. I built it last week. I have not yet had a dead character. I have not yet had a character. I am, possibly, slightly ahead of myself. And if you read this far: thank you. This is your reward for reading the whole thing. For normal players this feature is not disclosed ;)
If you want to test it: morndur.com. The Iron Maw is open. Walk in. Name yourself. The Hold remembers names forever, so pick one you'd be willing to be known by in three months.
If you write your move with an email, you'll get the next chapter back with your contribution color-coded inside the prose — green where you led the crowd, blue where you stood alone with a clever idea, yellow where the narrator kept your moment, red where the hive mind outvoted you. We don't show your email or those colours on the public site. They're just yours.
A few links if you want to go down the same rabbit hole:
- Anthropic — Contextual Retrieval (Sep 2024)
- NexusSum — Hierarchical Multi-Agent LLM for Long-Form Narrative Summarisation (2025)
- Faithful Summarisation under Disagreement via Belief-Level Aggregation (2026)
- Lost in the Middle: How Language Models Use Long Contexts (Liu et al., 2023)
- Carbonell & Goldstein, MMR, 1998 — the original PDF
- Apocalypse World GM Principles — the storytelling-side principles that informed the narrator's prompt