Eighteen tries to get my AI-narrated cast to wear the right clothes, a Flux 2 Klein post-mortem

I needed two recognisable characters across dozens of dark-fantasy scenes. Klein gave me eighteen failed iterations of the same closing - twins, hobbits, photoreal Hollywood stills, a soldier dressed as a beggar - before I stopped fixing the model and started fixing the composition.

Eighteen tries to get my AI-narrated cast to wear the right clothes, a Flux 2 Klein post-mortem

Three weeks into the Morndur rabbit hole, the prose pipeline was running. qwen3.6:27b was writing 600-word dark-fantasy chapters that held tone across cycles. Players liked it. Then I tried to give it pictures. The image
above is iteration 13b of the chapter 1 closing scene. The soldier-smith you see on the right is wearing rags. There were twelve attempts before this one. There were five more after it before I gave up on the model fix and changed the composition instead.

This is the post-mortem of getting character-consistent illustrations
out of Flux 2 Klein 9B on a Mac Mini M4 Pro — without LoRA training, without
fine-tuning, and (mostly) without buying anything new. It involves the
underrated multi-image conditioning of Klein, a DINOv3 dead-end, a
Qwen-Image-Edit-2511 experiment that taught me something by failing, a
vision-loop judge with an opinionated VLM, and the composition shift that
mattered more than every model fix combined.

BTW. Here is a game that is currently running, and you can submit entries at any time!


The cast (what they're supposed to look like)

Morndur has, at the time of writing, two named living characters: Edrin
Vale
(hooded ranger, weathered late-30s, clean-shaven, small iron key on
a cord around his neck) and Tharn Ironfist (bearded soldier-smith with
a war-hammer, dark hair, scarred face, plate-and-leather armour, fur-lined
cloak).

Their canonical portraits, generated once and pinned to the cast page,
look like this:

Cast portrait — Tharn

The challenge: keep these two recognisable across dozens of new scenes in
a consistent ink-wash + parchment style. Easy in principle. Eighteen
iterations in practice.

Drewniany manekin do rysowania, elastyczne ciało z ruchomymi stawami, regulowane stawy, figurki drewniane do szkicowania, malowania i dekoracji. - AliExpress 21
Smarter Shopping, Better Living! Aliexpress.com

The closing scene that wouldn't die

The chapter 1 closing scene needs Edrin and Tharn together. One image.
Both characters. Recognisable. In style. That's the spec. Below is the
partial sequence of what Klein actually produced over two weeks of
prompt-tuning. (Versions skip — there were also v1-v4, v7-v8, v15-v16,
v19-v21, most of which were minor variations of the same failure modes.)

Iteration 5 — too old, near-identical

Both characters too old; both with the
same grey-bearded gravitas; the scene reads as "two elder monks" not
"ranger and soldier-smith." The first appearance of the twin
attractor
— even when Klein gives them different builds, it settles
them on the same emotional/age register.

Iteration 6 — style finally right, identity drifted

Ink-wash + parchment style finally
holds. But Edrin is now a frail elderly scribe with a grimoire
(he's a ranger with a key) and Tharn is in plate armour with a
war-hammer (correct!) but rendered as a stoic photoreal-leaning
blacksmith poster. Style fixed; identity drifted hard.

Iteration 9 — twins emerge, weapon mutates

Both characters now look like the same
brown-haired bearded man, just in different outfits. Plus Tharn's
war-hammer became an axe. The model collapsed two adult males into
one identity, then made up plausible variations on top.

Iteration 11 — perfect twins, school-photo pose

The clearest twin-collapse in the sequence. Same face. Same haircut. Same age. Same expression. Same height. Same beard. Both stare straight at the camera like a 19th-century school photograph. One has an apron and key, one has chest armour and hammer — and that's the entire visual difference. This is what "character collapse" looks like in extremis.

Iteration 12 — full photorealism, Hollywood still

Klein flips into complete photorealism — cinematic atmospheric lighting, leather texture rendered to last pore, faces lit like a Game of Thrones still. The ink-wash style anchor died completely. Bonus: still near-twins. This was the moment I started taking the "Flux's strongest visual
prior is photoreal" lesson seriously.

4 w 1 rysik rysunek Tablet pióro pojemnościowy ekran dotykowy długopisy dla Apple Android telefon inteligentne tablety rysik ołówek - AliExpress 202192403
Smarter Shopping, Better Living! Aliexpress.com

Iteration 13b — the lead image of this post

Style returns to ink-wash. Twins remain. Tharn renders as a vagabond
in rags
, his hammer gone, his armour gone. (Lead image above.) This
was the version where I realised: the problem is not the style anchor,
not the prompt, not the references. The problem is that the model has
no idea how to draw two distinct adult males in the same frame.

Iteration 14 — twins as wandering brothers

Out of the cave, into a gothic ruined city. Same twin-collapse, both characters now look like wandering brothers in greatcoats — neither has armour, neither has a hammer. Tharn's identity is just gone.

Iteration 17 — sixteen tries in, still wrong

Sixteen iterations later. Tharn's armour is kind of there now (the right figure has a leather shoulder piece). But Edrin (left) is now a frail old man in along coat instead of a weathered late-30s ranger. And the twin
resemblance hasn't fully gone. The progress per iteration has
become asymptotic.

After iteration 18 I stopped iterating on this particular shot. The
problem was structural, not parametric.


Other failure modes from the same two weeks

The closing-scene loop above was the spine of the work. In parallel,
the single-character scenes and other compositions threw their own
distinct failures. Briefly:

Tharn as a literal skull-faced ghoul

Klein read "weathered scarred soldier" as "no skin on the face" and produced a skull. Edrin to the left at least has a face.

Photoreal drift, ink-wash anchor destroyed (single-char)

The object scene — Edrin's hand holding the iron key — produced a photorealistic bloodied giant fist with a baroque key, cinematic candlelight. Style anchor buried. The single biggest stylistic failure of the project.

Village blacksmith instead of soldier-smith (portrait)

Klein read "smith" louder than "soldier" and produced a village blacksmith in a leather apron, Diablo-4-screenshot style. The entire cast canon (war-hammer slung, plate-and-leather, scarred face) didn't make it through.

Hobbits at the gate (hero scene)

Both characters rendered at roughly half adult height — hobbits, not Edrin and Tharn — plus a strange double-gate "portal" misinterpretation. Scale collapsed.

Anonymous monk crowd replaces the named cast (hero, alt take)

Edrin and Tharn nowhere. Instead, a crowd of seven anonymous hooded monks. This is what happens when "atmospheric tone" outweighs "named characters" in the prompt budget.

Edrin became a wizard with a grimoire

Tharn fine, soldier with hammer. Edrin rendered as an elderly wizard holding a grimoire because cloak + book is denser in the training set than cloak + key. Distribution drift.

Klein read the proper noun and engraved it

The arch has "IRON MAW" carved into the lintel as a title card because the prompt said "the Iron Maw." Plus, again, twin figures.


Dwupalcowe rękawice malarskie Anti-touch Anti-pollution Anti-brudzące, prawe i lewe ręka, do rysowania ekranu dotykowego tabletu IPad - AliExpress 7
Smarter Shopping, Better Living! Aliexpress.com

Why these specific failures happen

After all of the above I started keeping a one-line note per failure mode.
The patterns were:

failure root cause
skull-face "weathered scarred" extrapolated past skin into "exposed bone" — model interpolates between aged warrior and undead
photoreal drift Flux's strongest visual prior is cinematic photoreal — wins unless ink-wash anchor is first in prompt and weighted hard
village smith "smith" is a stronger token than "soldier-smith" — compound noun splits, more visual half wins
hobbits character height is implicit; without explicit "adult human male, tall" cues, model interpolates toward more "fantasy-scene-y" proportions
monk crowd atmospheric load > named-character load in prompt
Edrin-as-wizard cloak + book is more frequent in training than cloak + key — model collapses to nearest archetype
twin collapse two adult males in the same frame, similar lighting, similar age — model has no token-level mechanism to enforce distinctness. With a single reference image it averages; with text-only it picks one face and renders it twice.
IRON MAW engraved model trained on stock-image captions learns named places get their names written on them

Eight failure modes from one model. None of them are bugs. They're all
the model doing what it was trained to do — interpolate toward the
densest nearby region of training distribution
— when the prompt didn't
pin it firmly enough. The whole image pipeline became, in effect,
learning to pin the prompt firmly enough.

The closing-scene twin collapse needed something different.


Klein 2's underrated trick: native multi-image conditioning

Black Forest Labs shipped Flux 2 Klein 9B in early 2026 with a quiet but
huge feature: you can pass a list of base64 reference images in the
images:[…] array of the OpenAI-compat images/generations endpoint, and
Klein applies a RoPE positional offset (ImageRefScale=10) to each so the
model treats them as distinct visual anchors rather than averaging them.

In plain English: multi-character identity from references, no training
required
. Pass Edrin's portrait + Tharn's portrait, write a prompt for a
scene with both, get a scene where both stay recognisable.

When all the pieces are in place — multi-ref + style anchor leading +
crisp text descriptors + NOT-clauses — Klein produces this on first try:

Klein 2 with multi-ref + every prompt trick described
below. Both characters distinct: Edrin hooded with lantern, Tharn
bearded with war-hammer, iron gate behind. Style anchor (ink wash +
parchment) intact. No twins, no rags, no hobbits.

Two weeks to get here.


The fixes that actually worked

1. STYLE_ANCHOR leading

Flux weights early tokens far more heavily than later ones. Every prompt
now starts with the full ink-wash + parchment + Beksiński + Brom + NOT
photorealistic
style block, then character descriptions, then
composition. This killed photoreal drift (v12) about 80% of the time.

2. NOT-clauses on every identity-prone descriptor

Tharn: bearded, DARK hair, NOT silver-haired, NOT bald, NOT elderly,
       middle-aged, scarred face, war-hammer slung across back,
       plate-and-leather armour, NOT a village apron, NOT casual shirt,
       NOT rags, NOT a wanderer

Verbose. But this is what reliably kept Tharn out of both the village-
blacksmith attractor (v4-ish) and the beggar-vagabond attractor (v13b).

3. Iconic-name scrubbing

Regex-replace proper nouns before sending: Iron Maw → great wrought-iron gate. Plus an explicit NO_TEXT, no letters, no labels guard appended
to every prompt. Killed the title-card bug (Figure 15).

4. DINOv3 as identity gate — rejected after testing

Idea: generate candidate → DINOv3 face embedding → cosine vs canonical
portrait → reject if low.

Cosine numbers on a sample set: same-character 0.40-0.79, different-
character 0.20-0.47.
Distributions overlap. DINOv3 was built for
photorealistic faces; stylised ink-wash sits inside its embedding space
the way every "drawn human" sits — close to each other, far from
photographs. Hard threshold would reject legit matches half the time.

Killed the idea. Kept the lesson.

5. Tharn portrait — finally fixed

After every text fix above, plus multi-ref, plus the explicit
"soldier-smith with plate-and-leather armour, war-hammer slung across
back, NOT a village apron, NOT rags" descriptor, Tharn's solo portrait
finally lands canonically:

Soldier-smith in plate-and-leather, war-hammer slung, scarred face, dark beard, ink-wash + parchment style. Same prompt structure as the failed v1/v2 portrait attempts above. The difference is the NOT-clauses and
multi-ref.


Qwen-Image-Edit-2511: the wrong tool that taught me something

Halfway through, Alibaba shipped Qwen-Image-Edit-2511 — instruction-tuned
image-edit, advertised multi-reference identity preservation. I set up
ComfyUI on the Mac Mini (38 GB of model files, GGUF Q6_K, Lightning LoRA
4-step), wrote a workflow with two reference images and a scene prompt.

Qwen-Image-Edit-2511 result for a 2-character scene. Identity preservation excellent — Edrin and Tharn unmistakable. But the characters look literally pasted — faces transferred, bodies arranged behind them, scene built around the faces. It's a collage, not a render.

Identity excellent, but the model is an edit model, not a generation
model
. It treats refs as identity to preserve literally and the prompt
as "what edit to apply." Style anchor evaporated when pressed:

Same model, Tharn solo with same dark-fantasy ink-wash anchor. Result: comic-book inked panel, bright orange flames.

Same model, Edrin hero. Setting drifted from "underground basalt cathedral" to "Victorian brick sewer tunnel."

Lesson: Qwen-Image-Edit is an edit model, not a generation model.
It holds identity rigidly because that's its job. It doesn't generate
into a scene; it edits a reference toward a description. Wrong
primitive for an every-chapter new-composition workflow.

I kept ComfyUI for one more experiment.

Czytnik e-booków Bigme B7, 4+64GB, 7-calowy ekran E-ink przypominający papier, system Android 14, obsługa Google Play, komfortowy dla oczu czytnik e-booków z funkcją rozmów 4G - AliExpress 44
Smarter Shopping, Better Living! Aliexpress.com

The vision-loop judge — and the only sub-70B VLM that respects JSON

Idea2Img pattern: generate → VLM scores 0-10 → regenerate if score < 7.
Cheap if the VLM is small.

Small VLMs do not, in my testing, emit clean JSON. qwen3-vl:8b, Llava 1.6,
Gemma-vision — all chain-of-thought into prose given format:"json",
sometimes wrapping JSON in commentary, sometimes forgetting the closing
brace.

mistral-small3.2:24b was the only sub-70B vision model that returned
clean parseable JSON on the first try. Repeatably. If you're optimising
a vision-loop pipeline on Apple Silicon, this is the single most useful
sentence in this article.


The hybrid experiment: Klein generates, Qwen-Edit refines

If Klein generates atmospheric scenes but drifts identity, and Qwen-Edit
preserves identity but breaks style — maybe the hybrid works? Klein
produces the scene, Qwen-Edit refines only the faces toward the
canonical references with low denoise.

Denoise sweep on a hard case (Klein's silver-haired Tharn — should be
black-haired):

Denoise 0.3. 4-step Lightning sampling = ~2 effective steps. Nothing happens. Tharn still silver.

Denoise 0.5. Still nothing.

Denoise 0.7. Tharn's hair finally black. Edrin slightly more weathered. Composition preserved, style anchor mostly holds. The sweet spot.

So the hybrid works — at a cost.

step model wall clock
Klein scene Flux 2 Klein 9B ~45 s
Qwen-Edit refine (3 refs, Lightning 4-step) Qwen-Image-Edit-2511 Q6_K ~11 min
Total per scene ~12 min

For a 4-scene chapter, nearly an hour. Decision: keep the hybrid as
a selective fallback for scenes the vision-loop judge flags below 7.
For everything else, the cost isn't worth it.


The paradigm shift that mattered more than any model

After all the model tuning, eighteen iterations of the same closing
scene, two weeks of prompt trickery — the change that finally killed
the twin-collapse problem wasn't technical. It was a composition
rule
:

No medium-shot of two characters. Ever.

Either one character in close-up (no second figure for the model to
collapse into a duplicate), or both characters as wide-distant tiny
silhouettes
(so far away that visual distinctness doesn't matter —
the viewer reads them as "two figures," and prose carries the rest).

This is, embarrassingly, how 1980s tabletop RPG illustrators handled the
same problem. They didn't have multi-character LoRAs either. They drew
party scenes as wide establishing shots with tiny figures, or as solo
close-ups with hands and objects, and let imagination fill in.

Here it is in production. Chapter 1's closing scene — the version that
actually shipped:

Chapter 1 closing — two tiny silhouettes deep in a
vast cave, lit by torch sconces. You can tell there are two of them.
You cannot tell which is which. That's fine. The prose immediately
before this image names them. The image is atmosphere, not
identification.

And the chapter 1 object scene — a single-character close-up where
identity collapse is structurally impossible:

Chapter 1 object scene. The Tarnished Key, in a mortal
hand, ink-wash style intact. (Compare to Figure 10's photoreal bloody
fist with baroque key. Same model. Different prompt structure.)


Chapter 2: all the lessons stacked

The most recent published chapter — The Weight of Iron — uses every
fix above. Hero scene: Tharn in close-up foreground, Edrin as the tiny
figure at the gate in the background.

Chapter 2 hero. Tharn close (war-hammer, plate-and-
leather, scarred face, satchel of herbs at the belt). Edrin reduced
to a small silhouette at the gate behind — recognisable by silhouette,
not by face. Anti-collapse by asymmetric framing: one near, one
far. Never medium-shot 2+.

Chapter 2 portrait of Tharn. Frame-perfect identity match. No hybrid pass needed — Klein got it on the first generation. The village-smith attractor and the rags-vagabond attractor are both dead.

Chapter 2 closing — the Whispering Stairs. Two figures
mid-spiral, small enough that the architecture is the subject and they
are the witnesses. Same wide-distant paradigm.


What I'd tell a past me

  1. Multi-reference conditioning is enough for single-character scenes
    if you lead with style anchor and crisp text descriptors with
    NOT-clauses.
    Don't reach for fine-tuning.
  2. The model's strongest priors will eat your weakest tokens.
    Photoreal beats ink-wash unless ink-wash is first and loud.
    "Smith" beats "soldier-smith" unless you negate the village version
    explicitly. "Weathered scarred" can become "skull" unless you also
    negate "undead, skeletal, ghoul."
  3. For two adult males in the same medium-shot, no amount of
    prompting will reliably make them distinct.
    This is a structural
    limit of single-pass diffusion text-to-image with weak text-side
    character grounding. Either change the composition or use an
    identity-preserving refinement pass.
  4. DINOv3 is not a face-matching tool for stylised illustration.
    Same logic for most distance metrics calibrated on photographs.
  5. Edit models and generation models are not interchangeable. Pick
    the right primitive.
  6. Vision-loop judges work if you pick a model that respects
    format:"json".
    As of mid-2026, on Apple Silicon, that's
    mistral-small3.2:24b.
  7. Hybrid pipelines work, but at compounding cost (~12 min/scene
    for us). Reserve for surgical fixes flagged by a cheap upstream
    signal.
  8. The composition fix beat every model fix. Wide-distant or
    close-up. Never medium-shot multi-character. This is the rule the
    architecture of the image is built around, not the rule the
    architecture of the prompt is built around.

Eighteen iterations of one scene taught me lesson 3 the long way. If
you're building a similar pipeline, take it for free.

The whole thing runs on a single Mac Mini M4 Pro, Ollama serving the
LLMs, ComfyUI serving the (rare) hybrid pass. No cloud, no training,
no GPU rental. Just lots of reading, lots of failed runs (above), and
the slow accumulation of small lessons that, individually, don't look
like much — until you compare iteration thirteen-b to the chapter 2
hero and realise they were generated by the same model two weeks
apart, in the same style, from the same reference images.

If you want to see the system in motion, the game is at
morndur.com. Free, no signup, daily cycles,
narrator about to weave its next chapter with exactly the pipeline
described above.

You always have a choice — support in the way that suits you best!

Buy Me a Coffee

Fuel my creativity with a coffee — every sip keeps this blog running!

Buy Me a Coffee

Support This Blog — Because Heroes Deserve Recognition!

Whether it's a one-time tip or a subscription, your support keeps this blog alive and kicking. Thank you for being awesome!

Tip Once

Hey, Want to Join Me on This Journey? ☕

While I'm brewing my next technical deep-dive (and probably another cup of coffee), why not become a regular part of this caffeinated adventure?

Subscribe
Listed on Blogarama·OnTopList