Most LLM apps in 2026 ship by vibes — write a prompt change, read a few outputs, decide if it feels better, ship. The “decide if it feels better” step is the problem.
The Control Experiment
Evaluate a baseline prompt against an identical copy of itself. No change. Nothing should regress.
Result: 9 of 12 scenarios judged as regressions. 25 dimension-level regressions across the suite. Pure judge variance, zero actual change to the prompt.
Trust your gut on those outputs and you’d roll back a perfectly good prompt 3 out of 4 times. That’s not iteration — that’s noise dressed up as decision-making.
The fix is an eval harness. Not the thing anyone wants to build.
What This Isn’t
This is application-quality evaluation — does this prompt change make the conversational behavior better or worse than baseline?
Not capability evaluation (METR-style autonomous-task benchmarks). Not safety evaluation (red-team prompt injection). The methods rhyme; the questions are different. Building one doesn’t give you the others.
Architecture — Three Concerns
About 1,000 lines of Python in eval/compare.py + eval/judge_prompt.py. Separate folder, separate dependencies, separate from the application code.
Concern 1: Goldens, Not Transcripts
Twelve scenarios cover the surface area: cold-start onboarding, decision capture, dependency linking, edit flows, vague vs expert engineers, cross-domain edge cases.
Each golden defines what must happen and what must not happen — never the exact wording. String-match fails for conversational AI: many valid paths exist for any input.
```json
{
  "first_message": "...",
  "expected_outcomes": ["Recognizes the domain", "Moves to specific decisions in 2 turns"],
  "anti_patterns": ["Treats a known concept as ambiguous"],
  "key_dimensions": ["pacing", "specificity", "domain-fit", "..."]
}
```
Outputs are LLM-judged against a 17-dimension rubric — pacing, specificity, domain-fit, conversational coherence, and similar axes — scored Pass / Partial / Fail per dimension.
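Concretely, a single judge pass is expected to return a per-dimension structure along these lines (three of the 17 dimensions shown; the rationale field and exact shape are assumptions, the dimension names are the ones listed above):

```json
{
  "pacing": {"verdict": "pass", "rationale": "..."},
  "specificity": {"verdict": "partial", "rationale": "..."},
  "domain-fit": {"verdict": "fail", "rationale": "..."}
}
```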
Goldens grow from bugs. When a real conversation surfaces a regression the eval missed, the broken behavior gets added as an anti-pattern. The suite hardens over time, in proportion to product use.
Concern 2: Noise-Aware Judging
The control experiment proved single-run judging is unusable. The fix:
- Multi-run majority vote. Default: 2 passes per side; --thorough runs 3 passes. Verdicts aggregate by majority — a regression has to be confirmed across runs, not flagged once.
- Confidence threshold. If agreement across runs drops below 60% on a dimension, the dimension is classified Noisy and excluded from the verdict. The eval refuses to take a position when the signal isn't there.
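A minimal sketch of both mechanics, assuming per-run verdicts for one dimension have already been collected; the threshold and labels follow the text above, everything else is illustrative:

```python
from collections import Counter

AGREEMENT_FLOOR = 0.60  # below this, the dimension is classified Noisy

def aggregate_dimension(verdicts: list[str]) -> str:
    """Collapse one dimension's per-run verdicts ('pass'/'partial'/'fail')
    into a single label, or 'noisy' when the runs disagree too much."""
    counts = Counter(verdicts)
    label, top = counts.most_common(1)[0]
    if top / len(verdicts) < AGREEMENT_FLOOR:
        return "noisy"  # excluded from the verdict entirely
    return label        # a regression must be confirmed across runs

# Two passes that disagree: 50% agreement, so Noisy, not a regression.
assert aggregate_dimension(["fail", "pass"]) == "noisy"
# Three passes (--thorough), two agree: 66% agreement, confirmed.
assert aggregate_dimension(["fail", "fail", "pass"]) == "fail"
```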
(Defensive note: a 3-stage JSON parser strips markdown fences, fixes smart quotes / unescaped newlines / trailing commas, retries up to three times. Parse failures used to silently masquerade as regressions. They don’t anymore.)
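The three stages look roughly like this. A sketch, not the harness's exact code; the repair heuristics here are common-case versions of what the note describes:

```python
import json
import re

def parse_judge_json(raw: str) -> dict | None:
    """Stage 1: parse as-is. Stage 2: strip markdown fences. Stage 3:
    repair smart quotes, trailing commas, and raw newlines inside
    strings. A None result is classified Dropped, never a regression;
    the harness also re-queries the judge up to three times first."""
    stripped = _strip_fences(raw)
    for candidate in (raw, stripped, _repair(stripped)):
        try:
            return json.loads(candidate)
        except json.JSONDecodeError:
            continue
    return None

def _strip_fences(text: str) -> str:
    return re.sub(r"^```(?:json)?\s*|```\s*$", "", text.strip())

def _repair(text: str) -> str:
    text = text.replace("\u201c", '"').replace("\u201d", '"')  # smart quotes
    text = re.sub(r",\s*([}\]])", r"\1", text)  # drop trailing commas
    # Escape raw newlines that landed inside double-quoted strings:
    return re.sub(r'"(?:[^"\\]|\\.)*"',
                  lambda m: m.group(0).replace("\n", "\\n"), text)
```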
Concern 3: Regression Taxonomy
Not every regression is equal:
| Class | Rule | Action |
|---|---|---|
| Hard regression | Pass → Fail | Always BLOCK |
| Soft regression | Pass → Partial, or Partial → Fail | REVIEW at 1–2 across the suite; BLOCK at 3+ |
| Noisy | Multi-run agreement < 60% | Excluded from verdict |
| Dropped | Parse error or missing dimension | Excluded from verdict |
Suite-level verdicts:
- SHIP — ≥1 improvement, 0 regressions
- REVIEW — 1–2 soft regressions, human decides
- BLOCK — any hard regression, or 3+ soft
- NEUTRAL — no change either way
That’s the entire decision surface. No reading transcripts. No vibes.
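In code, that decision surface fits in two small functions. A sketch with illustrative names, implementing the rules in the tables above:

```python
def classify(before: str, after: str) -> str:
    """Map one dimension's baseline -> candidate transition to a class.
    Noisy and Dropped dimensions are filtered out before this point."""
    rank = {"pass": 2, "partial": 1, "fail": 0}
    if before == "pass" and after == "fail":
        return "hard"
    if rank[after] < rank[before]:
        return "soft"   # pass -> partial, or partial -> fail
    if rank[after] > rank[before]:
        return "improvement"
    return "unchanged"

def suite_verdict(classes: list[str]) -> str:
    """Aggregate dimension classes across the suite into the verdict."""
    hard = classes.count("hard")
    soft = classes.count("soft")
    if hard or soft >= 3:
        return "BLOCK"
    if soft:                            # 1-2 soft: a human decides
        return "REVIEW"
    if classes.count("improvement"):    # >=1 improvement, 0 regressions
        return "SHIP"
    return "NEUTRAL"
```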
Cost Discipline
Eval is expensive. A full multi-run pass across all 12 scenarios costs ~$2–3 in API calls. Run that on every prompt change and you light money on fire.
Caching cuts ~80% of per-run cost, in two compounding layers. Anthropic prompt caching on the ~4K-token system prompt saves ~60% on input-token cost. A baseline-log cache keyed by a hash of each golden's simulation-relevant fields skips re-simulating the baseline entirely until a golden changes, cutting ~50% of what's left (0.4 × 0.5 = 0.2 of the original cost).
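The baseline-log cache key is small enough to sketch in full, assuming the simulation-relevant fields are the golden's first message plus scripted follow-ups (the followups field is hypothetical):

```python
import hashlib
import json

def baseline_cache_key(golden: dict, baseline_version: str) -> str:
    """Key baseline simulation logs on only the fields that affect the
    simulation, so edits to judge-side fields (expected_outcomes,
    anti_patterns) don't force an expensive re-simulation."""
    sim_fields = {k: golden.get(k) for k in ("first_message", "followups")}
    payload = json.dumps(sim_fields, sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{baseline_version}:{digest}"

# Changing an anti-pattern leaves the key stable; changing the first
# message invalidates it:
g = {"first_message": "hi", "anti_patterns": ["x"]}
k1 = baseline_cache_key(g, "v13")
g["anti_patterns"].append("y")
assert baseline_cache_key(g, "v13") == k1
```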
Tiered evaluation, fail-fast:
| Tier | Cost | Runs | Stop rule |
|---|---|---|---|
| 1 | ~$0.35 | 2–3 targeted scenarios, 1 judge pass | Stop on first failure |
| 2 | ~$1.25 | All 12, 1 judge pass | Stop on first hard regression |
| 3 | ~$2–3 | All 12, 2-pass judging | Only if Tier 2 says SHIP / REVIEW |
A bad prompt change fails at Tier 1 for $0.35 instead of consuming the full $3. CI engineers do this for compute. Almost no one does it for evals.
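The whole ladder is one short function. A sketch reusing suite_verdict() from the taxonomy section; run_one() stands in for the expensive simulate-and-judge step, and its signature and return shape are assumptions:

```python
def run_tiered_eval(scenarios, run_one, targeted_ids):
    """Fail-fast ladder over the tier table above. run_one(scenario, passes)
    simulates and judges one scenario, returning its list of dimension
    classes ('hard' / 'soft' / 'improvement' / 'unchanged')."""
    # Tier 1 (~$0.35): 2-3 scenarios targeted at the change, 1 judge pass,
    # stop on the first failure.
    for s in (s for s in scenarios if s["id"] in targeted_ids):
        if any(c in ("hard", "soft") for c in run_one(s, passes=1)):
            return "BLOCK"
    # Tier 2 (~$1.25): all 12 scenarios, 1 judge pass, stop on the first
    # hard regression.
    tier2 = []
    for s in scenarios:
        classes = run_one(s, passes=1)
        if "hard" in classes:
            return "BLOCK"
        tier2.extend(classes)
    if suite_verdict(tier2) not in ("SHIP", "REVIEW"):
        return suite_verdict(tier2)  # NEUTRAL or BLOCK: skip Tier 3
    # Tier 3 (~$2-3): all 12 scenarios, 2-pass majority judging. Only
    # changes that survived the cheap tiers pay full price.
    return suite_verdict([c for s in scenarios for c in run_one(s, passes=2)])
```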
Versioning and Rollback
src/core/prompt_versions/ contains 10 prompt files (v4 → v13, sequentially numbered). One rule: never overwrite a working prompt. Every change starts a new file. Same principle as a database migration — an in-place ALTER on production state is malpractice.
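Version resolution stays trivial under that rule. A sketch, assuming vN-numbered text files (the extension is a guess):

```python
from pathlib import Path

PROMPT_DIR = Path("src/core/prompt_versions")

def load_prompt(version: int | None = None) -> str:
    """Load a specific prompt version, or the highest-numbered one.
    Rollback is just pinning `version` to an older number; no file
    is ever edited in place."""
    files = sorted(PROMPT_DIR.glob("v*.txt"), key=lambda p: int(p.stem[1:]))
    if version is not None:
        return (PROMPT_DIR / f"v{version}.txt").read_text()
    return files[-1].read_text()
```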
The honest part of the version log is the rollbacks.
Heavy gate frameworks — explicit if/else branches in the system prompt for state transitions, hard-coded refusal patterns for edge cases — got tried and rolled back. Beautiful in theory; in practice, the structural rewrites consistently underperformed lightweight behavioral patches against a known-good baseline. The harness caught it. Patches shipped instead.
Subtract before you add. The default instinct on a regression is to add a constraint, a guard rail, a structural rewrite. The harness data pointed the other way: most quality wins came from removing instructions that conflicted with each other, not adding new ones. The lesson applies to prompts as much as code — but you only learn it if you have a harness to tell you the heavy framework lost.
What This Work Demands
Three habits this kind of work forces, none of which feel natural:
- Distrusting your own outputs by default. The instinct after a prompt change is to read the outputs and decide if they’re better. The control experiment shows why that’s broken — the same prompt against itself looks like a regression 75% of the time. Human judgment on conversational outputs is high-variance, including mine.
- Treating API spend like compute. CI engineers tier their pipelines without thinking about it — fast cheap checks first, expensive ones last. The same logic applies to evals, and skipping it burns money on every change. Not sophisticated; just borrowed from places where the discipline already exists.
- Keeping rollbacks visible. Ten prompt versions where several were rollbacks is a more honest artifact than ten that all “worked.” The rolled-back versions are where the harness earned its keep — proof that the structural rewrites that looked best on paper underperformed against baseline.
Eval-driven prompt work is closer to scientific method than software craft. Hypothesis → controlled comparison → verdict → ship or learn. The harness exists to keep the work honest with itself.
Built solo, alongside the rest of a production AI application — auth, payments, persistence, deployment. The eval system is the part most worth writing about, because it’s the part that revealed which prompt instincts were wrong.
Disclosure boundary: no proprietary prompts, scenario contents, rubric text, or domain specifics. The numbers (LOC, costs, control-experiment results, version count) are real, cross-checked against the source and eval/runs/*.json before publishing.