Demo mock: the money-shot
21st Jun 2026
Mock the demo before building it
The cold-email asset is a CLI run-through (the spec) — which means it’s just text and timing, and we can fake it convincingly before the engine exists. A mock buys us two things:
- It doubles as the build spec: the transcript below is the exact output the real testpath run has to reproduce.
- It de-risks the message before the product: we can test whether the pitch lands (on a few friendly contacts, during inbox warmup) before sinking weeks into the statistics engine.
What it does not do: prove the product works. A resonant gif with no math behind it is a marketing asset, not a company — so the engine still has to follow, and we don’t imply a live tool to anyone we can’t back up. The numbers below are plausible placeholders, chosen to be honest about what the real thing would show.
The money-shot
Two runs, side by side — the whole pitch is the contrast.

The transcripts below are the same two runs in text (the build’s acceptance target):
Run 1 — catches a regression a single-run eval misses:
$ testpath run --scorer ./refund_agent.py --cases suite.jsonl --baseline v1.4 --candidate v1.5
testpath 0.1.0 · wrapping ./refund_agent.py · 24 cases · α=0.05
v1.4 (baseline) 24 cases · 312 runs · 0.94M tok
v1.5 (candidate) 24 cases · 298 runs · 0.89M tok
case                        v1.4   v1.5   Δ      verdict
----------------------------------------------------------------------
refund_within_policy        0.98   0.97   -0.01  ok
escalate_angry_customer     0.95   0.94   -0.01  ok
decline_out_of_policy       0.91   0.78   -0.13  REGRESSION (p=0.004)
multi_item_partial_refund   0.88   0.87   -0.01  ok
verify_identity_before_pii  0.93   0.92   -0.01  ok
----------------------------------------------------------------------
GATE: FAIL · 1 regression · exit 1
cost: 610 runs · 1.83M tokens · ≈ $5.10 @ $2.8/M · 42% of a fixed-30 sweep (≈ $7 saved)
note: a single-run eval scored decline_out_of_policy PASS on v1.5.
across 18 runs it passed 78% vs 91% — a real drop (95% CI, p=0.004), not noise.
Run 2 — ignores noise a single-run eval false-flags:
$ testpath run --scorer ./refund_agent.py --cases suite.jsonl --baseline v1.5 --candidate v1.6
testpath 0.1.0 · wrapping ./refund_agent.py · 24 cases · α=0.05
v1.5 (baseline) 24 cases · 286 runs · 0.86M tok
v1.6 (candidate) 24 cases · 274 runs · 0.82M tok
case                        v1.5   v1.6   Δ      verdict
----------------------------------------------------------------------
decline_out_of_policy       0.78   0.82   +0.04  ok
tone_match_brand            0.86   0.81   -0.05  noise (within variance, p=0.21)
refund_within_policy        0.97   0.98   +0.01  ok
----------------------------------------------------------------------
GATE: PASS · 0 regressions · 1 flagged-but-noise · exit 0
cost: 560 runs · 1.68M tokens · ≈ $4.70 @ $2.8/M · 39% of a fixed-30 sweep (≈ $7 saved)
note: a single-run eval scored tone_match_brand FAIL on v1.6.
across 22 runs the 0.05 dip sits inside run-to-run variance — not a regression.
That’s the teardown pain — “your eval scores one run; it can’t tell a regression from noise” — reproduced and resolved on screen.
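Since the transcripts double as the build spec, it may help to pin down how the verdict column and the GATE line relate. The sketch below is one possible reading, not the engine: `Row`, `verdict` and `gate` are hypothetical names, and the rule that a case carries a p-value only when a drop was flagged for testing is an assumption inferred from which rows show one. Only the rows visible in the transcripts are included.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Row:
    case: str
    delta: float               # candidate score minus baseline score
    p_value: Optional[float]   # None = never flagged for testing (assumption)


def verdict(row: Row, alpha: float = 0.05) -> str:
    """Map one case to the text shown in the verdict column."""
    if row.p_value is None:
        return "ok"
    if row.delta < 0 and row.p_value < alpha:
        return f"REGRESSION (p={row.p_value:g})"
    return f"noise (within variance, p={row.p_value:g})"


def gate(rows: List[Row], alpha: float = 0.05) -> Tuple[str, int]:
    """Build the GATE line and the process exit code from all verdicts."""
    verdicts = [verdict(r, alpha) for r in rows]
    regressions = sum(v.startswith("REGRESSION") for v in verdicts)
    noisy = sum(v.startswith("noise") for v in verdicts)
    status, code = ("FAIL", 1) if regressions else ("PASS", 0)
    plural = "" if regressions == 1 else "s"
    parts = [f"GATE: {status}", f"{regressions} regression{plural}"]
    if noisy:
        parts.append(f"{noisy} flagged-but-noise")
    parts.append(f"exit {code}")
    return " · ".join(parts), code


# Rows as shown in the two mock transcripts.
run1 = [
    Row("refund_within_policy", -0.01, None),
    Row("escalate_angry_customer", -0.01, None),
    Row("decline_out_of_policy", -0.13, 0.004),
    Row("multi_item_partial_refund", -0.01, None),
    Row("verify_identity_before_pii", -0.01, None),
]
run2 = [
    Row("decline_out_of_policy", +0.04, None),
    Row("tone_match_brand", -0.05, 0.21),
    Row("refund_within_policy", +0.01, None),
]

print(gate(run1)[0])
print(gate(run2)[0])
```

Under these assumptions the two calls reproduce the GATE lines of Run 1 and Run 2. How the p-values themselves are computed is left to the statistics engine and is deliberately not modelled here.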
And the cost: line pre-empts the obvious objection — won’t running everything N
times cost a fortune? Cost-aware sequential stopping spends ≈ 40% of a naive fixed-N
sweep, so the whole gated comparison lands around $5 — the
token economy made visible inside the pitch itself.
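The arithmetic behind the cost: line can be checked with a few lines of Python. This is a sketch under stated assumptions: "fixed-30" is read as 30 runs per case for each of the two versions, and the fixed sweep is priced at the same average tokens per run as the observed runs. The function name `cost_line` and its parameters are hypothetical.

```python
def cost_line(runs: int, tokens_m: float, cases: int = 24, versions: int = 2,
              fixed_n: int = 30, usd_per_m: float = 2.8) -> str:
    """Rebuild the mock's cost line from run and token totals."""
    fixed_runs = cases * versions * fixed_n       # 24 * 2 * 30 = 1440 runs
    share = runs / fixed_runs                     # fraction of the naive sweep
    spent = tokens_m * usd_per_m                  # dollars actually spent
    # Assumption: the fixed sweep uses the same average tokens per run.
    fixed_cost = tokens_m / runs * fixed_runs * usd_per_m
    saved = fixed_cost - spent
    return (
        f"cost: {runs} runs · {tokens_m:.2f}M tokens · "
        f"≈ ${round(spent, 1):.2f} @ ${usd_per_m}/M · "
        f"{share:.0%} of a fixed-{fixed_n} sweep (≈ ${saved:.0f} saved)"
    )


run1_cost = cost_line(312 + 298, 0.94 + 0.89)   # Run 1: v1.4 + v1.5 totals
run2_cost = cost_line(286 + 274, 0.86 + 0.82)   # Run 2: v1.5 + v1.6 totals
print(run1_cost)
print(run2_cost)
```

With those assumptions both cost: lines in the transcripts come out as written, so the placeholder numbers are at least internally consistent: 610 and 560 runs against a 1440-run sweep give 42% and 39%.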
Rendering the gif
The mock lives in public/demo/: an executable testpath (canned output, colors on a
TTY, plain text when piped) and a demo.tape for vhs.
cd src/docs/public/demo
brew install vhs # ttyd + ffmpeg under the hood
vhs demo.tape # -> testpath.gif
The rendered testpath.gif (embedded at the top) is committed next to the tape —
re-run vhs demo.tape to regenerate it after any change to the mock’s output.
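For readers who have not seen the mock, the "colors on a TTY, plain text when piped" behaviour can be sketched as follows. This is not the contents of the committed testpath executable: the names `CANNED`, `paint` and `main`, the colour choices, and the two sample lines are illustrative assumptions. The only mechanism relied on is the standard `isatty()` check on the output stream.

```python
import io
import sys
from typing import List, Optional, TextIO, Tuple

RED, DIM, RESET = "\033[31m", "\033[2m", "\033[0m"

# Canned output: (text, ANSI colour or None). Two lines from Run 1 as a sample.
CANNED: List[Tuple[str, Optional[str]]] = [
    ("testpath 0.1.0 · wrapping ./refund_agent.py · 24 cases · α=0.05", DIM),
    ("GATE: FAIL · 1 regression · exit 1", RED),
]


def paint(text: str, color: Optional[str], use_color: bool) -> str:
    """Wrap text in ANSI codes only when colour is enabled."""
    if use_color and color:
        return f"{color}{text}{RESET}"
    return text


def main(stream: TextIO = sys.stdout) -> int:
    """Print the canned transcript; return the exit code the mock would use."""
    use_color = stream.isatty()   # False when piped or redirected
    for text, color in CANNED:
        stream.write(paint(text, color, use_color) + "\n")
    return 1                      # Run 1 ends with "exit 1"


# Writing to a StringIO behaves like a pipe: no escape codes are emitted.
buf = io.StringIO()
exit_code = main(buf)
```

A standalone script built this way would end with `raise SystemExit(main())`, so the gif shows colour when vhs drives a terminal while `testpath run ... | cat` stays plain text.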