Demo mock: the money-shot

21st Jun 2026

Mock the demo before building it

The cold-email asset is a CLI run-through (the spec) — which means it’s just text and timing, and we can fake it convincingly before the engine exists. A mock buys us two things:

  1. It doubles as the build spec — the transcript below is the exact output the real testpath run has to reproduce.
  2. It de-risks the message before the product — we can test whether the pitch lands (on a few friendly contacts, during inbox warmup) before sinking weeks into the statistics engine.
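
Since the transcript is meant to be the acceptance target, one way to hold the real build to it is a golden-file comparison. The sketch below is only an illustration of that idea and not part of the mock: the function names are hypothetical, and it assumes the only differences we tolerate are ANSI color codes and trailing whitespace.

```python
import re
from typing import List, Optional

# Matches ANSI CSI sequences such as "\x1b[31m", so colored TTY output can be
# compared against the plain-text transcript.
ANSI_CSI = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def normalize(text: str) -> List[str]:
    """Strip ANSI codes and trailing whitespace; drop trailing blank lines."""
    lines = [ANSI_CSI.sub("", line).rstrip() for line in text.splitlines()]
    while lines and lines[-1] == "":
        lines.pop()
    return lines


def first_mismatch(actual: str, golden: str) -> Optional[int]:
    """Return the 1-based line number of the first difference, or None."""
    a, g = normalize(actual), normalize(golden)
    for i, (x, y) in enumerate(zip(a, g), start=1):
        if x != y:
            return i
    if len(a) != len(g):
        # One side ran out first: the mismatch is the first missing line.
        return min(len(a), len(g)) + 1
    return None
```

Reporting the first differing line, rather than a bare pass/fail, keeps the check useful while the engine's output is still converging on the transcript.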

What it does not do: prove the product works. A resonant gif with no math behind it is a marketing asset, not a company, so the engine still has to follow, and we don’t imply to anyone that a live tool exists until we can back that up. The numbers below are plausible placeholders, chosen to be honest about what the real thing would show.

The money-shot

Two runs, side by side — the whole pitch is the contrast.

Figure: testpath catching a real regression, then dismissing noise (the cold-email demo gif).

The transcripts below are the same two runs in text (the build’s acceptance target):

Run 1 — catches a regression a single-run eval misses:

$ testpath run --scorer ./refund_agent.py --cases suite.jsonl --baseline v1.4 --candidate v1.5
testpath 0.1.0 · wrapping ./refund_agent.py · 24 cases · α=0.05

  v1.4 (baseline)    24 cases · 312 runs · 0.94M tok
  v1.5 (candidate)   24 cases · 298 runs · 0.89M tok

  case                         v1.4    v1.5    Δ       verdict
  ----------------------------------------------------------------------
  refund_within_policy         0.98    0.97    -0.01   ok
  escalate_angry_customer      0.95    0.94    -0.01   ok
  decline_out_of_policy        0.91    0.78    -0.13   REGRESSION  (p=0.004)
  multi_item_partial_refund    0.88    0.87    -0.01   ok
  verify_identity_before_pii   0.93    0.92    -0.01   ok
  ----------------------------------------------------------------------

  GATE: FAIL · 1 regression · exit 1
  cost: 610 runs · 1.83M tokens · ≈ $5.10 @ $2.8/M  ·  42% of a fixed-30 sweep (≈ $7 saved)

  note: a single-run eval scored decline_out_of_policy PASS on v1.5.
        across 18 runs it passed 78% vs 91% — a real drop (95% CI, p=0.004), not noise.

Run 2 — ignores noise a single-run eval false-flags:

$ testpath run --scorer ./refund_agent.py --cases suite.jsonl --baseline v1.5 --candidate v1.6
testpath 0.1.0 · wrapping ./refund_agent.py · 24 cases · α=0.05

  v1.5 (baseline)    24 cases · 286 runs · 0.86M tok
  v1.6 (candidate)   24 cases · 274 runs · 0.82M tok

  case                         v1.5    v1.6    Δ       verdict
  ----------------------------------------------------------------------
  decline_out_of_policy        0.78    0.82    +0.04   ok
  tone_match_brand             0.86    0.81    -0.05   noise  (within variance, p=0.21)
  refund_within_policy         0.97    0.98    +0.01   ok
  ----------------------------------------------------------------------

  GATE: PASS · 0 regressions · 1 flagged-but-noise · exit 0
  cost: 560 runs · 1.68M tokens · ≈ $4.70 @ $2.8/M  ·  39% of a fixed-30 sweep (≈ $7 saved)

  note: a single-run eval scored tone_match_brand FAIL on v1.6.
        across 22 runs the 0.05 dip sits inside run-to-run variance — not a regression.

That’s the teardown pain — “your eval scores one run; it can’t tell a regression from noise” — reproduced and resolved on screen.

The cost: line pre-empts the obvious objection: won’t running everything N times cost a fortune? Cost-aware sequential stopping spends ≈ 40% of a naive fixed-N sweep (42% and 39% in the two runs above), so each gated comparison lands around $5. That makes the token economy visible inside the pitch itself.
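
The placeholder figures on the two cost: lines can be checked with a few lines of arithmetic. This is a sketch under two assumptions that the transcripts do not state outright: that "fixed-30" means 30 runs per case for each of the two versions, and that a fixed sweep would use the same tokens per run as the runs shown.

```python
# Figures copied from the two transcripts; the constants below encode the
# assumed meaning of "fixed-30" (30 runs per case, per version).
CASES = 24
FIXED_RUNS_PER_CASE = 30
VERSIONS = 2
PRICE_PER_M_TOKENS = 2.8  # dollars per million tokens, from the cost: line


def cost_summary(runs: int, tokens_m: float) -> dict:
    """Recompute the cost: line from total runs and total tokens (millions)."""
    fixed_runs = CASES * FIXED_RUNS_PER_CASE * VERSIONS  # 1440 runs
    fraction = runs / fixed_runs
    spent = tokens_m * PRICE_PER_M_TOKENS
    # Assumes tokens per run in the fixed sweep match what was observed.
    fixed_cost = spent / fraction
    return {
        "spent": round(spent, 2),
        "pct_of_fixed": round(100 * fraction),
        "saved": round(fixed_cost - spent),
    }


run1 = cost_summary(runs=312 + 298, tokens_m=0.94 + 0.89)
run2 = cost_summary(runs=286 + 274, tokens_m=0.86 + 0.82)
print(run1)
print(run2)
```

Under those assumptions the numbers hang together: 610 and 560 runs come out at 42% and 39% of a 1440-run sweep, with about $7 saved each time and a spend of roughly $5.12 and $4.70 (the transcript rounds the first to $5.10).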

Rendering the gif

The mock lives in src/docs/public/demo/: an executable testpath (canned output, colors on a TTY, plain text when piped) and a demo.tape for vhs.
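
For illustration, the "colors on a TTY, plain text when piped" behaviour can be sketched as below. This is not the committed testpath file: the names, the choice of colors, and the two abbreviated canned lines are assumptions, and only the TTY check reflects what the description above says.

```python
import sys

RED, GREEN, RESET = "\x1b[31m", "\x1b[32m", "\x1b[0m"

# Two lines taken from the Run 1 transcript; a full mock would hold every line.
CANNED = [
    "  decline_out_of_policy        0.91    0.78    -0.13   REGRESSION  (p=0.004)",
    "  GATE: FAIL · 1 regression · exit 1",
]


def colorize(line: str, use_color: bool) -> str:
    """Wrap verdict keywords in ANSI colors only when use_color is true."""
    if not use_color:
        return line
    for word, color in (("REGRESSION", RED), ("FAIL", RED), ("PASS", GREEN)):
        line = line.replace(word, f"{color}{word}{RESET}")
    return line


def render(lines, use_color: bool) -> str:
    """Join the canned lines, colored or plain depending on use_color."""
    return "\n".join(colorize(line, use_color) for line in lines)


# isatty() is False when stdout is piped or redirected, so pipes get plain text.
print(render(CANNED, use_color=sys.stdout.isatty()))
```

A mock of Run 1 would also need to exit with status 1 to match the transcript's "exit 1", for example by ending with sys.exit(1).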

cd src/docs/public/demo
brew install vhs        # ttyd + ffmpeg under the hood
vhs demo.tape           # -> testpath.gif

The rendered testpath.gif (embedded at the top) is committed next to the tape — re-run vhs demo.tape to regenerate it after any change to the mock’s output.