The warmup month

21st Jun 2026

The runway

The cold-email domain needs ≈ 3–4 weeks of warmup before it can send without torching deliverability: a brand-new domain blasting cold mail goes straight to spam. That wait is not dead time; it’s a forced runway. The goal is to exit warmup with a working demo, a tested sending pipeline, and a send-ready list, so the day the inbox is warm is the day we send something real, not the day we start building.

Two rules hold all month: keep the warmup running, and do not send a single cold email early. Everything below happens behind that.

The one must-have: a working spike

If only one thing ships this month, it’s a thin vertical slice of the product: take an eval, run it N times, compute the stats (a Wilson interval on the pass rate plus a two-proportion test for regression against a baseline), and emit a noise-aware pass / fail / flaky verdict with a CI exit code. Nothing else: no dashboard, no judges, no UI.
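A minimal sketch of that slice's statistics, using only the Python standard library. The function names, the one-sided pooled z-test, the `flaky_width` rule, and the exit-code mapping are all assumptions made for illustration; the real spike may pick different thresholds or a different test.

```python
import math
from statistics import NormalDist


def wilson_interval(passes: int, runs: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a pass rate estimated from `runs` trials."""
    if runs == 0:
        return (0.0, 1.0)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = passes / runs
    denom = 1 + z * z / runs
    centre = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def regression_p_value(base_passes: int, base_runs: int, new_passes: int, new_runs: int) -> float:
    """One-sided pooled two-proportion z-test: is the new pass rate below baseline?"""
    pooled = (base_passes + new_passes) / (base_runs + new_runs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_runs + 1 / new_runs))
    if se == 0:
        return 1.0  # both samples all-pass or all-fail: no evidence of a drop
    z = (new_passes / new_runs - base_passes / base_runs) / se
    return NormalDist().cdf(z)  # small when the new rate is well below baseline


def verdict(base_passes: int, base_runs: int, new_passes: int, new_runs: int,
            alpha: float = 0.05, flaky_width: float = 0.30) -> str:
    """Hypothetical rule: fail on a significant drop, flaky if the interval is too wide."""
    if regression_p_value(base_passes, base_runs, new_passes, new_runs) < alpha:
        return "fail"
    low, high = wilson_interval(new_passes, new_runs)
    if high - low > flaky_width:
        return "flaky"  # not enough runs to call it either way
    return "pass"


def exit_code(result: str) -> int:
    """Assumed mapping: 0 lets CI pass, any non-zero code turns the pipeline red."""
    return {"pass": 0, "fail": 1, "flaky": 2}[result]
```

In a CI step this would end with something like `sys.exit(exit_code(verdict(95, 100, 70, 100)))`, which exits non-zero because 70/100 is a significant drop from 95/100.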

The spike exists to answer the two questions that actually de-risk the whole thesis:

Does it work? On real, stochastic agent output, does the test reliably catch an injected regression while ignoring run-to-run noise?
Is it affordable? Does cost-aware sequential stopping keep the run count — and the token bill — sane, or does separating signal from noise need infeasibly many runs?

If both answers are yes, we have a product and the single best cold-email asset: a demo that shows the thing the whole teardown says nobody has.
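To make the affordability question concrete, here is one possible shape for sequential stopping: run in small batches and stop as soon as the Wilson interval clears a pass-rate threshold on either side. The `threshold`, `batch`, and `max_runs` parameters and the "flaky" fallback are assumptions, not the spike's settled design. Note that checking a fixed-confidence interval after every batch inflates the error rate compared with a single look, so a real implementation would need a correction or a proper sequential test.

```python
import math
from statistics import NormalDist
from typing import Callable


def wilson_interval(passes: int, runs: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a pass rate estimated from `runs` trials."""
    if runs == 0:
        return (0.0, 1.0)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = passes / runs
    denom = 1 + z * z / runs
    centre = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


def run_until_decided(run_once: Callable[[], bool], threshold: float = 0.8,
                      batch: int = 5, max_runs: int = 100,
                      confidence: float = 0.95) -> tuple[str, int]:
    """Run the eval in batches; stop once the interval excludes the threshold.

    Returns the verdict and the number of runs actually spent.
    """
    passes = runs = 0
    while runs < max_runs:
        for _ in range(min(batch, max_runs - runs)):
            passes += bool(run_once())  # each call is one paid eval run
            runs += 1
        low, high = wilson_interval(passes, runs, confidence)
        if low > threshold:
            return ("pass", runs)   # clearly above the bar: stop paying
        if high < threshold:
            return ("fail", runs)   # clearly below the bar: stop paying
    return ("flaky", runs)          # budget exhausted without a clear answer
```

With these defaults, an eval that always fails is called after the first batch of 5 runs, one that always passes needs 20, and one sitting exactly at the threshold burns the full budget. That last case is the cost the spike has to measure on real output.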

Week by week

Week 1: scope + skeleton. Lock the MVP boundary (smallest shippable slice). Build the runner: BYO-key, wraps a user scorer, runs it N times, stores results. Validate the statistics on synthetic data first: set a known pass probability, inject a known regression, and confirm that the false-alarm rate matches α and that the injected regression is detected reliably.
Week 2: real output. Point the spike at an actual stochastic support-agent eval. Confirm it catches an injected regression and shrugs off noise; tune α and the noise floor. Measure cost, then implement and verify sequential stopping (clear cases should resolve in a handful of runs).
Week 3: make it showable. A clean CLI demo and a real GitHub Action that gates a pipeline red/green. Capture the demo asset (asciinema / gif / screenshot) for the cold email and the landing page; the pitch needs to be seen, not described.
Week 4: load the cannon. Resolve the list: turn the 8 “verify” + 8 “manual” leads into send-ready leads (verified email, stage 0). Test the sending machinery end to end: one throwaway email through the Gmail CLI, confirming that the stage bump and last_contact stamp fire. Final deliverability check (DMARC / SPF / DKIM green). Draft a tested cold-email skeleton, not 50 personalized drafts.

If warmup finishes at 3 weeks, weeks 3–4 compress; the spike is the spine, everything else flexes around it.
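The Week 1 synthetic check could look roughly like the simulation below: draw Bernoulli runs at a known pass probability, then count how often a two-proportion test fires with and without an injected regression. The pass probabilities (0.9 and 0.7), the 50 runs per side, the trial count, and the one-sided pooled z-test are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def regression_p_value(base_passes: int, base_runs: int, new_passes: int, new_runs: int) -> float:
    """One-sided pooled two-proportion z-test: is the new pass rate below baseline?"""
    pooled = (base_passes + new_passes) / (base_runs + new_runs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_runs + 1 / new_runs))
    if se == 0:
        return 1.0  # identical all-pass or all-fail samples: nothing to detect
    z = (new_passes / new_runs - base_passes / base_runs) / se
    return NormalDist().cdf(z)


def simulate_passes(rng: random.Random, pass_prob: float, runs: int) -> int:
    """Number of passes in `runs` independent trials with a known pass probability."""
    return sum(rng.random() < pass_prob for _ in range(runs))


def rejection_rate(base_prob: float, new_prob: float, runs: int = 50,
                   alpha: float = 0.05, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated comparisons in which the test flags a regression."""
    rng = random.Random(seed)  # seeded so the check is reproducible
    rejected = 0
    for _ in range(trials):
        base = simulate_passes(rng, base_prob, runs)
        new = simulate_passes(rng, new_prob, runs)
        if regression_p_value(base, runs, new, runs) < alpha:
            rejected += 1
    return rejected / trials


# No regression injected: this is the false-alarm rate and should sit near alpha.
false_alarm_rate = rejection_rate(0.9, 0.9)
# Regression injected (0.9 -> 0.7): this is the detection rate (power).
detection_rate = rejection_rate(0.9, 0.7)
print(f"false alarms: {false_alarm_rate:.3f}, detections: {detection_rate:.3f}")
```

If the false-alarm rate drifts well away from α, the normal approximation is probably too rough at that run count, which is worth knowing before pointing the spike at real output in Week 2.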

Definition of done

By the time the inbox is warm, all of these are true:

Spike: catches an injected regression, ignores noise, on real-ish agent output, at a sane token cost.
Demo asset: a gif/screenshot that makes the pitch land in one glance.
List: every target send-ready — verified email, a real hook, stage 0.
Pipeline: one test email sent and tracked end to end (stage + last_contact).
Deliverability: inbox warm, DMARC / SPF / DKIM all passing.
Copy: a tested, source-keyed template; full personalization happens at send-time, not now.
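One way to confirm the deliverability item on the throwaway test email is to read the `Authentication-Results` header the receiving server adds and check that SPF, DKIM, and DMARC all report `pass`. The sketch below is an assumption about how that check might be scripted; the function names are hypothetical, and it keeps only the first result per method, which is a simplification when a message carries several DKIM signatures.

```python
import re


def auth_results(header: str) -> dict[str, str]:
    """Extract spf, dkim, and dmarc results from an Authentication-Results header value."""
    found: dict[str, str] = {}
    for method, result in re.findall(r"\b(spf|dkim|dmarc)=(\w+)", header, flags=re.IGNORECASE):
        found.setdefault(method.lower(), result.lower())  # keep the first result per method
    return found


def all_green(header: str) -> bool:
    """True only if SPF, DKIM, and DMARC are all present and all report 'pass'."""
    results = auth_results(header)
    return all(results.get(method) == "pass" for method in ("spf", "dkim", "dmarc"))
```

For example, a header value such as `mx.example.net; dkim=pass header.i=@example.com; spf=pass smtp.mailfrom=a@example.com; dmarc=pass header.from=example.com` makes `all_green` return `True`, while a missing or failing method makes it return `False`.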

What we explicitly don’t do

Write the cold copy early. It rots; we have fresher context (and more leads) the week we send. Decided already.
Build the whole product. No judges, dashboards, or UI — that’s the knife fight we don’t pick. The spike, and only the spike.
Send before warm. No exceptions. The whole runway is in service of not doing this.