The warmup month
21st Jun 2026
The runway
The cold-email domain needs ≈ 3–4 weeks of warmup before it can send without torching deliverability: a brand-new domain blasting cold mail goes straight to spam. That wait is not dead time; it’s a forced runway. The goal is to exit warmup with a working demo, a tested sending pipeline, and a send-ready list, so the day the inbox is warm is the day we send something real, not the day we start building.
Two rules hold all month: keep the warmup running, and do not send a single cold email early. Everything below happens behind that.
The one must-have: a working spike
If only one thing ships this month, it’s a thin vertical slice of the product: take an eval, run it N times, compute the stats (Wilson interval + two-proportion regression test), and emit a noise-aware pass / fail / flaky verdict with a CI exit code. Nothing else: no dashboard, no judges, no UI.
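A minimal sketch of that slice, standard library only. The thresholds (alpha, max_width), the rule that a too-wide Wilson interval means "flaky", and the exit-code mapping are all assumptions for illustration, not decisions already made.

```python
import math
from statistics import NormalDist

def wilson_interval(passes, n, confidence=0.95):
    """Wilson score interval for a pass rate observed as passes/n."""
    if n == 0:
        return 0.0, 1.0
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

def regression_p_value(base_passes, base_n, new_passes, new_n):
    """One-sided two-proportion z-test: is the new pass rate lower than baseline?"""
    pooled = (base_passes + new_passes) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return 1.0  # both samples all-pass or all-fail: no evidence of a drop
    z = (new_passes / new_n - base_passes / base_n) / se
    return NormalDist().cdf(z)

def verdict(base_passes, base_n, new_passes, new_n, alpha=0.05, max_width=0.3):
    """Return 'fail', 'flaky', or 'pass'. alpha and max_width are assumed defaults."""
    if regression_p_value(base_passes, base_n, new_passes, new_n) < alpha:
        return "fail"   # statistically significant drop in pass rate
    low, high = wilson_interval(new_passes, new_n)
    if high - low > max_width:
        return "flaky"  # too few runs to separate signal from noise
    return "pass"

# Assumed mapping; a CI wrapper would call sys.exit(EXIT_CODES[verdict(...)]).
EXIT_CODES = {"pass": 0, "fail": 1, "flaky": 2}

print(verdict(95, 100, 60, 100))  # large drop against the baseline
print(verdict(95, 100, 94, 100))  # one-run difference, well within noise
print(verdict(4, 5, 3, 5))        # five runs per side: interval too wide
```

The pooled z-test is a normal approximation, so it gets shaky when pass or fail counts are very small; the "flaky" branch is one way to refuse a verdict in exactly that regime.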
The spike exists to answer the two questions that actually de-risk the whole thesis:
- Does it work? On real, stochastic agent output, does the test reliably catch an injected regression while ignoring run-to-run noise?
- Is it affordable? Does cost-aware sequential stopping keep the run count — and the token bill — sane, or does separating signal from noise need infeasibly many runs?
If both answers are yes, we have a product and the single best cold-email asset: a demo that shows the thing the whole teardown says nobody has.
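One way to probe the affordability question is Wald's sequential probability ratio test (SPRT), sketched below. SPRT is a swapped-in technique here, not something this plan has settled on, and the hypothesized pass rates p_ok and p_bad, the error rates alpha and beta, and the max_runs cap are all assumed values.

```python
import math

def sprt_run(outcomes, p_ok=0.9, p_bad=0.7, alpha=0.05, beta=0.05, max_runs=200):
    """Wald SPRT over a stream of pass/fail outcomes.

    H0: pass rate is p_ok (healthy). H1: pass rate is p_bad (regressed).
    Returns (verdict, runs_used). `outcomes` may be a lazy generator, so
    stopping early means the remaining eval runs are never executed.
    """
    upper = math.log((1 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1 - alpha))  # crossing it accepts H0
    llr = 0.0                             # log-likelihood ratio, H1 vs H0
    runs = 0
    for passed in outcomes:
        runs += 1
        if passed:
            llr += math.log(p_bad / p_ok)
        else:
            llr += math.log((1 - p_bad) / (1 - p_ok))
        if llr >= upper:
            return "fail", runs
        if llr <= lower:
            return "pass", runs
        if runs >= max_runs:
            break
    return "flaky", runs                  # budget spent, still undecided

def counted(values, counter):
    """Yield values while counting how many were actually consumed."""
    for v in values:
        counter[0] += 1
        yield v

consumed = [0]
print(sprt_run(counted([False] * 50, consumed)), consumed[0])
print(sprt_run([True] * 50))
```

With these assumed rates, a run of straight failures resolves in 3 evals and a run of straight passes in 12, which is the "clear cases resolve in a handful of runs" behavior the spike has to confirm on real output.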
Week by week
- Week 1 — scope + skeleton. Lock the MVP boundary (smallest shippable slice). Build the runner: BYO-key, wraps a user scorer, runs it N times, stores results. Validate the statistics on synthetic data first: simulate a known pass probability, inject a known regression, and confirm the false-alarm rate stays at or below α while the detection rate is high enough to be useful.
- Week 2 — real output. Point the spike at an actual stochastic support-agent eval. Confirm it catches an injected regression and shrugs off noise; tune α and the noise floor. Measure cost, then implement and verify sequential stopping (clear cases resolve in a handful of runs).
- Week 3 — make it showable. A clean CLI demo and a real GitHub Action that gates a pipeline red/green. Capture the demo asset (asciinema / gif / screenshot) for the cold email and the landing page — the pitch needs to be seen, not described.
- Week 4 — load the cannon. Resolve the list: turn the 8 “verify” + 8 “manual” leads into send-ready (verified email, stage 0). Test the sending machinery end to end: one throwaway email through the Gmail CLI, confirm the stage bump and last_contact stamp fire. Final deliverability check (DMARC / SPF / DKIM green). Draft a tested cold-email skeleton, not 50 personalized drafts.
If warmup finishes at 3 weeks, weeks 3–4 compress; the spike is the spine, everything else flexes around it.
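The Week 1 synthetic check can be sketched as a small simulation: draw pass/fail runs from a known pass probability, compare baseline against candidate with a one-sided two-proportion z-test, and count how often a regression is flagged. The pass rates, run count, trial count, and seed below are assumed values chosen only for illustration.

```python
import math
import random
from statistics import NormalDist

def regression_p_value(base_passes, base_n, new_passes, new_n):
    """One-sided two-proportion z-test: is the new pass rate lower than baseline?"""
    pooled = (base_passes + new_passes) / (base_n + new_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / base_n + 1 / new_n))
    if se == 0:
        return 1.0  # identical all-pass or all-fail samples
    z = (new_passes / new_n - base_passes / base_n) / se
    return NormalDist().cdf(z)

def flag_rate(p_base, p_new, n, alpha=0.05, trials=1000, seed=0):
    """Fraction of simulated comparisons that flag a regression."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(trials):
        base = sum(rng.random() < p_base for _ in range(n))
        new = sum(rng.random() < p_new for _ in range(n))
        if regression_p_value(base, n, new, n) < alpha:
            flagged += 1
    return flagged / trials

# No regression injected: every flag is a false alarm.
false_alarm = flag_rate(0.9, 0.9, n=50)
# Injected regression (0.9 -> 0.6): every flag is a true detection.
detection = flag_rate(0.9, 0.6, n=50)
print(f"false-alarm rate: {false_alarm:.3f}, detection rate: {detection:.3f}")
```

If the false-alarm rate lands well above the chosen α, or detection is poor at a run count we can afford, that is a signal to revisit the test or the run budget before pointing the spike at real output.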
Definition of done
By the time the inbox is warm, all of these are true:
- Spike: catches an injected regression, ignores noise, on real-ish agent output, at a sane token cost.
- Demo asset: a gif/screenshot that makes the pitch land in one glance.
- List: every target send-ready — verified email, a real hook, stage 0.
- Pipeline: one test email sent and tracked end to end (stage + last_contact).
- Deliverability: inbox warm, DMARC / SPF / DKIM all passing.
- Copy: a tested, source-keyed template. Full personalization happens at send time, not now.
What we explicitly don’t do
- Write the cold copy early. It rots; we have fresher context (and more leads) the week we send. Decided already.
- Build the whole product. No judges, dashboards, or UI — that’s the knife fight we don’t pick. The spike, and only the spike.
- Send before warm. No exceptions. The whole runway is in service of not doing this.