Builder testing teardown: where testpath fits

20th Jun 2026

The finding

Taking the builders from our candidate list, we tore down how each one tests the agents built on it. Every one of them already ships testing: emulators, eval suites, LLM-judges, analytics. But none of them handles stochasticity: they run a test once, score it pass/fail, and have no way to separate a real regression from LLM run-to-run noise. A stable red/green signal across versions is exactly testpath’s wedge. Ten teardowns follow: the nine builders, plus Ada as a managed contrast (testing features verified from each vendor’s docs, Jun 2026).
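To make "real regression vs. run-to-run noise" concrete, here is a minimal sketch of one way to compare two agent versions once each test case has been run several times. It uses a one-sided Fisher exact test on pass counts. The function names, the 20-run sample size, and the 0.05 cutoff are illustrative assumptions, not a description of how testpath or any vendor above works.

```python
from math import comb


def regression_p_value(pass_old: int, n_old: int, pass_new: int, n_new: int) -> float:
    """One-sided Fisher exact test.

    Returns the chance of seeing pass_new or fewer passes in the new version
    if both versions actually share the same underlying pass rate.
    """
    total = n_old + n_new
    total_pass = pass_old + pass_new
    total_fail = total - total_pass
    # Smallest pass count the new version could have, given the totals.
    lowest = max(0, total_pass - n_old)
    favourable = sum(
        comb(total_pass, k) * comb(total_fail, n_new - k)
        for k in range(lowest, pass_new + 1)
    )
    return favourable / comb(total, n_new)


def verdict(pass_old: int, n_old: int, pass_new: int, n_new: int,
            alpha: float = 0.05) -> str:
    """Label a change as a regression only if chance alone is an unlikely explanation.

    alpha is an assumed cutoff chosen for illustration.
    """
    p = regression_p_value(pass_old, n_old, pass_new, n_new)
    return "regression" if p < alpha else "within noise"


if __name__ == "__main__":
    print(verdict(19, 20, 18, 20))  # one extra failure in 20 runs
    print(verdict(20, 20, 10, 20))  # half the runs now fail
```

A single-run pass/fail cannot make this distinction at all: with one run per version, the first case above could show up as green then red, or red then green, purely by chance.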

Botpress

Offers: The visual Studio has only a manual emulator plus (paid) analytics, with no saved tests. The code-first ADK shipped Evals (Mar 2026): declarative tests with response/tool/state assertions, an LLM-judge, and CLI/CI.
Philosophy: Split — visual users iterate by hand + watch dashboards; the dev ADK adopts a real test-driven loop.
Gap: ADK evals run once, single pass/fail — no repeats, variance, or noise handling. The popular visual product has no regression testing at all.

Voiceflow

Offers: The most built-out — in-canvas Test Tool, native Tests (turns + AI simulation; response/routing/tool-call checks, parallel), Evaluations (LLM-judge on every transcript), agent-to-agent goal tests, a CLI, and a dev→staging→prod pipeline.
Philosophy: Transcript-driven observability — capture every conversation, auto-score it, jump back to the canvas.
Gap: Docs literally say “AI responses may vary… focus on goal achievement,” with no re-run/stability mechanism. Silent regressions rely on brittle exact-match or coarse trend-watching.

Rasa

Offers: Gold-standard deterministic testing — rasa test nlu/core (F1, confusion matrices, cross-validation), CALM E2E tests with ~11 assertions + two LLM-judge (“relevant”/“grounded”) checks, CI-first.
Philosophy: Conversation-Driven Development — turn real conversations into tests; testing as a first-class engineering discipline.
Gap: Rock-solid for classic NLU; thin for the stochastic LLM layer — no non-determinism, multi-run, or variance handling.

Cognigy

Offers: Playbooks (exact 1:1-match regression scripts) + a newer LLM-judge Simulator (synthetic conversations, success-rate scoring, scheduled runs) + Insights analytics.
Philosophy: Enterprise analytics-driven optimization — deterministic scripts plus a synthetic “performance lab.”
Gap: Playbooks break on stochastic output (official advice: mock with static data); the Simulator only watches aggregate rates — no per-case version diffing to catch silent regressions.

Kore.ai

Offers: Mature Batch Testing (NLU accuracy) + Conversation Testing (assertions, flow regression), plus a newer “Evaluation Studio” (agent/tool evaluators, LLM-judge, adversarial/persona sims, observability).
Philosophy: Continuous train → test → tune lifecycle across technical/quality/safety/business metrics.
Gap: Battle-tested tooling is deterministic NLU; the newer agent-eval names “LLM change management” but ships no documented noise-tolerant regression mechanism.

Parlant

Offers: No agent regression harness at all — only a design-time guideline linter (coherence checker for contradictory rules) plus runtime reasoning (ARQ) and traces. It once shipped a parlant-test entry point and removed it.
Philosophy: Misalignment is a structural design problem — fix behavior by editing guidelines, not by running test suites.
Gap: Nothing for users on stochastic output: no multi-run, variance, golden conversations, or version comparison. Tellingly, the CI at Emcie (the company behind Parlant) uses a pytest-stochastics majority-vote plugin. They understand the problem; they just don’t ship a solution to customers.
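For readers unfamiliar with the majority-vote idea, a minimal sketch follows. It is a generic illustration written for this memo and does not reproduce the pytest-stochastics API; the `majority_vote` helper and the default of five runs are assumptions.

```python
from typing import Callable, Tuple


def majority_vote(check: Callable[[], bool], runs: int = 5) -> Tuple[bool, int]:
    """Run a non-deterministic check several times.

    Returns (passed, pass_count), where passed is True only if strictly
    more than half of the runs succeeded.
    """
    pass_count = sum(1 for _ in range(runs) if check())
    return pass_count * 2 > runs, pass_count


if __name__ == "__main__":
    # Stand-in for an agent test whose outcome varies between runs.
    outcomes = iter([True, False, True, True, False])
    passed, count = majority_vote(lambda: next(outcomes), runs=5)
    print(passed, count)  # True 3
```

Majority voting stabilises a single test's red/green result, but on its own it does not say whether a drop from 5/5 to 3/5 between versions is meaningful.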

Inkeep

Offers: A genuine eval layer — custom LLM-judge evaluators, datasets/test suites (CSV or SDK), batch + online (sampled) evals, retroactive scoring of past conversations, full CI API.
Philosophy: Programmatic, eval-driven + feedback-driven — codify judges, run offline and on live traffic.
Gap: Every item scored once; “reruns” compare point scores, so a 1–2-point judge swing from noise is indistinguishable from a real regression — left entirely to the user.
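The point-score problem can be illustrated with a short sketch: score the same item several times per version and compare the shift in mean score against the spread of the scores, here with Welch's t-statistic. The sample scores, the helper names, and the cutoff of 2 are assumptions chosen for illustration; none of this is Inkeep functionality.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def welch_t(baseline: Sequence[float], candidate: Sequence[float]) -> float:
    """Welch's t-statistic for the change in mean judge score.

    Negative values mean the candidate scored lower than the baseline.
    Each sample needs at least two scores.
    """
    diff = mean(candidate) - mean(baseline)
    std_err = sqrt(
        variance(baseline) / len(baseline) + variance(candidate) / len(candidate)
    )
    if std_err == 0:
        # No spread at all: any difference is a real shift.
        if diff == 0:
            return 0.0
        return float("inf") if diff > 0 else float("-inf")
    return diff / std_err


def score_dropped(baseline: Sequence[float], candidate: Sequence[float],
                  cutoff: float = 2.0) -> bool:
    """Flag a drop only when it is large relative to judge noise.

    cutoff=2.0 is a rough rule of thumb assumed for this sketch.
    """
    return welch_t(baseline, candidate) < -cutoff


if __name__ == "__main__":
    baseline = [8, 7, 9, 8, 8]      # five judge scores for the old version
    noisy = [7, 8, 8, 9, 7]         # same quality, different draws
    worse = [5, 6, 5, 6, 5]         # a genuine drop
    print(score_dropped(baseline, noisy))  # False
    print(score_dropped(baseline, worse))  # True
```

With a single score per version, the 1–2-point swings inside `baseline` itself would be enough to trigger a false alarm.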

Salesforce Agentforce

Offers: Testing Center (sandbox), AI/synthetic test-case generation, batch/parallel testing (Topic/Action/Response pass %), a metadata Testing API, and sf agent test CI/CD (JUnit/TAP/JSON).
Philosophy: “Agentic Lifecycle Management” — test in sandbox to an “acceptable accuracy,” deploy, monitor.
Gap: Single-pass semantic pass/fail — no native multi-run, variance, or pass-rate thresholds; hallucinations can pass, and practitioners report hand-rolling “3+ run averaging.” No test coverage is required before production.
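As a rough illustration of what a pass-rate threshold could look like, and why averaging over three runs is weak evidence, the sketch below gates a test on the Wilson lower confidence bound of its pass rate. The function names, the 95% confidence level, and the 0.8 required rate are assumptions for this memo, not Agentforce features.

```python
from math import sqrt


def wilson_lower_bound(passes: int, runs: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate.

    z=1.96 corresponds to roughly 95% confidence.
    """
    if runs == 0:
        return 0.0
    rate = passes / runs
    centre = rate + z * z / (2 * runs)
    margin = z * sqrt(rate * (1 - rate) / runs + z * z / (4 * runs * runs))
    return (centre - margin) / (1 + z * z / runs)


def passes_gate(passes: int, runs: int, required_rate: float = 0.8) -> bool:
    """Pass only if we can be reasonably sure the true pass rate meets the bar."""
    return wilson_lower_bound(passes, runs) >= required_rate


if __name__ == "__main__":
    print(round(wilson_lower_bound(3, 3), 3))    # 0.439
    print(round(wilson_lower_bound(30, 30), 3))  # 0.886
    print(passes_gate(3, 3), passes_gate(30, 30))  # False True
```

Three clean runs out of three are still consistent with a true pass rate below 50%, which is why hand-rolled "3+ run averaging" reduces flakiness without giving a dependable signal.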

Yellow.ai

Offers: “Agentic testing” (Evaluation + multi-turn Simulation), LLM-judge accuracy/empathy thresholds, auto test-gen from KB/sessions, regression checks across sandbox→prod, plus legacy NLU tests + Insights analytics.
Philosophy: Automated QA “built for AI’s probabilistic nature” — parallel LLM-judged scenarios + a self-learning production loop.
Gap: Still single-run threshold pass/fail — no repeats, variance, or version-diff/drift detection. “Real regression or noise?” goes unaddressed.

Ada — managed, the contrast

Offers: Interactive testing, Simulations at scale (LLM-judge pass/fail vs expected outcomes), plus in-production Automated Resolutions scoring, a Reviewer Model, and a coaching loop.
Philosophy: Managed, but Ada preaches customer ownership — while supplying all the eval machinery itself.
Gap (why Ada’s customers are weaker targets): Eval is baked into the turnkey stack, so customers feel covered and the felt pain is low, even though Ada’s own docs concede “variability is expected between runs.” This is the managed pattern: customers of managed platforms are low-priority targets for us.

The takeaway

Same shape across all ten: real testing tooling, zero handling of stochasticity. They catch deterministic breaks and one-shot LLM outputs, but nobody gives their customers a stable signal that tells a genuine silent regression apart from random LLM variance — the most damning evidence being teams hand-rolling multi-run averaging (Agentforce) and a vendor running stochastic-vote tests internally while shipping none of it (Parlant). That’s the unmet pain in every builder’s customer base, and precisely what testpath sells. Ada shows the flip side: when the platform manages eval, the customer stops feeling the pain — which is exactly why we chase builders’ customers, not managed services’. Bonus: since testpath complements their eval rather than replacing it, each builder is also a potential integration partner.