User story: nightly quality gate — golden suite as regression blocker

As the rig operator

I want a nightly evaluation harness that runs the current rig against a fixed task set and fails the pipeline on >10% regression on any tracked metric

So that prompt changes, dependency bumps, and model upgrades cannot silently degrade quality; autonomy-tier promotion (T1 → T2 → T3) has a defensible data source; and the headline claim “~20 min issue→merge, ~$0.62/task” is an invariant with guard rails, not a snapshot.

See whats-next whitepaper §Priority 4 and the source quality-and-evaluation.md whitepaper.

Today the rig has zero automated quality gates. Every change is a hope — “probably fine, CI was green.” The source whitepaper is explicit: “the agents are doing well” is not evidence; a dashboard line is. Implementation-status tracks 7 quality-and-evaluation capabilities, 0 deployed.

The golden suite is not SWE-bench; it is our own tasks: the ones the rig actually runs in production, fixtured and replayable. SWE-bench Pro is a separate, weekly, broader check used only for trend lines. Per-incident regression cases grow the suite organically: every bug the rig misses becomes a new test.

  1. Golden suite of 10 internal tasks — seeded from real merged issues across rig-conductor, rig-docs, rig-gitops. Each task has a fixture (starting SHA, issue body, expected diff scope) and a grading rubric (compile clean, tests pass, acceptance-criteria checklist).
  2. Nightly harness deployed as a CronJob — runs at 02:00 UTC (low contention with dispatch). Spins up ephemeral agent runtimes against the golden suite; posts results to Phoenix (via Priority 2) and the rig-conductor event store.
  3. Grafana dashboard for the 10-task trend — pass rate, median latency, median cost, median turns. Available for screenshot in weekly reviews.
  4. Regression threshold: >10% degradation on any of {pass rate, median cost, median latency} versus the rolling 7-day baseline fails the pipeline and sends an alert to Discord.
  5. Per-incident regression cases — every production bug the rig misses gets a row in regression-cases/ + a new task in the nightly suite within the same PR that fixes it. Discipline, not automation.
  6. Weekly SWE-bench Pro subset — ~20 tasks, runs Saturday 02:00 UTC; budget ~$20–40/week; trend line only, does not block the pipeline. Dropped first if budget tightens.
  7. Cost dashboard row — the nightly run’s spend is itself tracked via Priority 3 virtual keys; nightly-eval becomes a rig user for attribution.
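The regression threshold in item 4 can be sketched as a small comparison against the baseline. This is a minimal sketch under stated assumptions, not the rig's actual projection: the metric key names, the `find_regressions` helper, and the choice of the mean of the last seven nightly runs as the "rolling 7-day baseline" are all hypothetical, and the sample numbers are illustrative only.

```python
from statistics import mean

THRESHOLD = 0.10  # more than 10% degradation fails the run

# Assumed metric names. Pass rate gets worse when it falls;
# cost and latency get worse when they rise.
HIGHER_IS_BETTER = {
    "pass_rate": True,
    "median_cost_usd": False,
    "median_latency_s": False,
}


def degradation(metric: str, baseline: float, tonight: float) -> float:
    """Relative change in the 'worse' direction; positive means degraded."""
    if baseline == 0:
        return 0.0  # assumption: no relative signal without a non-zero baseline
    change = (tonight - baseline) / baseline
    return -change if HIGHER_IS_BETTER[metric] else change


def find_regressions(history: list[dict[str, float]],
                     tonight: dict[str, float]) -> dict[str, float]:
    """Return the metrics that degraded by more than THRESHOLD.

    The baseline is assumed to be the mean of the last 7 nightly runs.
    """
    window = history[-7:]
    regressions = {}
    for metric in HIGHER_IS_BETTER:
        baseline = mean(run[metric] for run in window)
        worse_by = degradation(metric, baseline, tonight[metric])
        if worse_by > THRESHOLD:
            regressions[metric] = worse_by
    return regressions


# Illustrative data: seven identical baseline nights, then a drop in pass rate.
history = [{"pass_rate": 0.9, "median_cost_usd": 0.62, "median_latency_s": 1200.0}] * 7
tonight = {"pass_rate": 0.7, "median_cost_usd": 0.64, "median_latency_s": 1250.0}
failed = find_regressions(history, tonight)
```

A non-empty result would be the signal to fail the pipeline and post the alert; the Discord webhook call itself is left out because its payload is not specified here.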
  • Autonomy tier promotion from T1 to T2. Per trust-model.md, T2 requires “20 successful runs, zero rollbacks, quality metrics within tolerance.” That sentence is meaningless without a fixed definition of “successful run” — the nightly suite is that definition.
  • Per-PR regression gate (follow-up Phase 2) can be hung on the same scaffold: PRs touching prompts or brain content trigger a subset of the nightly suite before merge.
  • Property-based testing on labeled changes (Phase 2) uses the same harness plumbing.
  • LLM-as-judge sampling (10% T0, 100% T2) is a second consumer of the nightly pipeline.
  • Quarterly LiveCodeBench (deferred per quality-and-evaluation.md — drop first if budget tightens)
  • Inspect AI adoption (deferred — emerging; re-evaluate in Era 2)
  • DORA metrics adapted to agents (Phase 2 follow-up; same pipeline, different consumer)
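The T2 promotion rule quoted above ("20 successful runs, zero rollbacks, quality metrics within tolerance") becomes checkable once a "successful run" is pinned to the grading rubric. The following is a hedged sketch of one possible reading: the `RunRecord` shape and `eligible_for_t2` are hypothetical names, and it assumes the 20 runs are counted cumulatively rather than consecutively, which trust-model.md may define differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One graded nightly run of a golden task (hypothetical shape)."""
    compile_clean: bool
    tests_pass: bool
    acceptance_checks: tuple[bool, ...]  # one entry per acceptance-criteria checklist item
    rolled_back: bool = False
    within_tolerance: bool = True  # no tracked metric past the regression threshold

    @property
    def successful(self) -> bool:
        # A run counts as successful only if every rubric item passed.
        return self.compile_clean and self.tests_pass and all(self.acceptance_checks)


def eligible_for_t2(runs: list[RunRecord], required_successes: int = 20) -> bool:
    """Apply the three quoted conditions to a list of graded runs."""
    successes = sum(r.successful for r in runs)
    return (
        successes >= required_successes
        and not any(r.rolled_back for r in runs)
        and all(r.within_tolerance for r in runs)
    )


good = RunRecord(compile_clean=True, tests_pass=True, acceptance_checks=(True, True))
ready = eligible_for_t2([good] * 20)
```

Keeping the rubric inside the record means the promotion decision and the nightly dashboard read the same definition of success.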

Medium-high. Sequenced after Priority 3 so that the nightly run’s spend is itself bounded by a virtual key. Golden-suite curation can start in parallel with Priorities 1–3; the harness wiring waits until Priority 3 is in place.

  • AC 1 (golden suite seeding — 10 tasks): ~5 days. Task selection, fixture capture, rubric writing. Hardest step — needs real engineering taste.
  • AC 2 (harness CronJob): ~5 days. Kustomize manifest + agent-runtime invocation loop + Phoenix span emission.
  • AC 3 (Grafana dashboard): ~2 days. TraceQL over Phoenix/Langfuse span attributes.
  • AC 4 (regression threshold + alert): ~3 days. Projection + Discord webhook.
  • AC 5 (regression-cases discipline): ongoing, process only — no code.
  • AC 6 (SWE-bench Pro weekly): ~3 days — or defer.
  • AC 7 (cost attribution via virtual key): ~1 day.

Total: ~2.5 weeks focused, sequential after Priority 3.

Per quality-and-evaluation.md: ~$3–8/night × 365 = $1.1–2.9k/year for the golden suite. ~$20–40/week × 52 = $1.0–2.1k/year for SWE-bench Pro. Total ceiling ~$5k/year — tracked via Priority 3 virtual key.
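The annual figures above follow directly from the per-run ranges. A short worked check, using only the numbers quoted in this section (the `annual_range` helper is illustrative):

```python
def annual_range(low: float, high: float, periods_per_year: int) -> tuple[float, float]:
    """Scale a per-period cost range (USD) to a yearly range."""
    return low * periods_per_year, high * periods_per_year


golden = annual_range(3, 8, 365)      # nightly golden suite: (1095, 2920)
swe_bench = annual_range(20, 40, 52)  # weekly SWE-bench Pro subset: (1040, 2080)
total = (golden[0] + swe_bench[0], golden[1] + swe_bench[1])  # (2135, 5000)
```

The upper bound of the combined range, $5,000, is where the ~$5k/year ceiling comes from; the lower bound is roughly $2.1k/year.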