User story: nightly quality gate — golden suite as regression blocker
User story
As the rig operator
I want a nightly evaluation harness that runs the current rig against a fixed task set and fails the pipeline on >10% regression on any tracked metric
So that prompt changes, dependency bumps, and model upgrades cannot silently degrade quality; autonomy-tier promotion (T1 → T2 → T3) has a defensible data source; and the headline claim “~20 min issue→merge, ~$0.62/task” is an invariant with guard rails, not a snapshot.
Context
See whats-next whitepaper §Priority 4 and the source quality-and-evaluation.md whitepaper.
Today the rig has zero automated quality gates. Every change is a hope — “probably fine, CI was green.” The source whitepaper is explicit: “the agents are doing well” is not evidence; a dashboard line is. Implementation-status tracks 7 quality-and-evaluation capabilities, 0 deployed.
The golden suite is not SWE-bench; it is our own tasks, the ones the rig actually runs in production, fixtured and replayable. SWE-bench Pro is a separate, weekly, broader check for trend lines. Per-incident regression cases grow the suite organically: every bug the rig misses becomes a new test.
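A fixtured, replayable task could be represented roughly as follows. This is a minimal sketch: the fields mirror the fixture and rubric listed under Acceptance criteria, but every name here (`GoldenTask`, `diff_in_scope`, the field names, the `source` tag) is hypothetical and not part of the rig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenTask:
    """One replayable golden-suite task. All field names are hypothetical."""

    task_id: str
    repo: str                             # e.g. "rig-conductor"
    starting_sha: str                     # commit the agent run starts from
    issue_body: str                       # original issue text, replayed verbatim
    expected_diff_scope: tuple[str, ...]  # path prefixes the diff may touch
    rubric: tuple[str, ...] = (
        "compile clean",
        "tests pass",
        "acceptance-criteria checklist",
    )
    source: str = "seed"                  # assumed tag: "seed" or "incident"


def diff_in_scope(task: GoldenTask, changed_paths: list[str]) -> bool:
    """True when every changed path falls under an expected prefix.

    An empty list of changed paths counts as in scope.
    """
    return all(
        any(path.startswith(prefix) for prefix in task.expected_diff_scope)
        for path in changed_paths
    )
```

Keeping the task immutable (`frozen=True`) fits the idea of a fixed task set: a fixture that changes between nights would make the trend lines incomparable.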
Acceptance criteria
- ⏳ Golden suite of 10 internal tasks — seeded from real merged issues across rig-conductor, rig-docs, rig-gitops. Each task has a fixture (starting SHA, issue body, expected diff scope) and a grading rubric (compile clean, tests pass, acceptance-criteria checklist).
- ⏳ Nightly harness deployed as a CronJob — runs at 02:00 UTC (low contention with dispatch). Spins up ephemeral agent runtimes against the golden suite; posts results to Phoenix (via Priority 2) and the rig-conductor event store.
- ⏳ Grafana dashboard for the 10-task trend — pass rate, median latency, median cost, median turns. Available for screenshot in weekly reviews.
- ⏳ Regression threshold — `> 10%` degradation on any of {pass rate, median cost, median latency} versus rolling 7-day baseline fails the pipeline and alerts to Discord.
- ⏳ Per-incident regression cases — every production bug the rig misses gets a row in `regression-cases/` + a new task in the nightly suite within the same PR that fixes it. Discipline, not automation.
- ⏳ Weekly SWE-bench Pro subset — ~20 tasks, runs Saturday 02:00 UTC; budget ~$20–40/week; trend line only, does not block the pipeline. Dropped first if budget tightens.
- ⏳ Cost dashboard row — the nightly run’s spend is itself tracked via Priority 3 virtual keys; `nightly-eval` becomes a rig user for attribution.
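The regression threshold can be sketched as a small pure function. The criteria do not say how the rolling 7-day baseline is aggregated or whether the 10% is relative or absolute, so this sketch assumes the baseline is the mean of the last seven nightly values and that degradation is measured relative to it. The metric keys and function names are hypothetical.

```python
from statistics import mean

# Which direction is "worse" for each gated metric: pass rate degrades
# when it falls, cost and latency degrade when they rise.
HIGHER_IS_BETTER = {
    "pass_rate": True,
    "median_cost_usd": False,
    "median_latency_s": False,
}
THRESHOLD = 0.10  # ">10% degradation" from the acceptance criteria


def degradation(metric: str, baseline: float, tonight: float) -> float:
    """Relative degradation versus baseline; positive means worse."""
    if baseline == 0:
        return 0.0  # assumption: no usable baseline, so do not gate
    change = (tonight - baseline) / baseline
    return -change if HIGHER_IS_BETTER[metric] else change


def check_regressions(
    history: list[dict[str, float]], tonight: dict[str, float]
) -> list[str]:
    """Return the metrics that degraded by more than THRESHOLD.

    `history` holds one dict per previous nightly run, oldest first;
    only the last seven entries form the baseline.
    """
    window = history[-7:]
    if not window:
        return []  # assumption: the first run has nothing to compare against
    failures = []
    for metric in HIGHER_IS_BETTER:
        baseline = mean(run[metric] for run in window)
        if degradation(metric, baseline, tonight[metric]) > THRESHOLD:
            failures.append(metric)
    return failures
```

A non-empty return value is what would fail the pipeline and trigger the Discord alert; wiring that up is left out here because the alerting interface is not described in this document.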
What it unblocks
- Autonomy tier promotion from T1 to T2. Per trust-model.md, T2 requires “20 successful runs, zero rollbacks, quality metrics within tolerance.” That sentence is meaningless without a fixed definition of “successful run” — the nightly suite is that definition.
- Per-PR regression gate (follow-up Phase 2) can be hung on the same scaffold: PRs touching prompts or brain content trigger a subset of the nightly suite before merge.
- Property-based testing on labeled changes (Phase 2) uses the same harness plumbing.
- LLM-as-judge sampling (10% T0, 100% T2) is a second consumer of the nightly pipeline.
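To make the T2 promotion rule concrete, here is one possible reading of "20 successful runs, zero rollbacks, quality metrics within tolerance" as a check. The quoted sentence does not say whether the 20 runs must be consecutive or what window they are counted over, so this sketch assumes a simple total over whatever records are passed in. `RunRecord` and `eligible_for_t2` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """Outcome of one graded run. Field names are hypothetical."""

    passed: bool            # graded "successful" by the golden-suite rubric
    rolled_back: bool       # the resulting change was later rolled back
    within_tolerance: bool  # the nightly regression gate stayed green


def eligible_for_t2(runs: list[RunRecord], required_successes: int = 20) -> bool:
    """Assumed reading: enough passing runs in total, no rollbacks at all,
    and every run within the quality tolerance."""
    successes = sum(1 for run in runs if run.passed)
    return (
        successes >= required_successes
        and not any(run.rolled_back for run in runs)
        and all(run.within_tolerance for run in runs)
    )
```

Whatever reading is chosen, the point of the nightly suite is that `passed` is computed by a fixed rubric rather than by judgment after the fact.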
Out of scope
- Quarterly LiveCodeBench (deferred per quality-and-evaluation.md — drop first if budget tightens)
- Inspect AI adoption (deferred — emerging; re-evaluate in Era 2)
- DORA metrics adapted to agents (Phase 2 follow-up; same pipeline, different consumer)
Priority
Medium-high. Sequenced after Priority 3 so the nightly run’s spend is itself bounded by a virtual key. Can start the golden-suite curation in parallel with Priorities 1–3; the harness wiring waits.
Estimated effort
- AC 1 (golden suite seeding — 10 tasks): ~5 days. Task selection, fixture capture, rubric writing. Hardest step — needs real engineering taste.
- AC 2 (harness CronJob): ~5 days. Kustomize manifest + agent-runtime invocation loop + Phoenix span emission.
- AC 3 (Grafana dashboard): ~2 days. TraceQL over Phoenix/Langfuse span attributes.
- AC 4 (regression threshold + alert): ~3 days. Projection + Discord webhook.
- AC 5 (regression-cases discipline): ongoing, process only — no code.
- AC 6 (SWE-bench Pro weekly): ~3 days — or defer.
- AC 7 (cost attribution via virtual key): ~1 day.
Total: ~2.5 weeks focused, sequential after Priority 3.
Budget
Per quality-and-evaluation.md: ~$3–8/night × 365 = $1.1–2.9k/year for the golden suite. ~$20–40/week × 52 = $1.0–2.1k/year for SWE-bench Pro. Total ceiling ~$5k/year — tracked via Priority 3 virtual key.
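The budget figures can be reproduced with a few lines of arithmetic. The per-night and per-week ranges are the ones quoted above; the variable names are only illustrative.

```python
NIGHTLY_USD = (3, 8)    # golden suite, low and high estimate per night
WEEKLY_USD = (20, 40)   # SWE-bench Pro subset, low and high estimate per week

golden_per_year = tuple(cost * 365 for cost in NIGHTLY_USD)  # (1095, 2920)
swe_per_year = tuple(cost * 52 for cost in WEEKLY_USD)       # (1040, 2080)
total_per_year = tuple(
    g + s for g, s in zip(golden_per_year, swe_per_year)
)                                                            # (2135, 5000)

print(f"golden suite:  ${golden_per_year[0]:,} to ${golden_per_year[1]:,} per year")
print(f"SWE-bench Pro: ${swe_per_year[0]:,} to ${swe_per_year[1]:,} per year")
print(f"combined:      ${total_per_year[0]:,} to ${total_per_year[1]:,} per year")
```

The high end of both ranges together comes to $5,000, which is where the ~$5k/year ceiling comes from; the low end is roughly $2.1k/year.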