User story: nightly quality gate — golden suite as regression blocker

User story

As the rig operator

I want a nightly evaluation harness that runs the current rig against a fixed task set and fails the pipeline on >10% regression on any tracked metric

So that prompt changes, dependency bumps, and model upgrades cannot silently degrade quality; autonomy-tier promotion (T1 → T2 → T3) has a defensible data source; and the headline claim “~20 min issue→merge, ~$0.62/task” is an invariant with guard rails, not a snapshot.

Context

See whats-next whitepaper §Priority 4 and the source quality-and-evaluation.md whitepaper.

Today the rig has zero automated quality gates. Every change is a hope — “probably fine, CI was green.” The source whitepaper is explicit: “the agents are doing well” is not evidence; a dashboard line is. Implementation-status tracks 7 quality-and-evaluation capabilities, 0 deployed.

The golden suite is not SWE-bench; it is our own tasks, the ones the rig actually runs in production, fixtured and replayable. SWE-bench Pro is a separate, weekly, broader check for trend lines. Per-incident regression cases grow the suite organically: every bug the rig misses becomes a new test.
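To make "fixtured and replayable" concrete, here is a minimal sketch of how one golden task and its grading could be represented. It uses the fixture fields (starting SHA, issue body, expected diff scope) and rubric items (compile clean, tests pass, acceptance-criteria checklist) listed under Acceptance criteria. All class, field, and function names are hypothetical, and treating an out-of-scope diff as a failure is an assumption, not something the whitepaper specifies.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class GoldenTask:
    """One replayable task in the golden suite (all names are hypothetical)."""
    task_id: str
    repo: str                              # e.g. "rig-conductor", "rig-docs", "rig-gitops"
    starting_sha: str                      # commit the agent starts from
    issue_body: str                        # issue text handed to the agent
    expected_diff_scope: tuple[str, ...]   # path prefixes the diff is expected to touch
    checklist: tuple[str, ...] = ()        # acceptance-criteria items to grade


@dataclass
class RunResult:
    """What one agent run produced for one task."""
    compiled: bool
    tests_passed: bool
    touched_paths: list[str]
    checklist_met: dict[str, bool] = field(default_factory=dict)


def grade(task: GoldenTask, result: RunResult) -> bool:
    """Pass only if the build is clean, tests pass, every touched path is
    inside the expected diff scope (an assumption), and every checklist
    item is marked as met."""
    in_scope = all(
        any(path.startswith(prefix) for prefix in task.expected_diff_scope)
        for path in result.touched_paths
    )
    checklist_ok = all(result.checklist_met.get(item, False) for item in task.checklist)
    return result.compiled and result.tests_passed and in_scope and checklist_ok
```

A binary pass/fail per task keeps the suite-level pass rate trivial to compute; a weighted rubric would be an equally valid choice.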

Acceptance criteria

⏳ Golden suite of 10 internal tasks — seeded from real merged issues across rig-conductor, rig-docs, rig-gitops. Each task has a fixture (starting SHA, issue body, expected diff scope) and a grading rubric (compile clean, tests pass, acceptance-criteria checklist).
⏳ Nightly harness deployed as a CronJob — runs at 02:00 UTC (low contention with dispatch). Spins up ephemeral agent runtimes against the golden suite; posts results to Phoenix (via Priority 2) and the rig-conductor event store.
⏳ Grafana dashboard for the 10-task trend — pass rate, median latency, median cost, median turns. Available for screenshot in weekly reviews.
⏳ Regression threshold — a drop of more than 10% in pass rate, or a rise of more than 10% in median cost or median latency, versus the rolling 7-day baseline fails the pipeline and sends an alert to Discord.
⏳ Per-incident regression cases — every production bug the rig misses gets a row in regression-cases/ + a new task in the nightly suite within the same PR that fixes it. Discipline, not automation.
⏳ Weekly SWE-bench Pro subset — ~20 tasks, runs Saturday 02:00 UTC; budget ~$20–40/week; trend line only, does not block the pipeline. Dropped first if budget tightens.
⏳ Cost dashboard row — the nightly run’s spend is itself tracked via Priority 3 virtual keys; nightly-eval becomes a rig user for attribution.
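The regression-threshold criterion above can be sketched as a small pure function. This is an illustration under stated assumptions, not the harness implementation: the metric keys are hypothetical, the baseline is taken as the mean of the last seven nightly values, and "10%" is read as a relative change rather than percentage points.

```python
from statistics import mean

THRESHOLD = 0.10  # the >10% threshold from the acceptance criteria

# Hypothetical metric keys; True means a higher value is better.
HIGHER_IS_BETTER = {"pass_rate": True, "median_cost": False, "median_latency": False}


def degradation(metric, baseline, tonight):
    """Relative change versus baseline, signed so that positive means worse."""
    if baseline == 0:
        return 0.0  # assumption: with no baseline signal, do not flag
    change = (tonight - baseline) / baseline
    return -change if HIGHER_IS_BETTER[metric] else change


def find_regressions(history, tonight):
    """Return {metric: degradation} for every metric that got worse by more
    than THRESHOLD versus the mean of the last 7 entries in `history`."""
    window = history[-7:]
    if not window:
        return {}  # assumption: the first run has nothing to compare against
    flagged = {}
    for metric in HIGHER_IS_BETTER:
        baseline = mean(night[metric] for night in window)
        worse_by = degradation(metric, baseline, tonight[metric])
        if worse_by > THRESHOLD:
            flagged[metric] = worse_by
    return flagged
```

A nightly entrypoint could exit non-zero whenever `find_regressions` returns a non-empty dict, which is what fails the pipeline; the Discord alert is left out of the sketch.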

What it unblocks

Autonomy tier promotion from T1 to T2. Per trust-model.md, T2 requires “20 successful runs, zero rollbacks, quality metrics within tolerance.” That sentence is meaningless without a fixed definition of “successful run” — the nightly suite is that definition.
Per-PR regression gate (follow-up Phase 2) can be hung on the same scaffold: PRs touching prompts or brain content trigger a subset of the nightly suite before merge.
Property-based testing on labeled changes (Phase 2) uses the same harness plumbing.
LLM-as-judge sampling (10% T0, 100% T2) is a second consumer of the nightly pipeline.
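Once the nightly suite defines a "successful run", the T2 sentence quoted from trust-model.md becomes mechanically checkable. The sketch below is one possible reading: the record fields and function name are hypothetical, and counting total successes (rather than 20 consecutive ones) is an assumption, since the quoted sentence does not say which is meant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One graded run (hypothetical fields)."""
    passed: bool            # counted as successful by the suite's rubric
    rolled_back: bool       # a rollback was attributed to this run
    within_tolerance: bool  # no tracked metric breached the regression threshold


def eligible_for_t2(records, required_runs=20):
    """True when there are at least `required_runs` successful runs,
    zero rollbacks, and every run stayed within tolerance."""
    successes = sum(1 for r in records if r.passed)
    no_rollbacks = not any(r.rolled_back for r in records)
    in_tolerance = all(r.within_tolerance for r in records)
    return successes >= required_runs and no_rollbacks and in_tolerance
```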

Out of scope

Quarterly LiveCodeBench (deferred per quality-and-evaluation.md — drop first if budget tightens)
Inspect AI adoption (deferred — emerging; re-evaluate in Era 2)
DORA metrics adapted to agents (Phase 2 follow-up; same pipeline, different consumer)

Priority

Medium-high. Sequenced after Priority 3 so the nightly run’s spend is itself bounded by a virtual key. Can start the golden-suite curation in parallel with Priorities 1–3; the harness wiring waits.

Estimated effort

AC 1 (golden suite seeding — 10 tasks): ~5 days. Task selection, fixture capture, rubric writing. Hardest step — needs real engineering taste.
AC 2 (harness CronJob): ~5 days. Kustomize manifest + agent-runtime invocation loop + Phoenix span emission.
AC 3 (Grafana dashboard): ~2 days. TraceQL over Phoenix/Langfuse span attributes.
AC 4 (regression threshold + alert): ~3 days. Projection + Discord webhook.
AC 5 (regression-cases discipline): ongoing, process only — no code.
AC 6 (SWE-bench Pro weekly): ~3 days — or defer.
AC 7 (cost attribution via virtual key): ~1 day.

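Total: ~2.5 weeks focused, sequential after Priority 3. The itemized estimates sum to ~19 working days (~16 with AC 6 deferred), so the ~2.5-week figure only holds if the AC 1 curation has already been done in parallel with Priorities 1–3.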

Budget

Per quality-and-evaluation.md: ~$3–8/night × 365 = $1.1–2.9k/year for the golden suite. ~$20–40/week × 52 = $1.0–2.1k/year for SWE-bench Pro. Total ceiling ~$5k/year — tracked via Priority 3 virtual key.
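The budget arithmetic can be checked with a few lines. The per-night and per-week figures come from the paragraph above; the function and variable names are illustrative only.

```python
def annual_range(low, high, periods_per_year):
    """Scale a (low, high) per-period cost range to a yearly range in USD."""
    return (low * periods_per_year, high * periods_per_year)


golden = annual_range(3, 8, 365)       # nightly golden suite
swe_bench = annual_range(20, 40, 52)   # weekly SWE-bench Pro subset
total = (golden[0] + swe_bench[0], golden[1] + swe_bench[1])

print(golden, swe_bench, total)
# prints: (1095, 2920) (1040, 2080) (2135, 5000)
```

The upper end of the combined range lands on $5,000, which matches the stated ceiling; the lower end is about $2.1k.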