Postmortem — main-guard false positive on `cancelled` runs (2026-05-15)
main-guard (Phase 1) treated a workflow run with `conclusion: cancelled` as a main-red incident. The cancellation came from `concurrency.cancel-in-progress: true` when the forward-fix PR superseded the prior build; main was already green at the time of escalation.
| Item | Value |
|---|---|
| Reported incident | rig-docs#267 — “no fix within 30m for PR #264” |
| Forward-fix issue | rig-docs#265 (already closed) |
| Cited “failing run” | 25933361209 on commit c633016 (the merge commit of PR #264) |
| Actual run conclusion | cancelled (per issue body: Conclusion: cancelled) |
| State of main at escalation | Green — npm ci && AUTO_DERIVE=off npm run brain:check && npm run build && node scripts/check-whitepaper-routes.mjs all pass on c633016 |
| Forward fix needed | None — PR #264 already corrected the github_issue URL regression from PR #263 |
| Doc fix | This postmortem; close rig-docs#267 as not-a-bug |
| Systemic fix | main-guard must filter conclusion in {failure, timed_out, startup_failure} and ignore {success, cancelled, neutral, skipped, action_required} |
What happened
- PR #263 merged with `github_issue: "dashecorp/rig-docs#262"`, a bare reference rather than a URL. The Zod schema in `src/content.config.ts` enforces `z.string().url()`, so the post-merge `Build and deploy` workflow on `main` failed.
- main-guard filed `rig-docs#265` (the forward-fix issue).
- PR #264 was opened by `dev-e-bot`: a single-line fix replacing the bare reference with `https://github.com/dashecorp/rig-docs/issues/262`. Merged at `2026-05-09T05:55:35Z`.
- The `Build and deploy` run on PR #264’s merge commit (`c633016`) was cancelled by `concurrency.cancel-in-progress: true` in `deploy.yml`. The next push (or scheduled hourly run) immediately superseded it. The replacement run was green; the site deployed.
- main-guard sampled the cancelled run, classified it as “still red,” waited for the 30-minute SLA, and filed `rig-docs#267`, even though `rig-docs#265` was already closed and main was green.
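The one-line fix in PR #264 amounts to expanding a bare `owner/repo#number` reference into a full issue URL. A minimal sketch of that expansion follows. The helper `toIssueUrl` is hypothetical and not part of rig-docs, and the URL shape is assumed from the corrected value quoted above:

```typescript
// Hypothetical helper (not part of rig-docs): expand a bare "owner/repo#123"
// reference into the full issue URL form that a URL-validating schema accepts.
const BARE_ISSUE_REF = /^([\w.-]+)\/([\w.-]+)#(\d+)$/;

function toIssueUrl(ref: string): string {
  // Values that already carry an http(s) scheme are returned untouched.
  if (/^https?:\/\//.test(ref)) {
    return ref;
  }
  const match = BARE_ISSUE_REF.exec(ref);
  if (match === null) {
    throw new Error(`Not a bare issue reference or URL: ${ref}`);
  }
  const [, owner, repo, issueNumber] = match;
  return `https://github.com/${owner}/${repo}/issues/${issueNumber}`;
}
```

Throwing on unrecognised input, rather than passing it through, keeps a malformed value from reaching the schema check unnoticed.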
Reproduction
```sh
git fetch origin main
git checkout c6330160fab857526238c5e52a0bd4a45d5c570e
npm ci
AUTO_DERIVE=off npm run brain:check
npm run build
node scripts/check-whitepaper-routes.mjs
```

All four commands exit 0.
Why the misclassification matters
| Conclusion | Meaning | Should escalate? |
|---|---|---|
| `success` | Job passed | No |
| `failure` | Job failed | Yes |
| `timed_out` | Job exceeded timeout | Yes |
| `startup_failure` | Runner / setup failed | Yes |
| `cancelled` | Operator-cancelled or superseded via `cancel-in-progress` | No; almost always replaced by a newer run |
| `neutral`, `skipped`, `action_required` | Informational | No |
Treating `cancelled` as red guarantees a false positive every time a `cancel-in-progress` concurrency group fires, which is a common pattern for `Build and deploy`-style workflows. main-guard’s sampling must look at the latest run for the SHA, not any historical run, and must escalate only when `conclusion` is one of `failure`, `timed_out`, or `startup_failure`.
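The table above can be sketched as a small classifier. The type and function names are illustrative and are not main-guard’s actual API. Treating `null` as "run not finished yet" is also an assumption:

```typescript
// Conclusion values come from the table above; the names below are
// illustrative, not main-guard's actual API.
type Conclusion =
  | "success"
  | "failure"
  | "timed_out"
  | "startup_failure"
  | "cancelled"
  | "neutral"
  | "skipped"
  | "action_required";

// Only these conclusions mean the build itself went wrong.
const RED_CONCLUSIONS: ReadonlySet<Conclusion> = new Set<Conclusion>([
  "failure",
  "timed_out",
  "startup_failure",
]);

function shouldEscalate(conclusion: Conclusion | null): boolean {
  // Assumption: null models a run that has not finished; nothing to escalate yet.
  return conclusion !== null && RED_CONCLUSIONS.has(conclusion);
}
```

An allow-list of red conclusions, rather than a deny-list of green ones, means any conclusion value not listed here defaults to "do not escalate".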
Recommended fix (in rig-conductor)
Track in `dashecorp/rig-conductor` (Phase 1 main-guard implementation):

- Resolve the latest `workflow_run` for the head SHA of `main`, not the first failure encountered.
- Map conclusions:
  - Escalate only on `failure | timed_out | startup_failure`.
  - Treat `cancelled | success | neutral | skipped | action_required` as not-red.
- Before filing the SLA-breach incident, re-check the latest run on `main` HEAD. If it’s green, skip filing.
- Log the run id and conclusion in the incident body so the receiving agent can audit the decision.
This postmortem is the artifact for filing that systemic fix as a rig-conductor issue.
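The recommended steps could look roughly like the sketch below. It assumes a hypothetical `WorkflowRun` shape and a hypothetical `decideEscalation` function; neither is taken from rig-conductor, and the field names do not mirror any specific API response:

```typescript
// Hypothetical shapes: field names are illustrative and are not taken from
// rig-conductor or from any specific API response.
interface WorkflowRun {
  id: number;
  headSha: string;
  createdAt: string; // ISO 8601 timestamp
  conclusion: string | null; // null while the run is still in progress
}

interface Decision {
  escalate: boolean;
  auditLine: string; // intended for the incident body
}

const RED = new Set(["failure", "timed_out", "startup_failure"]);

// Return the most recently created run for the given SHA, or undefined.
function latestRunForSha(runs: WorkflowRun[], sha: string): WorkflowRun | undefined {
  return runs
    .filter((run) => run.headSha === sha) // filter() copies, so the input is not reordered
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt))[0];
}

// Decide from the latest run only, never from an older run on the same SHA.
function decideEscalation(runs: WorkflowRun[], headSha: string): Decision {
  const latest = latestRunForSha(runs, headSha);
  if (latest === undefined) {
    return { escalate: false, auditLine: `no run found for ${headSha}` };
  }
  const escalate = latest.conclusion !== null && RED.has(latest.conclusion);
  return {
    escalate,
    auditLine: `run ${latest.id} conclusion=${latest.conclusion ?? "pending"}`,
  };
}
```

Calling `decideEscalation` once when the SLA timer starts and again immediately before filing would cover the re-check step, and `auditLine` gives the receiving agent the run id and conclusion the decision was based on.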
Closing actions for this incident
- This PR closes `rig-docs#267`: no forward fix needed; main is already green.
- No revert of PR #264: reverting would re-introduce the original Zod URL-validation regression from PR #263.
- File a follow-up against `rig-conductor` referencing this postmortem so the conclusion-filter fix lands in main-guard Phase 1.