Postmortem — main-guard false positive on `cancelled` runs (2026-05-15)

main-guard (Phase 1) treated a workflow run with `conclusion: cancelled` as a main-red incident. The run was cancelled by `concurrency.cancel-in-progress: true` when a newer run superseded the build on the forward-fix merge commit; main was already green at the time of escalation.

| Item | Value |
| --- | --- |
| Reported incident | rig-docs#267: “no fix within 30m for PR #264” |
| Forward-fix issue | rig-docs#265 (already closed) |
| Cited “failing run” | 25933361209 on commit c633016 (the merge commit of PR #264) |
| Actual run conclusion | `cancelled` (per issue body: `Conclusion: cancelled`) |
| State of main at escalation | Green: `npm ci && AUTO_DERIVE=off npm run brain:check && npm run build && node scripts/check-whitepaper-routes.mjs` all pass on c633016 |
| Forward fix needed | None; PR #264 already corrected the github_issue URL regression from PR #263 |
| Doc fix | This postmortem; close rig-docs#267 as not-a-bug |
| Systemic fix | main-guard must filter conclusion in {failure, timed_out, startup_failure} and ignore {success, cancelled, neutral, skipped, action_required} |

  1. PR #263 merged with github_issue: "dashecorp/rig-docs#262" — a bare reference, not a URL. The Zod schema in src/content.config.ts enforces z.string().url(), so the post-merge Build and deploy workflow on main failed.
  2. main-guard filed rig-docs#265 (the forward-fix issue).
  3. PR #264 opened by dev-e-bot — single-line fix replacing the bare reference with https://github.com/dashecorp/rig-docs/issues/262. Merged at 2026-05-09T05:55:35Z.
  4. The Build and deploy run on PR #264’s merge commit (c633016) was cancelled by concurrency.cancel-in-progress: true in deploy.yml when the next push (or scheduled hourly run) superseded it. The replacement run was green and the site deployed.
  5. main-guard sampled the cancelled run, classified it as “still red,” waited for the 30-minute SLA, and filed rig-docs#267 — even though rig-docs#265 was already closed and main was green.
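
The validation failure in step 1 can be illustrated without the repository. The sketch below is an approximation, not the schema from src/content.config.ts: it uses the WHATWG `URL` constructor as a stand-in for the `z.string().url()` check, and `isAbsoluteUrl` is a hypothetical helper name.

```typescript
// Hypothetical helper: approximates a URL-format check with the WHATWG URL
// parser. This is a stand-in for illustration, not the Zod schema itself.
function isAbsoluteUrl(value: string): boolean {
  try {
    new URL(value); // throws when the string cannot be parsed as an absolute URL
    return true;
  } catch {
    return false;
  }
}

// Value merged in PR #263: a bare issue reference, not a URL.
const bareReference = "dashecorp/rig-docs#262";
// Value after PR #264: the full issue URL.
const fullUrl = "https://github.com/dashecorp/rig-docs/issues/262";

console.log(isAbsoluteUrl(bareReference)); // false
console.log(isAbsoluteUrl(fullUrl)); // true
```

The bare reference has no scheme, so it cannot be parsed as an absolute URL; the one-line fix in PR #264 supplies the scheme and host.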

To verify that main is green at c633016:

```sh
git fetch origin main
git checkout c6330160fab857526238c5e52a0bd4a45d5c570e
npm ci
AUTO_DERIVE=off npm run brain:check
npm run build
node scripts/check-whitepaper-routes.mjs
```

All four commands exit 0.

| Conclusion | Meaning | Should escalate? |
| --- | --- | --- |
| `success` | Job passed | No |
| `failure` | Job failed | Yes |
| `timed_out` | Job exceeded timeout | Yes |
| `startup_failure` | Runner / setup failed | Yes |
| `cancelled` | Operator-cancelled or superseded via cancel-in-progress | No; almost always replaced by a newer run |
| `neutral`, `skipped`, `action_required` | Informational | No |

Treating cancelled as red guarantees a false positive every time a cancel-in-progress concurrency group fires, which is a common pattern for Build and deploy-style workflows. main-guard’s sampling must look at the latest run for the SHA, not any historical run, and must escalate only when the conclusion is in {failure, timed_out, startup_failure}.
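
The mapping in the table above can be written as a small predicate. This is a minimal sketch, not the main-guard implementation; the names `Conclusion`, `RED_CONCLUSIONS`, and `shouldEscalate` are hypothetical.

```typescript
// Conclusion values listed in the table above.
type Conclusion =
  | "success"
  | "failure"
  | "timed_out"
  | "startup_failure"
  | "cancelled"
  | "neutral"
  | "skipped"
  | "action_required";

// Only these conclusions mean the run itself went red.
const RED_CONCLUSIONS: ReadonlySet<Conclusion> = new Set<Conclusion>([
  "failure",
  "timed_out",
  "startup_failure",
]);

// Hypothetical helper: true only for conclusions that should escalate.
function shouldEscalate(conclusion: Conclusion): boolean {
  return RED_CONCLUSIONS.has(conclusion);
}

console.log(shouldEscalate("cancelled")); // false
console.log(shouldEscalate("failure")); // true
```

Using an allow-list of red conclusions, rather than a deny-list of green ones, means a conclusion missing from the set defaults to not escalating.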

Track in dashecorp/rig-conductor (Phase 1 main-guard implementation):

  1. Resolve the latest workflow_run for the head SHA of main, not the first failure encountered.
  2. Map conclusions:
    • Escalate only on failure | timed_out | startup_failure.
    • Treat cancelled | success | neutral | skipped | action_required as not-red.
  3. Before filing the SLA-breach incident, re-check the latest run on main HEAD. If it is green, do not file the incident.
  4. Log the run id and conclusion in the incident body so the receiving agent can audit the decision.
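
The four steps above could be combined roughly as follows. This is a sketch under stated assumptions, not the rig-conductor code: the `WorkflowRun` shape, its field names, and the functions `latestRunForSha` and `decideIncident` are hypothetical, and treating a `null` conclusion (run still in progress) as not-red is an assumption. The sample run ids, SHA, and timestamps are invented for the example.

```typescript
// Hypothetical shape for illustration; the field names are assumptions,
// not the rig-conductor data model.
interface WorkflowRun {
  id: number;
  headSha: string;
  createdAt: string; // ISO 8601 timestamp
  conclusion: string | null; // null while the run is still in progress (assumed)
}

const RED = new Set(["failure", "timed_out", "startup_failure"]);

// Step 1: pick the most recent run for the SHA, not the first failure seen.
function latestRunForSha(runs: WorkflowRun[], sha: string): WorkflowRun | undefined {
  return runs
    .filter((run) => run.headSha === sha)
    .sort((a, b) => Date.parse(b.createdAt) - Date.parse(a.createdAt))[0];
}

// Steps 2-4: map the conclusion, decide whether to file, and record the
// run id and conclusion so the decision can be audited. Call this again
// with freshly fetched runs immediately before filing (step 3).
function decideIncident(runs: WorkflowRun[], headSha: string): { file: boolean; body: string } {
  const latest = latestRunForSha(runs, headSha);
  if (latest === undefined) {
    return { file: false, body: `no runs found for ${headSha}` };
  }
  const red = latest.conclusion !== null && RED.has(latest.conclusion);
  return {
    file: red,
    body: `run ${latest.id} on ${headSha}: conclusion=${latest.conclusion ?? "pending"}`,
  };
}

// Invented sample data: a cancelled run followed by a green replacement.
const sampleRuns: WorkflowRun[] = [
  { id: 101, headSha: "abc1234", createdAt: "2026-01-01T00:00:00Z", conclusion: "cancelled" },
  { id: 102, headSha: "abc1234", createdAt: "2026-01-01T00:05:00Z", conclusion: "success" },
];
console.log(decideIncident(sampleRuns, "abc1234"));
```

With the sample data, the decision is based on run 102 (the newest), so no incident is filed. If only the cancelled run existed, the result would still be `file: false`, because `cancelled` is not in the red set.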

This postmortem is the artifact for filing that systemic fix as a rig-conductor issue.

  • This PR closes rig-docs#267 — no forward fix needed; main is already green.
  • No revert of PR #264 — reverting would re-introduce the original Zod URL-validation regression from PR #263.
  • File a follow-up against rig-conductor referencing this postmortem so the conclusion-filter fix lands in main-guard Phase 1.