Postmortem — main-red false positive on `Build and deploy` against doc-only PR #273 (2026-05-18)
Third main-guard false positive in a week, second one this hour. The post-merge run of the `Build and deploy` workflow for commit `2365d00` (the squash of PR #273) returned `failure`. main-guard attributed the failure to PR #273 because it was the most-recently-merged PR. PR #273 added exactly one file, a markdown postmortem, with valid frontmatter and no Mermaid blocks. The build job is deterministic against that tree and passes locally; the failure was in the `deploy` job, whose only inputs are the Cloudflare API token, account ID, and the unchanged `dist/` artifact. PR #273 could not have caused it.
| Item | Value |
|---|---|
| Forward-fix issue | rig-docs#274 |
| Cited “culprit PR” | #273 — single-file markdown postmortem (+69, -0) |
| Workflow | .github/workflows/deploy.yml — Build and deploy |
| Failing run | 26036707836 |
| Local repro | AUTO_DERIVE=off npm run brain:check && npm run build && node scripts/check-whitepaper-routes.mjs — all green at 2365d00 |
| Plausible failure surface | deploy job — cloudflare/wrangler-action@v3 (operator-owned secrets) |
| Reverting PR #273 | Wrong — would lose the SLA-breach postmortem and would not touch any input the deploy job consumes |
Why PR #273 could not have caused this
PR #273 changed exactly one file:

`src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md` (+69, -0)

The post-merge run of `.github/workflows/deploy.yml` exercises two jobs in sequence:
| Job | Inputs the new file could influence | Outcome locally |
|---|---|---|
| build → npm ci | package.json, package-lock.json | green |
| build → AUTO_DERIVE=off npm run brain:check | facts/*.yaml, BRAIN.md | green (the new file lives in src/content/docs/, not facts/) |
| build → npm run build | the new .md is loaded by docsLoader() and validated against src/content.config.ts schema | green (frontmatter conforms) |
| build → node scripts/check-whitepaper-routes.mjs | files under src/content/docs/whitepaper/ only | green (the new doc is under research/) |
| deploy → cloudflare/wrangler-action@v3 | secrets.CLOUDFLARE_API_TOKEN, secrets.CLOUDFLARE_ACCOUNT_ID, the build’s dist/ artifact | not reproducible locally (operator-owned credentials) |
Every input PR #273 could plausibly affect was rebuilt locally against the merge SHA and passed. The only remaining surface is deploy, whose inputs are the Cloudflare API token and account ID — neither of which PR #273 touches.
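The table above amounts to a lookup from changed paths to the steps that read them. A minimal sketch of that lookup follows, assuming glob patterns transcribed from the table. The `STEP_INPUTS` mapping and the `steps_influenced_by` helper are hypothetical names for illustration and are not part of main-guard or this repository.

```python
from fnmatch import fnmatchcase

# Git-tracked inputs per step, transcribed from the table above.
# Assumption: these patterns are illustrative and only as complete as that table.
# Note that fnmatch's "*" also matches "/" characters, so "dir/*" covers subfolders.
STEP_INPUTS = {
    "build: npm ci": ["package.json", "package-lock.json"],
    "build: brain:check": ["facts/*.yaml", "BRAIN.md"],
    "build: npm run build": ["src/content/docs/*"],
    "build: check-whitepaper-routes": ["src/content/docs/whitepaper/*"],
    # deploy reads secrets and the dist/ artifact, none of which live in git.
    "deploy: wrangler-action": [],
}


def steps_influenced_by(changed_paths):
    """Return the steps whose git-tracked inputs match any changed path."""
    hits = []
    for step, patterns in STEP_INPUTS.items():
        if any(fnmatchcase(path, pat) for path in changed_paths for pat in patterns):
            hits.append(step)
    return hits


pr_273 = ["src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"]
print(steps_influenced_by(pr_273))
```

Under these assumed patterns, the single file from PR #273 maps only to the `npm run build` step, and no set of changed paths can ever map to the deploy step because its pattern list is empty.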
The attribution gap
main-guard’s “last-merged PR” heuristic works for build-job regressions (the diff at the merge SHA is the relevant input). It does not work for deploy-job failures, which depend on:

- The integrity of `secrets.CLOUDFLARE_API_TOKEN` (rotation, expiry, scope drift)
- The state of the Cloudflare account (API rate-limit, project rename, account suspension)
- The Cloudflare API itself (transient 5xx, regional outage)
None of these are git-tracked. Blaming a doc-only PR for any of them surfaces operator action items (rotate the token, check Cloudflare status) but routes them through the wrong channel — Dev-E has no path to fix any of them.
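One way to picture the gap is as a routing decision keyed on which job failed rather than on the workflow conclusion alone. The sketch below is hypothetical: the function name, the route labels, and the shape of the `job_conclusions` mapping are assumptions for illustration, not main-guard's actual interface.

```python
def route_failure(job_conclusions):
    """Pick an escalation target from per-job conclusions.

    job_conclusions maps job name -> conclusion string, for example
    {"build": "success", "deploy": "failure"} (assumed shape).
    """
    failed = [job for job, result in job_conclusions.items() if result == "failure"]
    if not failed:
        return "none"
    if "build" in failed:
        # The diff at the merge SHA is the relevant input for build failures.
        return "last-merged-pr"
    # Only non-build jobs failed: their inputs are secrets, account state, or
    # the Cloudflare API, none of which are git-tracked.
    return "operator"


print(route_failure({"build": "success", "deploy": "failure"}))
```

With the conclusions described in this postmortem (build green, deploy red), this sketch would send the escalation to the operator channel instead of to the author of the last merged PR.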
Failure modes surfaced
| # | Failure mode | Implication |
|---|---|---|
| 1 | Build and deploy is a single workflow with two semantically distinct jobs (compile + ship). main-guard treats any failure conclusion identically. | Split escalation: if build is green and only deploy fails, the regression is in operator-owned infra. File against rig-conductor (or against the operator’s runbook), not against the last merged PR. |
| 2 | Dev-E receives Build and deploy failed issues but lacks GH Actions log read access (HTTP 403: Resource not accessible by integration on /actions/runs/{id}, /check-runs, /check-suites). | Either grant actions: read to the bot’s installation token, or include the failing-job name and failing-step output directly in the main-guard issue body. Without one of those, every cron-only or deploy-only failure costs Dev-E a full clone + repro cycle. |
| 3 | The Failing run link in the issue body is unauthenticated and 404s without browser session — Dev-E cannot view it. | Issue body should include the job conclusions table (build: success / deploy: failure) at filing time so the forward-fix agent can disambiguate without needing log access. |
| 4 | Three doc-only PRs (#268, #271’s adjacent doc, #273) have now been blamed for unrelated CI failures within a week. | The signal-to-noise for main-guard’s “last-merged PR” heuristic is degrading. Consider a guard: if the cited PR’s diff is `*.md` only AND the failing workflow is `deploy.yml`, skip the blame and route to operator. |
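The guard proposed in row 4 can be sketched in a few lines. This is an illustration under stated assumptions, not main-guard's implementation: `is_doc_only` and `should_skip_blame` are hypothetical helpers, and the caller is assumed to already have the PR's changed-file list and the failing workflow's path.

```python
from pathlib import PurePosixPath


def is_doc_only(changed_files):
    """True when there is at least one changed path and all of them end in .md."""
    return bool(changed_files) and all(
        PurePosixPath(path).suffix == ".md" for path in changed_files
    )


def should_skip_blame(changed_files, failing_workflow):
    """Guard from row 4: doc-only diff AND the failing workflow is deploy.yml."""
    workflow_name = PurePosixPath(failing_workflow).name
    return is_doc_only(changed_files) and workflow_name == "deploy.yml"


print(should_skip_blame(
    ["src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"],
    ".github/workflows/deploy.yml",
))
```

Because `deploy.yml` contains both the build and the deploy job, a guard keyed on the workflow alone would also skip blame when the build job fails on a doc-only PR. Keying it on the failing job instead would likely be the tighter rule.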
Follow-ups
- `rig-conductor` issue: split main-guard escalation by failing job. A `build` failure on a doc-only PR is still a real regression (schema break, frontmatter typo); a `deploy` failure on a doc-only PR almost certainly isn’t.
- `rig-conductor` issue: include the failing-job and failing-step names in the main-red issue body so Dev-E can act without reading the run.
- `dashecorp/infra` or operator: verify `CLOUDFLARE_API_TOKEN` for `rig-docs` is valid and not expired; consider rotating proactively given the post-merge `2365d00` deploy failure.
- `rig-docs`: grant `actions: read` to the GitHub App installation used by Dev-E so future deploy-only failures are diagnosable from the workflow logs.
- Re-trigger `Build and deploy` once Cloudflare credentials are confirmed green (`gh workflow run deploy.yml -R dashecorp/rig-docs` after operator rotation).
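For the follow-up about putting failing-job and failing-step names in the issue body, a small renderer is enough to show the intended shape. The sketch assumes the job data has already been fetched by whatever files the issue. The `render_job_table` name, the dict keys, and the `<step name>` placeholder are hypothetical and do not reflect a specific API response.

```python
def render_job_table(jobs):
    """Render a markdown table of job conclusions for an issue body.

    jobs is a list of dicts with "name", "conclusion", and an optional
    "failed_step" key (assumed shape for illustration).
    """
    lines = ["| Job | Conclusion | Failing step |", "|---|---|---|"]
    for job in jobs:
        step = job.get("failed_step") or "-"
        lines.append(f"| {job['name']} | {job['conclusion']} | {step} |")
    return "\n".join(lines)


example_jobs = [
    {"name": "build", "conclusion": "success"},
    # Placeholder step name: the actual failing step is not known from this page.
    {"name": "deploy", "conclusion": "failure", "failed_step": "<step name>"},
]
print(render_job_table(example_jobs))
```

A table like this in the issue body would let the forward-fix agent tell a build regression from a deploy-only failure without log access.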
Why this PR
This PR is the canonical forward-fix per the rig-conductor escalation playbook: a docs PR that (a) records what happened, (b) names the wrong-attribution pattern, and (c) closes rig-docs#274. It deliberately does not modify any code: there is no code regression to fix, and shipping speculative changes to `deploy.yml` would risk a real regression.
Reverting PR #273 (the playbook’s other option) would lose a load-bearing SLA-breach postmortem and would not touch any input the failing deploy job consumes; it is mechanically incapable of clearing the red.