Postmortem — main-red false positive on `Build and deploy` against doc-only PR #273 (2026-05-18)

This is the third main-guard false positive in a week and the second this hour. The post-merge run of the Build and deploy workflow for commit 2365d00 (the squash of PR #273) concluded with failure, and main-guard attributed it to PR #273 because that was the most recently merged PR. PR #273 added exactly one file, a markdown postmortem with valid frontmatter and no Mermaid blocks. The build job is deterministic against that tree and passes locally, so by elimination the failure sits in the deploy job (Dev-E cannot read the run logs to confirm this directly). That job's only inputs are the Cloudflare API token, the account ID, and the unchanged dist/ artifact. PR #273 could not have caused it.

TL;DR

Item	Value
Forward-fix issue	`rig-docs#274`
Cited “culprit PR”	`#273` — single-file markdown postmortem (+69, -0)
Workflow	`.github/workflows/deploy.yml` — Build and deploy
Failing run	`26036707836`
Local repro	`AUTO_DERIVE=off npm run brain:check && npm run build && node scripts/check-whitepaper-routes.mjs` — all green at `2365d00`
Plausible failure surface	`deploy` job — `cloudflare/wrangler-action@v3` (operator-owned secrets)
Reverting PR #273	Wrong — would lose the SLA-breach postmortem and would not touch any input the deploy job consumes

Why PR #273 could not have caused this

PR #273 changed exactly one file:

src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md  +69  -0

The post-merge run of .github/workflows/deploy.yml exercises two jobs in sequence:

Job → step	Inputs the step consumes	Outcome locally
`build` → `npm ci`	`package.json`, `package-lock.json`	green
`build` → `AUTO_DERIVE=off npm run brain:check`	`facts/*.yaml`, `BRAIN.md`	green (the new file lives in `src/content/docs/`, not `facts/`)
`build` → `npm run build`	the new `.md` is loaded by `docsLoader()` and validated against `src/content.config.ts` schema	green (frontmatter conforms)
`build` → `node scripts/check-whitepaper-routes.mjs`	files under `src/content/docs/whitepaper/` only	green (the new doc is under `research/`)
`deploy` → `cloudflare/wrangler-action@v3`	`secrets.CLOUDFLARE_API_TOKEN`, `secrets.CLOUDFLARE_ACCOUNT_ID`, the build’s `dist/` artifact	not reproducible locally — operator-owned credentials

Every step whose inputs PR #273 could plausibly affect was re-run locally against the merge SHA and passed. The only remaining surface is deploy, whose inputs are the Cloudflare credentials (API token and account ID), which PR #273 does not touch, and the dist/ artifact from a build that is green.
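The elimination argument above can be sketched as a lookup from changed paths to the steps that read them. This is a minimal illustration, not part of main-guard: the `STEP_INPUTS` map and the `steps_touched` helper are hypothetical names, and the glob patterns are one reading of the table above (including `src/content.config.ts` as a build input), not the workflow's actual path filters.

```python
from fnmatch import fnmatchcase

# Step -> glob patterns for its git-tracked inputs, read off the table above.
# Hypothetical patterns for illustration; not taken from deploy.yml itself.
STEP_INPUTS = {
    "build: npm ci": ["package.json", "package-lock.json"],
    "build: brain:check": ["facts/*.yaml", "BRAIN.md"],
    "build: npm run build": ["src/content/docs/*", "src/content.config.ts"],
    "build: check-whitepaper-routes": ["src/content/docs/whitepaper/*"],
    # deploy reads secrets and the dist/ artifact: nothing git-tracked.
    "deploy: wrangler-action": [],
}


def steps_touched(changed_paths):
    """Return the steps whose git-tracked inputs include any changed path."""
    return [
        step
        for step, patterns in STEP_INPUTS.items()
        # fnmatchcase's "*" also matches "/", so one pattern covers subdirectories.
        if any(fnmatchcase(path, pat) for path in changed_paths for pat in patterns)
    ]


pr_273 = ["src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"]
print(steps_touched(pr_273))
```

Under these assumed patterns, PR #273's single file reaches only the `npm run build` step, which is green locally, and never the deploy step.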

The attribution gap

main-guard’s “last-merged PR” heuristic works for build-job regressions (the diff at the merge SHA is the relevant input). It does not work for deploy-job failures, which depend on:

The integrity of secrets.CLOUDFLARE_API_TOKEN (rotation, expiry, scope drift)
The state of the Cloudflare account (API rate-limit, project rename, account suspension)
The Cloudflare API itself (transient 5xx, regional outage)

None of these are git-tracked. Blaming a doc-only PR for any of them surfaces operator action items (rotate the token, check Cloudflare status) but routes them through the wrong channel — Dev-E has no path to fix any of them.

Failure modes surfaced

#	Failure mode	Implication
1	`Build and deploy` is a single workflow with two semantically distinct jobs (compile + ship). main-guard treats any failure conclusion identically.	Split escalation: if `build` is green and only `deploy` fails, the regression is in operator-owned infra. File against `rig-conductor` (or against the operator’s runbook), not against the last merged PR.
2	Dev-E receives `Build and deploy failed` issues but lacks GH Actions log read access (`HTTP 403: Resource not accessible by integration` on `/actions/runs/{id}`, `/check-runs`, `/check-suites`).	Either grant `actions: read` to the bot’s installation token, or include the failing-job name and failing-step output directly in the main-guard issue body. Without one of those, every cron-only or deploy-only failure costs Dev-E a full clone + repro cycle.
3	The `Failing run` link in the issue body requires an authenticated browser session and returns 404 without one, so Dev-E cannot view it.	Issue body should include the job conclusions table (`build: success / deploy: failure`) at filing time so the forward-fix agent can disambiguate without needing log access.
4	Three doc-only PRs (#268, #271’s adjacent doc, #273) have now been blamed for unrelated CI failures within a week.	The signal-to-noise ratio of main-guard’s “last-merged PR” heuristic is degrading. Consider a guard: if the cited PR’s diff is `*.md` only AND the failing workflow is `deploy.yml`, skip the blame and route to operator.
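A minimal sketch of how failure modes 1 and 4 might combine into one routing decision. The `RedRun` type, the `route` function, and the two route labels are hypothetical and do not come from main-guard's code. The sketch also assumes the doc-only guard should not fire when the `build` job itself failed, since a build failure on a doc-only PR can still be a real regression.

```python
from dataclasses import dataclass


@dataclass
class RedRun:
    workflow_file: str      # e.g. "deploy.yml"
    job_conclusions: dict   # job name -> conclusion, e.g. {"build": "success"}
    changed_paths: list     # files in the last-merged PR's diff


def route(run):
    """Return "operator" or "last-merged-pr" for a red post-merge run."""
    failed = {job for job, c in run.job_conclusions.items() if c == "failure"}
    doc_only = bool(run.changed_paths) and all(
        path.endswith(".md") for path in run.changed_paths
    )
    # Mode 1: build is green and only deploy failed -> operator-owned infra.
    if failed == {"deploy"} and run.job_conclusions.get("build") == "success":
        return "operator"
    # Mode 4: *.md-only diff on deploy.yml, with no recorded build failure.
    if doc_only and run.workflow_file == "deploy.yml" and "build" not in failed:
        return "operator"
    # Everything else keeps the existing "last-merged PR" attribution.
    return "last-merged-pr"


this_incident = RedRun(
    workflow_file="deploy.yml",
    job_conclusions={"build": "success", "deploy": "failure"},
    changed_paths=["src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"],
)
print(route(this_incident))
```

With the job conclusions this postmortem infers for run `26036707836` (build green, deploy red), the sketch routes to the operator rather than to PR #273.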

Follow-ups

rig-conductor issue: split main-guard escalation by failing job. A build failure on a doc-only PR is still a real regression (schema break, frontmatter typo); a deploy failure on a doc-only PR almost certainly isn’t.
rig-conductor issue: include the failing-job and failing-step names in the main-red issue body so Dev-E can act without reading the run.
dashecorp/infra or operator: verify CLOUDFLARE_API_TOKEN for rig-docs is valid and not expired; consider rotating proactively given the post-merge 2365d00 deploy failure.
rig-docs: grant actions: read to the GitHub App installation used by Dev-E so future deploy-only failures are diagnosable from the workflow logs.
Re-trigger Build and deploy once Cloudflare credentials are confirmed green (`gh workflow run deploy.yml -R dashecorp/rig-docs` after operator rotation).
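The second follow-up could look roughly like the sketch below. It assumes main-guard can already read the run's jobs in the shape the GitHub REST API returns from its "list jobs for a workflow run" endpoint (a `jobs` array whose entries carry `name`, `conclusion`, and `steps`). The `summarize_jobs` helper is hypothetical, and the sample data, including the step names, is invented for illustration and is not the content of run `26036707836`.

```python
def summarize_jobs(jobs):
    """Render job conclusions, plus any failing step names, as plain text
    that can be pasted into the main-red issue body."""
    lines = []
    for job in jobs:
        lines.append(f"{job['name']}: {job['conclusion']}")
        for step in job.get("steps", []):
            if step.get("conclusion") == "failure":
                lines.append(f"  failing step: {step['name']}")
    return "\n".join(lines)


# Invented sample in the assumed API shape; step names are placeholders.
sample_jobs = [
    {
        "name": "build",
        "conclusion": "success",
        "steps": [{"name": "npm run build", "conclusion": "success"}],
    },
    {
        "name": "deploy",
        "conclusion": "failure",
        "steps": [
            {"name": "Download dist", "conclusion": "success"},
            {"name": "Publish with wrangler", "conclusion": "failure"},
        ],
    },
]
print(summarize_jobs(sample_jobs))
```

Embedding this text at filing time would let Dev-E distinguish a deploy-only failure from a build regression without `actions: read` access.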

Why this PR

This PR is the canonical forward-fix per the rig-conductor escalation playbook: a docs PR that (a) records what happened, (b) names the wrong-attribution pattern, and (c) closes rig-docs#274. It deliberately does not modify any code — there is no code regression to fix, and shipping speculative changes to deploy.yml would risk a real regression.

Reverting PR #273 (the playbook’s other option) would lose a load-bearing SLA-breach postmortem and would not touch any input the failing deploy job consumes; it is mechanically incapable of clearing the red.
