Postmortem — main-red false positive on `Build and deploy` against doc-only PR #273 (2026-05-18)

Third main-guard false positive in a week, second one this hour. The post-merge run of the Build and deploy workflow for commit 2365d00 (the squash of PR #273) returned failure. main-guard attributed the failure to PR #273 because it was the most-recently-merged PR. PR #273 added exactly one file — a markdown postmortem — with valid frontmatter and no Mermaid blocks. The build job is deterministic against that tree and passes locally; the failure was in the deploy job, whose only inputs are the Cloudflare API token, account ID, and the unchanged dist/ artifact. PR #273 could not have caused it.

| Item | Value |
| --- | --- |
| Forward-fix issue | rig-docs#274 |
| Cited “culprit PR” | #273: single-file markdown postmortem (+69, -0) |
| Workflow | `.github/workflows/deploy.yml` (Build and deploy) |
| Failing run | 26036707836 |
| Local repro | `AUTO_DERIVE=off npm run brain:check && npm run build && node scripts/check-whitepaper-routes.mjs`, all green at 2365d00 |
| Plausible failure surface | deploy job: `cloudflare/wrangler-action@v3` (operator-owned secrets) |
| Reverting PR #273 | Wrong: would lose the SLA-breach postmortem and would not touch any input the deploy job consumes |

PR #273 changed exactly one file:

src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md +69 -0

The post-merge run of .github/workflows/deploy.yml exercises two jobs in sequence:

| Job | Step | Inputs the new file could influence | Outcome locally |
| --- | --- | --- | --- |
| build | `npm ci` | `package.json`, `package-lock.json` | green |
| build | `AUTO_DERIVE=off npm run brain:check` | `facts/*.yaml`, `BRAIN.md` | green (the new file lives in `src/content/docs/`, not `facts/`) |
| build | `npm run build` | the new `.md` is loaded by `docsLoader()` and validated against the `src/content.config.ts` schema | green (frontmatter conforms) |
| build | `node scripts/check-whitepaper-routes.mjs` | files under `src/content/docs/whitepaper/` only | green (the new doc is under `research/`) |
| deploy | `cloudflare/wrangler-action@v3` | `secrets.CLOUDFLARE_API_TOKEN`, `secrets.CLOUDFLARE_ACCOUNT_ID`, the build’s `dist/` artifact | not reproducible locally (operator-owned credentials) |

Every input PR #273 could plausibly affect was rebuilt locally against the merge SHA and passed. The only remaining surface is the deploy job, whose inputs are the Cloudflare API token, the account ID, and the `dist/` artifact produced by the build. PR #273 touches neither secret, and the artifact builds cleanly at that SHA.
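The elimination above can be sketched as a lookup from changed paths to the steps that read them. This is a minimal illustration, not main-guard's implementation: the `STEP_INPUTS` mapping and the `steps_touched` helper are hypothetical names, and the path patterns are copied from the table and deliberately simplified (real steps read more files than these).

```python
from fnmatch import fnmatch

# Step -> path patterns, taken from the table above.
# Assumption: simplified; the real steps read more files than listed here.
STEP_INPUTS = {
    "npm ci": ["package.json", "package-lock.json"],
    "npm run brain:check": ["facts/*.yaml", "BRAIN.md"],
    "npm run build": ["src/content/docs/*"],
    "check-whitepaper-routes": ["src/content/docs/whitepaper/*"],
    "deploy": [],  # secrets and the built artifact; nothing git-tracked
}


def steps_touched(changed_paths):
    """Return the steps whose listed inputs match at least one changed path."""
    return [
        step
        for step, patterns in STEP_INPUTS.items()
        if any(fnmatch(path, pat) for path in changed_paths for pat in patterns)
    ]


pr_273 = [
    "src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"
]
print(steps_touched(pr_273))  # ['npm run build']
```

Under these assumptions the only step PR #273 can reach is `npm run build`, which passed locally, and the `deploy` entry can never match a diff because it lists no tracked paths.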

main-guard’s “last-merged PR” heuristic works for build-job regressions (the diff at the merge SHA is the relevant input). It does not work for deploy-job failures, which depend on:

  • The integrity of secrets.CLOUDFLARE_API_TOKEN (rotation, expiry, scope drift)
  • The state of the Cloudflare account (API rate-limit, project rename, account suspension)
  • The Cloudflare API itself (transient 5xx, regional outage)

None of these are git-tracked. Blaming a doc-only PR for any of them surfaces operator action items (rotate the token, check Cloudflare status) but routes them through the wrong channel — Dev-E has no path to fix any of them.
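One way to express this distinction is to route on per-job conclusions instead of on the workflow's overall conclusion. The sketch below is hypothetical: `route_failure` and the target labels (`"last-merged-pr"`, `"operator"`, `"none"`) are illustrative names, and it assumes the job names `build` and `deploy` used in this postmortem.

```python
def route_failure(job_conclusions):
    """Pick an escalation target from the per-job conclusions of one run.

    job_conclusions maps job name -> conclusion string, for example
    {"build": "success", "deploy": "failure"}.
    """
    failed = {job for job, result in job_conclusions.items() if result == "failure"}
    if not failed:
        return "none"
    if "build" in failed:
        # The diff at the merge SHA is the relevant input for build failures.
        return "last-merged-pr"
    # Build is green and something downstream failed: the inputs are not git-tracked.
    return "operator"


print(route_failure({"build": "success", "deploy": "failure"}))  # operator
print(route_failure({"build": "failure", "deploy": "skipped"}))  # last-merged-pr
```

A build failure still blames the last merged PR, so a doc-only PR that breaks the schema or frontmatter would be caught as before.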

| # | Failure mode | Implication |
| --- | --- | --- |
| 1 | `Build and deploy` is a single workflow with two semantically distinct jobs (compile + ship). main-guard treats any failure conclusion identically. | Split escalation: if build is green and only deploy fails, the regression is in operator-owned infra. File against rig-conductor (or against the operator’s runbook), not against the last merged PR. |
| 2 | Dev-E receives `Build and deploy` failed issues but lacks GitHub Actions log read access (HTTP 403: Resource not accessible by integration on `/actions/runs/{id}`, `/check-runs`, `/check-suites`). | Either grant `actions: read` to the bot’s installation token, or include the failing-job name and failing-step output directly in the main-guard issue body. Without one of those, every cron-only or deploy-only failure costs Dev-E a full clone + repro cycle. |
| 3 | The Failing run link in the issue body is unauthenticated and returns 404 without a browser session, so Dev-E cannot view it. | The issue body should include the job conclusions table (build: success / deploy: failure) at filing time so the forward-fix agent can disambiguate without needing log access. |
| 4 | Three doc-only PRs (#268, #271’s adjacent doc, #273) have now been blamed for unrelated CI failures within a week. | The signal-to-noise ratio of main-guard’s “last-merged PR” heuristic is degrading. Consider a guard: if the cited PR’s diff is `*.md` only AND the failing workflow is `deploy.yml`, skip the blame and route to operator. |

  • rig-conductor issue: split main-guard escalation by failing job. A build failure on a doc-only PR is still a real regression (schema break, frontmatter typo); a deploy failure on a doc-only PR almost certainly isn’t.
  • rig-conductor issue: include the failing-job and failing-step names in the main-red issue body so Dev-E can act without reading the run.
  • dashecorp/infra or operator: verify that `CLOUDFLARE_API_TOKEN` for rig-docs is valid and not expired; consider rotating proactively given the post-merge 2365d00 deploy failure.
  • rig-docs: grant `actions: read` to the GitHub App installation used by Dev-E so future deploy-only failures are diagnosable from the workflow logs.
  • Re-trigger `Build and deploy` once Cloudflare credentials are confirmed green (`gh workflow run deploy.yml -R dashecorp/rig-docs` after operator rotation).
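The guard from row 4 and the issue-body change from rows 2 and 3 could look roughly like the sketch below. Everything here is an assumption for illustration: `skip_pr_blame` and `render_issue_body` are hypothetical helpers, not main-guard functions, and the guard adds a "build is green" check so that a build failure on a doc-only PR is still attributed to the PR, as the first action item requires.

```python
def skip_pr_blame(changed_paths, workflow_file, job_conclusions):
    """True when a doc-only diff hit a deploy.yml failure with a green build."""
    doc_only = bool(changed_paths) and all(p.endswith(".md") for p in changed_paths)
    is_deploy_workflow = workflow_file.endswith("deploy.yml")
    build_green = job_conclusions.get("build") == "success"
    return doc_only and is_deploy_workflow and build_green


def render_issue_body(run_id, jobs):
    """Render job conclusions inline so no log access is needed.

    jobs is a list of (job_name, conclusion, failing_step_or_None) tuples.
    """
    lines = [
        f"Failing run: {run_id}",
        "",
        "| Job | Conclusion | Failing step |",
        "| --- | --- | --- |",
    ]
    for name, conclusion, step in jobs:
        lines.append(f"| {name} | {conclusion} | {step or '-'} |")
    return "\n".join(lines)


# Illustrative values only: the failing step is the plausible surface named
# in this postmortem, not a confirmed log reading.
conclusions = {"build": "success", "deploy": "failure"}
changed = ["src/content/docs/research/2026-05-18-main-red-sla-breach-cron-attribution.md"]
print(skip_pr_blame(changed, ".github/workflows/deploy.yml", conclusions))
print(render_issue_body(
    26036707836,
    [("build", "success", None), ("deploy", "failure", "cloudflare/wrangler-action@v3")],
))
```

With a body like this, the forward-fix agent can tell a deploy-only failure from a build regression at a glance, even while the run link stays inaccessible.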

This PR is the canonical forward-fix per the rig-conductor escalation playbook: a docs PR that (a) records what happened, (b) names the wrong-attribution pattern, and (c) closes rig-docs#274. It deliberately does not modify any code — there is no code regression to fix, and shipping speculative changes to deploy.yml would risk a real regression.

Reverting PR #273 (the playbook’s other option) would lose a load-bearing SLA-breach postmortem and would not touch any input the failing deploy job consumes; it is mechanically incapable of clearing the red.