Give main-targeting deploy runs a unique concurrency group

main-guard filed rig-docs#265 after PR #264 merged on 2026-05-09 at 05:55:35 UTC. The post-merge Build and deploy run was marked cancelled (not failure), and main-guard treats cancelled as a regression.

Reproducing the build locally against the merge commit (c633016) produced a green build:

npm ci # ok
AUTO_DERIVE=off npm run brain:check # ok
npm run build # 103 pages, completed in 79.79s
node scripts/check-whitepaper-routes.mjs # all 18 whitepaper routes present

So the regression was not in the code that shipped. The cancellation came from the workflow’s own concurrency policy.

deploy.yml had:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

The workflow has three triggers that all resolve to github.ref = refs/heads/main:

| Trigger | Ref | Same group? |
| --- | --- | --- |
| push: branches: [main] | refs/heads/main | yes |
| schedule: '0 * * * *' | refs/heads/main (default branch) | yes |
| repository_dispatch | refs/heads/main (default branch) | yes |

PR #264 merged at 05:55:35 UTC. The post-merge push run started immediately and would typically take 2–4 minutes (npm ci + brain:check + astro build + Pagefind + artifact upload). At 06:00:00 UTC the hourly cron fired against main, landed in the same concurrency group, and cancel-in-progress: true killed the still-running merge build. GitHub recorded the merge run as cancelled, and main-guard interpreted that as red.
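The collision described above can be sketched as a toy Python model. This is an illustration only, not GitHub's implementation: `old_group` mirrors the original group expression, and `start_run` is a hypothetical helper that applies the cancel-in-progress: true rule to a dictionary of in-progress runs. The run labels are made up for readability.

```python
from typing import Dict, List

WORKFLOW = "Build and deploy"
MAIN = "refs/heads/main"


def old_group(workflow: str, ref: str) -> str:
    """Mirror of the original key: ${{ github.workflow }}-${{ github.ref }}."""
    return f"{workflow}-{ref}"


def start_run(in_progress: Dict[str, str], group: str, run: str) -> List[str]:
    """Start `run` in `group` under cancel-in-progress: true (toy model).

    Returns the labels of any runs cancelled to make room.
    """
    cancelled = []
    if group in in_progress:
        cancelled.append(in_progress[group])
    in_progress[group] = run
    return cancelled


# All three main-targeting triggers resolve to the same ref, hence the same group.
triggers = {"push": MAIN, "schedule": MAIN, "repository_dispatch": MAIN}
groups = {event: old_group(WORKFLOW, ref) for event, ref in triggers.items()}

in_progress: Dict[str, str] = {}
first = start_run(in_progress, groups["push"], "push@05:55:35")
second = start_run(in_progress, groups["schedule"], "cron@06:00:00")

print("distinct groups:", len(set(groups.values())))
print("cancelled by the cron:", second)
```

Because every trigger maps to one key, the cron's arrival is enough to cancel the merge build in this model, matching the symptom in #265.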

Pending-run cancellation — the rule the first fix missed

PR #266’s first revision only flipped cancel-in-progress to be PR-only:

concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

Review-E correctly flagged this as incomplete. From GitHub Actions workflow syntax — concurrency:

“When a concurrent job or workflow is queued, if another job or workflow using the same concurrency group in the repository is in progress, the queued job or workflow will be pending. Any previously pending job or workflow in the concurrency group will be canceled.”

This holds regardless of cancel-in-progress. So with all main-targeting triggers in one group, the race that tripped main-guard could still happen:

  1. 06:00 UTC cron starts, runs for ~3 min.
  2. PR merges at 06:01 UTC → push run lands pending behind the cron.
  3. 06:01 UTC repository_dispatch from rig-gitops arrives → it becomes the new pending run, and the queued push is cancelled to make room.
  4. main-guard sees the push run with conclusion cancelled and files a main-red issue — identical symptom to #265.
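The four steps above can be replayed against a toy model of one concurrency group with cancel-in-progress disabled. This is a minimal sketch based only on the quoted rule (one running run, one pending slot, the previous pending run is cancelled); the `ConcurrencyGroup` class and the run labels are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConcurrencyGroup:
    """Toy model of a single group with cancel-in-progress: false."""

    running: Optional[str] = None
    pending: Optional[str] = None
    cancelled: List[str] = field(default_factory=list)

    def enqueue(self, run: str) -> None:
        if self.running is None:
            # Nothing in progress: the run starts right away.
            self.running = run
            return
        # Something is in progress: the new run takes the single pending
        # slot, and whatever was pending before is cancelled.
        if self.pending is not None:
            self.cancelled.append(self.pending)
        self.pending = run

    def finish_running(self) -> None:
        # The in-progress run completes; the pending run (if any) starts.
        self.running, self.pending = self.pending, None


group = ConcurrencyGroup()
group.enqueue("schedule 06:00")             # step 1: cron starts
group.enqueue("push 06:01")                 # step 2: push goes pending
group.enqueue("repository_dispatch 06:01")  # step 3: dispatch replaces it

print("running:", group.running)
print("pending:", group.pending)
print("cancelled:", group.cancelled)        # step 4: what main-guard sees
```

In this model the in-progress cron is never touched, yet the merge build still ends up cancelled, which is why changing cancel-in-progress alone does not close the race.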

The only robust fix is to make sure main-targeting runs never share a queue slot.

Key the concurrency group on github.run_id for every non-PR trigger so each main-targeting run lands in its own group of one. PR runs keep the ref-based shared group so a new commit on the same PR still cancels the older build.

concurrency:
  group: ${{ github.workflow }}-${{ github.event_name == 'pull_request' && github.ref || format('{0}-{1}', github.ref, github.run_id) }}
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}

| Event | Group key | Behaviour |
| --- | --- | --- |
| pull_request | Build and deploy-refs/pull/N/merge (shared per PR) | Newer commit cancels older build for the same PR. Other PRs and main runs are unaffected. |
| push to main | Build and deploy-refs/heads/main-<run_id> (unique) | Runs to completion. Cannot be cancelled by a later trigger. |
| schedule (hourly) | Build and deploy-refs/heads/main-<run_id> (unique) | Runs in parallel with any in-flight push/cron. |
| workflow_dispatch | Build and deploy-refs/heads/main-<run_id> (unique) | Runs in parallel; never blocks a merge build. |
| repository_dispatch | Build and deploy-refs/heads/main-<run_id> (unique) | Runs in parallel; never cancels a pending merge build. |
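To check the group keys in the table by hand, the expression can be mirrored in Python. This is a sketch, not something GitHub evaluates: Python's `and`/`or` short-circuit the same way as `&&`/`||` for non-empty strings, so the helper below should produce the same keys. The run IDs and the PR ref are made-up sample values.

```python
def concurrency_group(workflow: str, event_name: str, ref: str, run_id: int) -> str:
    """Python mirror of the group expression in deploy.yml (illustrative)."""
    # PR events keep the ref-only key; everything else appends the run ID.
    suffix = (event_name == "pull_request" and ref) or f"{ref}-{run_id}"
    return f"{workflow}-{suffix}"


def cancel_in_progress(event_name: str) -> bool:
    """Mirror of: ${{ github.event_name == 'pull_request' }}."""
    return event_name == "pull_request"


WORKFLOW = "Build and deploy"

# Two commits on the same PR share a key, so the newer one cancels the older.
pr_first = concurrency_group(WORKFLOW, "pull_request", "refs/pull/264/merge", 1001)
pr_second = concurrency_group(WORKFLOW, "pull_request", "refs/pull/264/merge", 1002)

# Main-targeting runs each get their own key, whatever the trigger.
push_key = concurrency_group(WORKFLOW, "push", "refs/heads/main", 1003)
cron_key = concurrency_group(WORKFLOW, "schedule", "refs/heads/main", 1004)

for key in (pr_first, pr_second, push_key, cron_key):
    print(key)
```

One property worth noting: if `github.ref` were ever empty on a PR event, the `&&`/`||` idiom would fall through to the run-ID branch, which errs on the side of a unique group.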

Why not keep ref-level serialisation for main

Considered keeping group: workflow-ref for non-PR events and relying on a “multi-pending” queue. GitHub Actions does not expose a multi-pending queue mode — the single-pending behaviour is fixed. The only way to guarantee a post-merge push run is never cancelled while pending is to ensure it never queues.

Considered keying on event_name instead of run_id. Rejected because two near-simultaneous push events (rare, but possible with rapid-fire PR merges) would still share a group and reintroduce the same single-pending hazard.
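The difference between the two keying schemes can be shown with two near-simultaneous push runs. This is a small sketch: the exact shape of the rejected event_name-based key is not given in the record, so `key_by_event` is an assumed form, and the run IDs are made up.

```python
WORKFLOW = "Build and deploy"
MAIN = "refs/heads/main"


def key_by_event(workflow: str, ref: str, event_name: str) -> str:
    """Assumed shape of the rejected alternative: key on the event name."""
    return f"{workflow}-{ref}-{event_name}"


def key_by_run(workflow: str, ref: str, run_id: int) -> str:
    """Shape of the adopted key for non-PR events: key on the run ID."""
    return f"{workflow}-{ref}-{run_id}"


# Two rapid-fire merges produce two push events with different run IDs.
pushes = [("push", 2001), ("push", 2002)]

event_keys = {key_by_event(WORKFLOW, MAIN, event) for event, _ in pushes}
run_keys = {key_by_run(WORKFLOW, MAIN, run_id) for _, run_id in pushes}

print("groups when keyed on event_name:", len(event_keys))
print("groups when keyed on run_id:", len(run_keys))
```

With event-name keys both pushes land in one group and are again subject to the single-pending rule; with run-ID keys each push gets a group of its own.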

Considered splitting deploy.yml into two workflows (PR-only and main-only). Rejected as accidental complexity: the build steps are identical, and any drift between the two workflows would be a new failure mode. A single workflow with one branchy concurrency expression is the smaller change.

Dropping ref-level serialisation means two main-targeting runs may now upload to Cloudflare Pages in parallel. This is safe because Cloudflare Pages serialises deploys internally per project: every wrangler pages deploy call produces an immutable deployment, and the project alias for --branch=main is atomically advanced to the most recent successful deployment. Parallel uploads converge on “latest successful deployment wins” without races on the alias.

If a future change to the build produced non-deterministic output (different bytes from the same commit), the alias could briefly point at an older deployment before the newer upload finishes — but the build today is deterministic for a given commit SHA, and wrangler pages deploy does not roll back successful deployments.

  • Post-merge Build and deploy runs cannot be cancelled by any later trigger entering the queue. main-guard will not file cancelled-conclusion issues against the workflow for this class of race anymore.
  • Two main-targeting runs may execute concurrently. CI minutes per commit are unchanged: each run is still triggered by exactly one of push, cron, or dispatch, and overlap between them is the exception, not the rule.
  • The Cloudflare Pages alias for --branch=main may briefly point at an older deployment if a long cron overlaps a fast push, then catches up when the newer build finishes. Acceptable: the older deployment is still a valid recent commit’s output, and the cadence is sub-hour.
  • The hourly cron and operator-triggered workflow_dispatch no longer have to wait on each other; both deploy promptly.

Forward-fix PR for #265 — see PR #266 review feedback for the original miss and the corrected design.