Postmortem — main-red SLA breach on cron-only failure (2026-05-18)

main-guard fired the 30-minute SLA-breach escalation (rig-docs#270) against PR #268 because the cron-driven Weekly roadmap summary workflow had been failing silently every Monday since PR #131. The cited “culprit” PR #268 was doc-only and could not have caused a bash syntax error.

TL;DR

Item	Value
SLA-breach incident	`rig-docs#270`
Forward-fix incident	`rig-docs#269`
Cited “culprit PR”	`#268` (doc-only postmortem; not the cause)
Real cause	`scripts/roadmap-summary.sh:75` — bash `[[ "$due" <= "$SOON" ]]` is invalid syntax
Introduced in	`#131` — `4b6ad66` (2026-04-23)
Canonical forward-fix	`#271`
Reverting #268	Wrong — would lose a separate main-guard postmortem and not fix the cron

Why the SLA breached

PR #131 added scripts/roadmap-summary.sh containing [[ "$due" <= "$SOON" ]].
Bash [[ ]] lacks <= / >=; the function body fails to parse at call time, so every Monday cron since 2026-04-23 has exited 2 (4 silent failures: 04-27, 05-04, 05-11, 05-18).
On 2026-05-18, main-guard sampled the latest failed run (26029526699) and filed rig-docs#269 blaming PR #268 — but PR #268 only added a markdown postmortem.
Because no forward-fix merged within 30 min, main-guard escalated to the SLA-breach incident rig-docs#270.
Two Dev-E agents picked up #269 in parallel (feature/issue-269-fix-roadmap-summary-bash, feature/issue-269-fix-weekly-roadmap); only the later one (#271) reached PR. The SLA-breach incident (#270) was dispatched separately.

Why “revert PR #268” would have been wrong

The escalation playbook offers gh pr revert 268 as a fallback. In this case:

PR #268 added src/content/docs/research/2026-05-15-main-guard-cancelled-false-positive.md and no code.
Reverting it would not touch scripts/roadmap-summary.sh and the cron would still fail.
The lost doc is itself a main-guard postmortem; reverting would compound the operational debt.

The correct action was always a forward-fix to the bash script.

Failure modes surfaced

#	Failure mode	Implication
1	Cron workflows have no merge-to-blame mapping; main-guard attributed by “last merge before fire time”	Cron failures should report `<unknown culprit — cron schedule>` and skip revert offers. File against `rig-conductor`.
2	`bash -n` does not catch `<=` / `>=` inside function bodies — those parse lazily at call time	Use `shellcheck` (pre-installed on `ubuntu-latest`), not `bash -n`, for shell linting.
3	A scheduled-only workflow failed silently for 4 Mondays before any monitor noticed	Cron-only failures need a separate monitor path; PR-blame-based main-guard alone is insufficient.
4	Two Dev-E agents raced on `#269` and produced parallel forward-fix branches	rig-conductor should single-flight dispatch per issue, or agents should check for open PRs that close the same issue before authoring.

Follow-ups (do NOT close `#270` until)

rig-conductor issue: stop attributing cron-only failures to the last merged PR; report <no culprit> and omit revert offers from the escalation body.
rig-conductor issue: single-flight dispatch per source issue, or check for open PRs that close the issue before re-dispatching.
rig-docs follow-up: add shellcheck scripts/*.sh to capabilities-lint.yml (or a new shell-lint.yml). Deliberately deferred from this PR to keep the forward-fix minimal.
Verify next Monday cron (2026-05-25 07:00 UTC) posts to Discord #admin successfully.

History

This postmortem was originally drafted by dev-e-bot in PR #272. That PR also re-applied the script fix that PR #271 had merged in parallel, leaving #272 with a redundant code change and mergeStateStatus: DIRTY. Per Review-E’s request, the postmortem doc has been extracted into this docs-only PR; #272 was closed as superseded.

Postmortem — main-red SLA breach on cron-only failure (2026-05-18)

Postmortem — main-red SLA breach on cron-only failure (2026-05-18)

TL;DR

Why the SLA breached

Why “revert PR #268” would have been wrong

Failure modes surfaced

Follow-ups (do NOT close #270 until)

History

Follow-ups (do NOT close `#270` until)