Postmortem — main-red SLA breach on cron-only failure (2026-05-18)
main-guard fired the 30-minute SLA-breach escalation (`rig-docs#270`) against PR #268 because the cron-driven `Weekly roadmap summary` workflow had been failing silently every Monday since PR #131. The cited “culprit” PR #268 was doc-only and could not have caused a bash syntax error.
| Item | Value |
|---|---|
| SLA-breach incident | rig-docs#270 |
| Forward-fix incident | rig-docs#269 |
| Cited “culprit PR” | #268 (doc-only postmortem; not the cause) |
| Real cause | scripts/roadmap-summary.sh:75 — bash [[ "$due" <= "$SOON" ]] is invalid syntax |
| Introduced in | #131 — 4b6ad66 (2026-04-23) |
| Canonical forward-fix | #271 |
| Reverting #268 | Wrong — would lose a separate main-guard postmortem and not fix the cron |
Why the SLA breached
- PR #131 added `scripts/roadmap-summary.sh` containing `[[ "$due" <= "$SOON" ]]`.
- Bash `[[ ]]` lacks `<=`/`>=`; the function body fails to parse at call time, so every Monday cron since 2026-04-23 has exited 2 (4 silent failures: 04-27, 05-04, 05-11, 05-18).
- On 2026-05-18, main-guard sampled the latest failed run (`26029526699`) and filed `rig-docs#269` blaming PR #268, but PR #268 only added a markdown postmortem.
- Because no forward-fix merged within 30 min, main-guard escalated to the SLA-breach incident `rig-docs#270`.
- Two Dev-E agents picked up `#269` in parallel (`feature/issue-269-fix-roadmap-summary-bash`, `feature/issue-269-fix-weekly-roadmap`); only the later one (#271) reached PR. The SLA-breach incident (#270) was dispatched separately.
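The comparison on line 75 can be written with operators that `[[ ]]` does support. This is a minimal sketch, not the change merged in #271, which this document does not show. It assumes `due` and `SOON` hold zero-padded `YYYY-MM-DD` dates, so that string order matches chronological order. The function name `is_due_soon` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes both arguments are zero-padded YYYY-MM-DD strings,
# so lexicographic order equals chronological order.
# is_due_soon is a hypothetical helper, not a function from the real script.
is_due_soon() {
  local due="$1" soon="$2"
  # [[ ]] supports < and > for strings but has no <= or >=.
  # "due <= soon" is therefore written as "not (due > soon)".
  [[ ! "$due" > "$soon" ]]
}

if is_due_soon "2026-05-18" "2026-05-25"; then
  echo "due soon"
fi
```

Negating `>` keeps the test to a single expression; an equivalent form would be `[[ "$due" < "$soon" || "$due" == "$soon" ]]`.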
Why “revert PR #268” would have been wrong
The escalation playbook offers `gh pr revert 268` as a fallback. In this case:
- PR #268 added `src/content/docs/research/2026-05-15-main-guard-cancelled-false-positive.md` and no code.
- Reverting it would not touch `scripts/roadmap-summary.sh` and the cron would still fail.
- The lost doc is itself a main-guard postmortem; reverting would compound the operational debt.

The correct action was always a forward-fix to the bash script.
Failure modes surfaced
| # | Failure mode | Implication |
|---|---|---|
| 1 | Cron workflows have no merge-to-blame mapping; main-guard attributed by “last merge before fire time” | Cron failures should report `<unknown culprit — cron schedule>` and skip revert offers. File against `rig-conductor`. |
| 2 | `bash -n` does not catch `<=` / `>=` inside function bodies; those parse lazily at call time | Use `shellcheck` (pre-installed on `ubuntu-latest`), not `bash -n`, for shell linting. |
| 3 | A scheduled-only workflow failed silently for 4 Mondays before any monitor noticed | Cron-only failures need a separate monitor path; PR-blame-based main-guard alone is insufficient. |
| 4 | Two Dev-E agents raced on #269 and produced parallel forward-fix branches | rig-conductor should single-flight dispatch per issue, or agents should check for open PRs that close the same issue before authoring. |
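Failure mode 1 amounts to checking what triggered the failed run before naming a culprit. A rough sketch of that decision follows, under stated assumptions. main-guard's real attribution code is not shown in this document, and the function name `culprit_for_run` and its arguments are hypothetical. The trigger value is assumed to come from something like `gh run view <run-id> --json event --jq .event`, where cron-triggered runs report `schedule`.

```shell
#!/usr/bin/env bash
# Sketch only: culprit_for_run is a hypothetical helper, not main-guard code.
#   $1 = trigger event of the failed run (e.g. "push" or "schedule")
#   $2 = number of the last PR merged before the failure
culprit_for_run() {
  local event="$1" last_merged_pr="$2"
  if [[ "$event" == "schedule" ]]; then
    # A cron run has no triggering merge, so name no PR and offer no revert.
    echo "<no culprit>"
  else
    echo "PR #${last_merged_pr}"
  fi
}

culprit_for_run "schedule" 268
culprit_for_run "push" 268
```

With this branch in place, the 2026-05-18 run would have been reported without a PR attached, and the escalation body would have had no revert to offer.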
Follow-ups (do NOT close #270 until)
- `rig-conductor` issue: stop attributing cron-only failures to the last merged PR; report `<no culprit>` and omit revert offers from the escalation body.
- `rig-conductor` issue: single-flight dispatch per source issue, or check for open PRs that close the issue before re-dispatching.
- `rig-docs` follow-up: add `shellcheck scripts/*.sh` to `capabilities-lint.yml` (or a new `shell-lint.yml`). Deliberately deferred from this PR to keep the forward-fix minimal.
- Verify next Monday cron (2026-05-25 07:00 UTC) posts to Discord #admin successfully.
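The "check for open PRs that close the issue" option could look roughly like the sketch below. It is an illustration only: `pr_body_closes_issue` is a hypothetical helper, and it recognises only same-repository references of the form `<closing keyword> #N`, not cross-repository or URL references.

```shell
#!/usr/bin/env bash
# Sketch only: pr_body_closes_issue is a hypothetical helper.
#   $1 = PR body text, $2 = issue number
# Succeeds when the body contains a GitHub closing keyword
# (close/closes/closed, fix/fixes/fixed, resolve/resolves/resolved)
# followed by #<issue>, and the number is not a prefix of a longer one.
pr_body_closes_issue() {
  local body="$1" issue="$2"
  grep -Eiq "(^|[^[:alnum:]])(close[sd]?|fix(e[sd])?|resolve[sd]?):? +#${issue}([^0-9]|\$)" <<<"$body"
}

if pr_body_closes_issue "Fixes #269: use a valid bash comparison" 269; then
  echo "an open PR already targets #269; skip authoring"
fi
```

In practice the bodies would come from something like `gh pr list --state open --json number,body`; that wiring is left out here so the matching logic stays testable on its own.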
History
This postmortem was originally drafted by dev-e-bot in PR #272. That PR also re-applied the script fix that PR #271 had merged in parallel, leaving #272 with a redundant code change and `mergeStateStatus: DIRTY`. Per Review-E’s request, the postmortem doc has been extracted into this docs-only PR; #272 was closed as superseded.