Skip to content

Postmortem — main-red SLA breach on cron-only failure (2026-05-18)

Postmortem — main-red SLA breach on cron-only failure (2026-05-18)

Section titled “Postmortem — main-red SLA breach on cron-only failure (2026-05-18)”

main-guard fired the 30-minute SLA-breach escalation (rig-docs#270) against PR #268 because the cron-driven Weekly roadmap summary workflow had been failing silently every Monday since PR #131. The cited “culprit” PR #268 was doc-only and could not have caused a bash syntax error.

ItemValue
SLA-breach incidentrig-docs#270
Forward-fix incidentrig-docs#269
Cited “culprit PR”#268 (doc-only postmortem; not the cause)
Real causescripts/roadmap-summary.sh:75 — bash [[ "$due" <= "$SOON" ]] is invalid syntax
Introduced in#1314b6ad66 (2026-04-23)
Canonical forward-fix#271
Reverting #268Wrong — would lose a separate main-guard postmortem and not fix the cron
  1. PR #131 added scripts/roadmap-summary.sh containing [[ "$due" <= "$SOON" ]].
  2. Bash [[ ]] lacks <= / >=; the function body fails to parse at call time, so every Monday cron since 2026-04-23 has exited 2 (4 silent failures: 04-27, 05-04, 05-11, 05-18).
  3. On 2026-05-18, main-guard sampled the latest failed run (26029526699) and filed rig-docs#269 blaming PR #268 — but PR #268 only added a markdown postmortem.
  4. Because no forward-fix merged within 30 min, main-guard escalated to the SLA-breach incident rig-docs#270.
  5. Two Dev-E agents picked up #269 in parallel (feature/issue-269-fix-roadmap-summary-bash, feature/issue-269-fix-weekly-roadmap); only the later one (#271) reached PR. The SLA-breach incident (#270) was dispatched separately.

Why “revert PR #268” would have been wrong

Section titled “Why “revert PR #268” would have been wrong”

The escalation playbook offers gh pr revert 268 as a fallback. In this case:

  • PR #268 added src/content/docs/research/2026-05-15-main-guard-cancelled-false-positive.md and no code.
  • Reverting it would not touch scripts/roadmap-summary.sh and the cron would still fail.
  • The lost doc is itself a main-guard postmortem; reverting would compound the operational debt.

The correct action was always a forward-fix to the bash script.

#Failure modeImplication
1Cron workflows have no merge-to-blame mapping; main-guard attributed by “last merge before fire time”Cron failures should report <unknown culprit — cron schedule> and skip revert offers. File against rig-conductor.
2bash -n does not catch <= / >= inside function bodies — those parse lazily at call timeUse shellcheck (pre-installed on ubuntu-latest), not bash -n, for shell linting.
3A scheduled-only workflow failed silently for 4 Mondays before any monitor noticedCron-only failures need a separate monitor path; PR-blame-based main-guard alone is insufficient.
4Two Dev-E agents raced on #269 and produced parallel forward-fix branchesrig-conductor should single-flight dispatch per issue, or agents should check for open PRs that close the same issue before authoring.
  • rig-conductor issue: stop attributing cron-only failures to the last merged PR; report <no culprit> and omit revert offers from the escalation body.
  • rig-conductor issue: single-flight dispatch per source issue, or check for open PRs that close the issue before re-dispatching.
  • rig-docs follow-up: add shellcheck scripts/*.sh to capabilities-lint.yml (or a new shell-lint.yml). Deliberately deferred from this PR to keep the forward-fix minimal.
  • Verify next Monday cron (2026-05-25 07:00 UTC) posts to Discord #admin successfully.

This postmortem was originally drafted by dev-e-bot in PR #272. That PR also re-applied the script fix that PR #271 had merged in parallel, leaving #272 with a redundant code change and mergeStateStatus: DIRTY. Per Review-E’s request, the postmortem doc has been extracted into this docs-only PR; #272 was closed as superseded.