Postmortem — chicken-and-egg dispatch halt (2026-04-30)
The Planner-E rollout was blocked for ~6 hours by three independent bugs that combined to prevent the rig from self-healing. The fix PR could not merge through the rig’s own pipeline because the very bug it fixed was preventing dispatch. An operator `gh pr merge --admin` broke the cycle.
| Item | Value |
|---|---|
| Outage window | ~6h, 2026-04-29 ~21:30 UTC → 2026-04-30 ~05:58 UTC |
| User-visible impact | Three agent-ready issues idle for 2 days; Planner-E rollout blocked; staging-rig epic (rig-gitops#207) replan blocked behind it |
| Independent bugs | 3 (dispatch silent no-op + Review-E CoI false-positive + COMMENTED-as-final) |
| Layers of safety that should have caught it | 4 (IssueScan, Reconciliation, OrphanScan, halt detection) |
| Layers that did | 0 |
| Break-the-cycle action | Operator gh pr merge --admin on rig-conductor#608 |
| Systemic fixes filed | 5 (rc#609, rc#610, rc#612, rar#181, rar#183) |
Timeline
All timestamps UTC. Sourced from the GitHub created_at / closed_at fields on the cited issues and PRs.
| When | Event | Source |
|---|---|---|
| 2026-04-29 20:09 | Operator files rc#604: rig-agent-runtime intake silent for 2 days — three agent-ready issues idle | rc#604 |
| 2026-04-29 20:32 | rc#605 merges: StaleAgentReadyWatcher (1h scan) added as a recovery layer for queued && AgentId is null > 24h | rc#605 |
| 2026-04-29 21:03 | rc#607 filed: the root cause behind rc#604 is identified — SubmitEvent returns null on the dispatch path, silently dropping fresh agent-ready events | rc#607 |
| 2026-04-29 21:31 | rc#608 opened with the fix for the silent no-op | rc#608 |
| 2026-04-29 22:02 | rar#181 filed: Review-E falsely flags Dev-E-authored PRs as conflict-of-interest — 11+ occurrences observed (rc#429, rc#432). rc#608 is in the firing line. | rar#181 |
| 2026-04-30 ~early | Review-E posts a COMMENTED review on rc#608 stating it is a self-conflict abstention and explicitly promising “the binding approve / request-changes vote should come from an independent Review-E run on the next cron tick.” No follow-up vote ever fires. | rc#610 body |
| 2026-04-30 05:22 | rc#609 filed: meta-fix — unified halt-detection layer across all halt classes (“the biggest problem is that everything stops”) | rc#609 |
| 2026-04-30 05:56 | rc#610 filed: COMMENTED reviews treated as final; promised “next cron tick” re-review never fires | rc#610 |
| 2026-04-30 05:58 | Operator runs gh pr merge --admin on rc#608 to break the cycle. PR closes as MERGED. | rc#608 closed_at |
| 2026-04-30 06:01 | rc#612 and rar#183 filed in parallel: structured operator-override channel with audit trail; session/cache separation between Dev-E and Review-E | rc#612, rar#183 |
| 2026-04-30 06:32 | rc#610 closes (re-review-on-COMMENTED-with-abstention shipped) | rc#610 closed_at |
| 2026-04-30 07:47 | rc#612 closes (operator-override grammar + audit shipped) | rc#612 closed_at |
The three independent bugs
Each bug was, on its own, a recoverable single-layer failure. The interaction is what halted the rig.
1. SubmitEvent silent no-op — rc#608
SubmitEvent on the dispatch path returned null rather than persisting the event for fresh agent-ready issues. The producer believed the event was accepted; the projection never advanced. The failure was visible only as “two days of nothing happening” until rc#604 surfaced it. This was the root cause behind rc#604’s “intake silent” symptom: rc#605’s StaleAgentReadyWatcher patched the symptom (re-emit after 24h), and rc#608 patched the actual bug.
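The failure shape above can be illustrated with a minimal sketch. Everything in it is hypothetical: `EventStore`, `submit_event_silent`, `submit_event_strict`, and the `accepted` flag are illustrative names, not the rig's actual API. It only shows why a null return on the non-persist path is indistinguishable from success unless the caller checks it, whereas a raised error cannot be ignored by accident.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EventStore:
    """Hypothetical in-memory stand-in for the conductor's event store."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events)  # 1-based sequence number of the stored event


class EventNotPersisted(Exception):
    """Raised instead of silently dropping an event."""


def submit_event_silent(store: EventStore, event: dict, accepted: bool) -> Optional[int]:
    # Failure shape from the postmortem: the non-persist path returns None,
    # and a caller that ignores the return value never notices the drop.
    if not accepted:
        return None
    return store.append(event)


def submit_event_strict(store: EventStore, event: dict, accepted: bool) -> int:
    # Hedged alternative: make the drop loud so the producer cannot
    # mistake "not persisted" for "accepted".
    if not accepted:
        raise EventNotPersisted(f"event for issue {event.get('issue')} was not persisted")
    return store.append(event)
```

With the strict variant, a dropped agent-ready event surfaces as an exception at the call site rather than as a projection that quietly never advances.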
2. Review-E CoI predicate over-anchored on session context — rar#181
The conflict-of-interest (CoI) gate is a real safety property: a single agent must not approve its own work. The implementation, however, anchored on signals shared by the Dev-E and Review-E sessions rather than on the actual PR author. Result: Dev-E → Review-E handoffs were rejected as self-review at least 11 times (rc#429, rc#432), including on rc#608 itself. The earlier rar#169 was a partial anchor fix; rar#181 is the full predicate fix; rar#183 removes the cause of the shared signal (session/cache separation).
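The difference between the two anchors can be sketched as follows. This is an illustration under assumptions, not Review-E's actual predicate: `ReviewContext`, its fields, and both function names are hypothetical, and the real shared signal may be something other than a session key.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewContext:
    """Hypothetical inputs available to a conflict-of-interest check."""
    pr_author: str         # identity that authored the PR
    reviewer: str          # identity about to cast the review vote
    author_session: str    # session/cache key seen while the PR was written
    reviewer_session: str  # session/cache key seen by the reviewer


def coi_by_session(ctx: ReviewContext) -> bool:
    # Over-anchored predicate: flags a conflict whenever the two agents
    # share session context, regardless of who wrote the PR.
    return ctx.author_session == ctx.reviewer_session


def coi_by_author(ctx: ReviewContext) -> bool:
    # Author-anchored predicate: a conflict exists only when the reviewer
    # is the same identity that authored the PR.
    return ctx.pr_author == ctx.reviewer


handoff = ReviewContext("dev-e", "review-e", "shared-cache", "shared-cache")
print(coi_by_session(handoff))  # True: false positive on a legitimate handoff
print(coi_by_author(handoff))   # False: the handoff is allowed
```

The author-anchored version still blocks genuine self-review, since a context where `pr_author` equals `reviewer` returns True whatever the session keys are.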
3. Conductor treating COMMENTED reviews as final, no “next cron tick” — rc#610
When Review-E correctly recognised the CoI condition and posted a COMMENTED review labeled as a self-conflict abstention, the conductor’s reconciler treated the PR as “reviewed” and stopped re-dispatching it. The promised re-review on the next 5-minute cron tick was never wired. The PR sat in REVIEW_REQUIRED for 2+ hours with a detailed COMMENTED review and zero binding votes.
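A minimal sketch of the reconciler decision, assuming a simplified review model. `Review`, `is_abstention`, `is_reviewed_naive`, and `needs_redispatch` are hypothetical names; how the real conductor recognises abstention language is not shown in this postmortem. The review states themselves (APPROVED, CHANGES_REQUESTED, COMMENTED) are GitHub's.

```python
from dataclasses import dataclass

# GitHub review states that count as a binding vote.
BINDING_STATES = {"APPROVED", "CHANGES_REQUESTED"}


@dataclass(frozen=True)
class Review:
    state: str                   # GitHub review state, e.g. "COMMENTED"
    is_abstention: bool = False  # hypothetical flag: body declares a self-conflict abstention


def is_reviewed_naive(reviews: list[Review]) -> bool:
    # Failure shape: any review at all marks the PR as reviewed.
    return len(reviews) > 0


def needs_redispatch(reviews: list[Review]) -> bool:
    # A binding vote settles the review step.
    if any(r.state in BINDING_STATES for r in reviews):
        return False
    # Re-dispatch only when every existing review is a COMMENTED abstention.
    # An empty list is left to the initial dispatch path, and ordinary
    # comments are left to other detectors.
    return bool(reviews) and all(
        r.state == "COMMENTED" and r.is_abstention for r in reviews
    )
```

For the rc#608 case, a single abstaining COMMENTED review makes `is_reviewed_naive` return True (the PR looks done) while `needs_redispatch` also returns True (the PR still needs a binding vote). Choosing a different agent session for the retry is out of scope for this sketch.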
Why each layer of safety failed
The rig has multiple layers explicitly designed to catch silent halts. None caught this one.
| Layer | What it should have caught | Why it didn’t |
|---|---|---|
| IssueScanService | An agent-ready issue with no dispatch event | Scans for missing state, not for state where SubmitEvent returned null and never persisted. The projection looked correct because no event existed to disagree with it. |
| ReconciliationService | Live GitHub state ≠ projection state | The PR existed in GitHub and in the projection, so both layers agreed the PR existed. Neither was checking “PR has only abstention reviews and zero binding votes for >Xh.” |
| OrphanScanService | Issues queued > threshold | Threshold tuned for normal traffic; the three idle issues had been moved past queued into assigned (then orphaned silently when SubmitEvent dropped the next event). They didn’t trip the orphan threshold. |
| Halt detector for review-without-vote | A PR with COMMENTED-only reviews and no progress | Did not exist. rc#609 is the meta-fix that adds this class. |
The pattern: each layer is a single-purpose detector that assumes one specific shape of failure. None of them models “the rig is making no progress, why.”
The break-the-cycle action
At 2026-04-30 05:58 UTC the operator ran:

    gh pr merge --admin dashecorp/rig-conductor#608

This was the only action available that does not depend on the rig’s own dispatch / review / merge pipeline working. The chicken-and-egg shape was: the fix for the silent dispatch could not be dispatched, reviewed, or merged through the very pipeline it fixed.
Once rc#608 merged, the rig recovered on its own — rc#609, rc#610, rc#612 all then flowed through the normal pipeline within ~2 hours.
Systemic fixes filed
| Issue | Fixes |
|---|---|
| rig-conductor#609 | Unified halt-detection layer: HaltDetectionService + pluggable IHaltDetector interface. New detectors include PrChangesRequestedNoIterationDetector, PrLabelChangedNoReviewDetector, PrLongIdleDetector, ApprovedNotMergedDetector, ReviewRecordPrUnmergedDetector, and (filed under this work) a review-without-binding-vote detector. |
| rig-conductor#610 | Re-dispatch a PR for review when the only review is COMMENTED with self-conflict-abstention language, on the next 5-min reconciler tick, targeting a different agent session. |
| rig-conductor#612 | Structured operator-override channel with grammar and append-only audit trail. Replaces ad-hoc gh pr merge --admin with a recorded, gate-bypassing path that future halt-detectors can correlate against. |
| rig-agent-runtime#181 | Tighten the Review-E CoI predicate so it anchors on PR author rather than shared session context; emit structured CoiBlocked { Reason } events for diagnostics; backfill open false-positive-blocked PRs. |
| rig-agent-runtime#183 | Architectural fix: enforce session/cache separation between Dev-E and Review-E so the CoI predicate can never see shared signals in the first place. |
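To make the “append-only audit trail” idea from rig-conductor#612 concrete, here is one possible shape, offered as a sketch only. `OverrideRecord`, `OverrideAuditLog`, and the field names are hypothetical, and the hash chaining is an assumption added for illustration: the postmortem does not say how the shipped audit trail detects tampering, or what its grammar looks like.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class OverrideRecord:
    operator: str
    action: str     # e.g. "admin-merge"
    target: str     # e.g. "rig-conductor#608"
    reason: str
    prev_hash: str  # digest of the previous record ("" for the first)
    digest: str     # SHA-256 over this record's fields plus prev_hash


def _digest(operator: str, action: str, target: str, reason: str, prev_hash: str) -> str:
    payload = json.dumps([operator, action, target, reason, prev_hash])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class OverrideAuditLog:
    """Hypothetical append-only log; each record is chained to the previous one."""

    def __init__(self) -> None:
        self._records: list[OverrideRecord] = []

    def append(self, operator: str, action: str, target: str, reason: str) -> OverrideRecord:
        prev_hash = self._records[-1].digest if self._records else ""
        record = OverrideRecord(operator, action, target, reason, prev_hash,
                                _digest(operator, action, target, reason, prev_hash))
        self._records.append(record)
        return record

    def records(self) -> tuple[OverrideRecord, ...]:
        return tuple(self._records)  # read-only view; there is no update or delete API

    def verify(self) -> bool:
        # Recompute the chain; any edited or reordered record breaks it.
        prev = ""
        for r in self._records:
            if r.prev_hash != prev or r.digest != _digest(
                    r.operator, r.action, r.target, r.reason, prev):
                return False
            prev = r.digest
        return True
```

A log like this gives later halt detectors a stable record to correlate against: an admin merge on rc#608 would appear as one entry whose reason names the halt it broke.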
Principle reinforced
Single-path safety always has this failure mode. Defense in depth, or it doesn’t actually defend.
Every one of the four safety layers above was correctly implemented for the failure shape it was designed to catch. Each one would have caught a different halt in isolation. None of them caught this one because the halt’s actual shape was new, and no detector models “rig is making no progress, why” as a first-class concept.
The lesson is not “add another detector.” It is rc#609’s framing: a unified halt-detection layer where each new failure shape becomes one more pluggable IHaltDetector rather than a one-off scanner. The rig must monitor its own forward progress as a property, not enumerate the specific ways it can stall.
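The pluggable-detector framing can be sketched as below. This is a Python analogue written under assumptions, not the shipped HaltDetectionService: `PrSnapshot`, the `detect` signature, the thresholds, and `run_detectors` are all hypothetical, and the real IHaltDetector interface may look quite different.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Protocol


@dataclass(frozen=True)
class PrSnapshot:
    """Hypothetical view of a PR as a detector would see it."""
    number: int
    binding_votes: int
    comment_reviews: int
    last_progress_at: datetime


class HaltDetector(Protocol):
    """One failure shape per detector; returns a message when the PR is halted."""
    name: str

    def detect(self, pr: PrSnapshot, now: datetime) -> Optional[str]:
        ...


class ReviewWithoutBindingVoteDetector:
    name = "review-without-binding-vote"

    def __init__(self, threshold: timedelta) -> None:
        self.threshold = threshold

    def detect(self, pr: PrSnapshot, now: datetime) -> Optional[str]:
        idle = now - pr.last_progress_at
        if pr.comment_reviews > 0 and pr.binding_votes == 0 and idle > self.threshold:
            return f"PR #{pr.number}: only non-binding reviews for {idle}"
        return None


class LongIdleDetector:
    name = "long-idle"

    def __init__(self, threshold: timedelta) -> None:
        self.threshold = threshold

    def detect(self, pr: PrSnapshot, now: datetime) -> Optional[str]:
        idle = now - pr.last_progress_at
        if idle > self.threshold:
            return f"PR #{pr.number}: no progress for {idle}"
        return None


def run_detectors(detectors: list[HaltDetector], prs: list[PrSnapshot],
                  now: datetime) -> list[tuple[str, str]]:
    # One loop owns scheduling and reporting; a new failure shape only
    # requires adding another detector to the list.
    findings = []
    for pr in prs:
        for detector in detectors:
            message = detector.detect(pr, now)
            if message is not None:
                findings.append((detector.name, message))
    return findings
```

The design choice this illustrates is that the loop, the clock, and the reporting path are shared, so each new halt class costs one small class rather than a new scanner with its own schedule and blind spots.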
Trigger to re-evaluate the staging-rig epic
rig-gitops#207, the staging-rig epic, is currently paused pending the Planner-E rollout. All three of the bugs above would have been caught by a 24-hour soak environment before they reached production.
Action for future Planner-E: once Planner-E is live, replan rig-gitops#207. The cost-of-a-staging-rig calculation now has new evidence on its side: a 6-hour production halt that only an operator admin merge could resolve. The staging soak environment is no longer a “nice to have.”
- rig-conductor#604: initial symptom
- rig-conductor#605: StaleAgentReadyWatcher (symptom patch)
- rig-conductor#607: root-cause issue
- rig-conductor#608: the admin-merged fix
- rig-conductor#609: halt-detection meta-fix
- rig-conductor#610: COMMENTED-as-final fix
- rig-conductor#612: operator-override channel
- rig-agent-runtime#181: CoI predicate fix
- rig-agent-runtime#183: session/cache separation
- rig-gitops#207: staging-rig epic (paused)