
Postmortem — chicken-and-egg dispatch halt (2026-04-30)

The Planner-E rollout was blocked for ~6 hours by three independent bugs that combined to prevent the rig from self-healing. The fix PR could not merge through the rig’s own pipeline because the very bug it fixed was preventing dispatch. An operator broke the cycle by running gh pr merge --admin.

| Item | Value |
| --- | --- |
| Outage window | ~6h, 2026-04-29 ~21:30 UTC → 2026-04-30 ~05:58 UTC |
| User-visible impact | Three agent-ready issues idle for 2 days; Planner-E rollout blocked; staging-rig epic (rig-gitops#207) replan blocked behind it |
| Independent bugs | 3 (dispatch silent no-op + Review-E CoI false-positive + COMMENTED-as-final) |
| Layers of safety that should have caught it | 4 (IssueScan, Reconciliation, OrphanScan, halt detection) |
| Layers that did | 0 |
| Break-the-cycle action | Operator gh pr merge --admin on rig-conductor#608 |
| Systemic fixes filed | 5 (rc#609, rc#610, rc#612, rar#181, rar#183) |

All timestamps UTC. Sourced from the GitHub created_at / closed_at fields on the cited issues and PRs.

| When | Event | Source |
| --- | --- | --- |
| 2026-04-29 20:09 | Operator files rc#604: rig-agent-runtime intake silent for 2 days; three agent-ready issues idle | rc#604 |
| 2026-04-29 20:32 | rc#605 merges: StaleAgentReadyWatcher (1h scan) added as a recovery layer for queued && AgentId is null > 24h | rc#605 |
| 2026-04-29 21:03 | rc#607 filed: the root cause behind rc#604 is identified. SubmitEvent returns null on the dispatch path, silently dropping fresh agent-ready events | rc#607 |
| 2026-04-29 21:31 | rc#608 opened with the fix for the silent no-op | rc#608 |
| 2026-04-29 22:02 | rar#181 filed: Review-E falsely flags Dev-E-authored PRs as conflict-of-interest; 11+ occurrences observed (rc#429, rc#432). rc#608 is in the firing line. | rar#181 |
| 2026-04-30 ~early | Review-E posts a COMMENTED review on rc#608 stating it is a self-conflict abstention and explicitly promising “the binding approve / request-changes vote should come from an independent Review-E run on the next cron tick.” No follow-up vote ever fires. | rc#610 body |
| 2026-04-30 05:22 | rc#609 filed: meta-fix, a unified halt-detection layer across all halt classes (“the biggest problem is that everything stops”) | rc#609 |
| 2026-04-30 05:56 | rc#610 filed: COMMENTED reviews treated as final; promised “next cron tick” re-review never fires | rc#610 |
| 2026-04-30 05:58 | Operator runs gh pr merge --admin on rc#608 to break the cycle. PR closes as MERGED. | rc#608 closed_at |
| 2026-04-30 06:01 | rc#612 and rar#183 filed in parallel: structured operator-override channel with audit trail; session/cache separation between Dev-E and Review-E | rc#612, rar#183 |
| 2026-04-30 06:32 | rc#610 closes (re-review-on-COMMENTED-with-abstention shipped) | rc#610 closed_at |
| 2026-04-30 07:47 | rc#612 closes (operator-override grammar + audit shipped) | rc#612 closed_at |

Each bug was, on its own, a recoverable single-layer failure. The interaction is what halted the rig.

SubmitEvent on the dispatch path returned null instead of persisting the event for fresh agent-ready issues. The producer believed the event was accepted; the projection never advanced. The failure was visible only as “two days of nothing happening” until rc#604 surfaced it. This was the root cause behind rc#604’s “intake silent” symptom: rc#605’s StaleAgentReadyWatcher patched the symptom (re-emit after 24h), and rc#608 patched the actual bug.
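The failure shape here is a call that returns null where the caller assumes success. The following is a minimal Python sketch of that shape and of one way to make it loud, not the rig's actual code: `EventStore`, `submit_event`, `dispatch`, and `DispatchError` are hypothetical names, and the `drop_fresh` flag exists only to simulate the bug.

```python
from dataclasses import dataclass, field
from typing import Optional


class DispatchError(RuntimeError):
    """Raised when an event was handed to the store but not persisted."""


@dataclass
class EventStore:
    """Hypothetical in-memory store standing in for the real event log."""
    events: list = field(default_factory=list)
    drop_fresh: bool = False  # assumption: simulates the silent no-op

    def submit_event(self, issue_id: str, kind: str) -> Optional[int]:
        if self.drop_fresh:
            return None  # nothing persisted, nothing raised
        self.events.append((issue_id, kind))
        return len(self.events) - 1  # position of the persisted event


def dispatch(store: EventStore, issue_id: str) -> int:
    """Submit a dispatch event and refuse to treat a null result as success."""
    position = store.submit_event(issue_id, "agent-ready")
    if position is None:
        raise DispatchError(f"event for {issue_id} was not persisted")
    return position
```

With this shape, a dropped event surfaces as an exception at the call site rather than as a projection that quietly never advances.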

The conflict-of-interest gate is a real safety property: a single agent must not approve its own work. The implementation, however, anchored on signals shared by the Dev-E and Review-E sessions rather than on the actual PR author. As a result, a Dev-E → Review-E handoff was rejected as self-review at least 11 times (rc#429, rc#432), including on rc#608 itself. The earlier rar#169 was a partial anchor fix; rar#181 is the full predicate fix; rar#183 removes the cause of the shared signal (session/cache separation).
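The difference between the two anchors can be illustrated with a small Python sketch. It is an illustration under assumptions, not the Review-E predicate itself: the `ReviewContext` fields, the idea that the shared signal is a session key, and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewContext:
    """Hypothetical inputs available when a reviewer picks up a PR."""
    pr_author: str         # identity that authored the PR
    reviewer: str          # identity about to vote
    author_session: str    # assumed session/cache key of the author
    reviewer_session: str  # assumed session/cache key of the reviewer


def coi_by_shared_signal(ctx: ReviewContext) -> bool:
    """Over-broad check: any shared session signal counts as self-review."""
    return ctx.author_session == ctx.reviewer_session


def coi_by_author(ctx: ReviewContext) -> bool:
    """Narrow check: conflict only when reviewer and author are the same identity."""
    return ctx.pr_author == ctx.reviewer


handoff = ReviewContext("dev-e", "review-e", "shared-cache", "shared-cache")
self_review = ReviewContext("dev-e", "dev-e", "shared-cache", "shared-cache")
```

In this sketch `coi_by_shared_signal(handoff)` is true (the false positive), while `coi_by_author(handoff)` is false and `coi_by_author(self_review)` is still true, so the safety property is preserved.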

3. Conductor treating COMMENTED reviews as final, no “next cron tick” — rc#610

When Review-E correctly recognised the CoI condition and posted a COMMENTED review labeled as a self-conflict abstention, the conductor’s reconciler treated the PR as “reviewed” and stopped re-dispatching it. The promised re-review on the next 5-minute cron tick was never wired. The PR sat in REVIEW_REQUIRED for 2+ hours with a detailed COMMENTED review and zero binding votes.
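A reconciler rule that keeps such a PR in the review queue might look like the Python sketch below. This is a hedged illustration, not the conductor's code: the `Review` type, its `is_abstention` flag, and `needs_rereview` are hypothetical, and only the review state names (APPROVED, CHANGES_REQUESTED, COMMENTED) come from GitHub.

```python
from dataclasses import dataclass
from typing import Iterable

# GitHub review states that count as a binding vote.
BINDING_STATES = {"APPROVED", "CHANGES_REQUESTED"}


@dataclass(frozen=True)
class Review:
    state: str                   # GitHub review state, e.g. "COMMENTED"
    is_abstention: bool = False  # assumption: set when the reviewer abstained


def needs_rereview(reviews: Iterable[Review]) -> bool:
    """True when the PR has an abstention review but no binding vote yet."""
    reviews = list(reviews)
    if any(r.state in BINDING_STATES for r in reviews):
        return False  # a binding vote exists; the review step is complete
    return any(r.state == "COMMENTED" and r.is_abstention for r in reviews)
```

The point of the rule is that "has a review" and "has a binding vote" are different conditions; only the second should stop re-dispatch.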

The rig has multiple layers explicitly designed to catch silent halts. None caught this one.

| Layer | What it should have caught | Why it didn’t |
| --- | --- | --- |
| IssueScanService | An agent-ready issue with no dispatch event | Scans for missing state, not for state where SubmitEvent returned null and never persisted. The projection looked correct because no event existed to disagree with it. |
| ReconciliationService | Live GitHub state ≠ projection state | The PR existed in GitHub and in the projection, so both layers agreed the PR existed. Neither was checking “PR has only abstention reviews and zero binding votes for >Xh.” |
| OrphanScanService | Issues queued > threshold | Threshold tuned for normal traffic; the three idle issues had been moved past queued into assigned (then orphaned silently when SubmitEvent dropped the next event). They didn’t trip the orphan threshold. |
| Halt detector for review-without-vote | A PR with COMMENTED-only reviews and no progress | Did not exist. rc#609 is the meta-fix that adds this class. |

The pattern: each layer is a single-purpose detector that assumes one specific shape of failure. None of them models “the rig is making no progress, why.”

At 2026-04-30 05:58 UTC the operator ran:

gh pr merge --admin dashecorp/rig-conductor#608

This was the only available action that did not depend on the rig’s own dispatch / review / merge pipeline working. The chicken-and-egg shape was: the fix for the silent dispatch could not be dispatched, reviewed, or merged through the very pipeline it fixed.

Once rc#608 merged, the rig recovered on its own — rc#609, rc#610, rc#612 all then flowed through the normal pipeline within ~2 hours.

| Issue | Fixes |
| --- | --- |
| rig-conductor#609 | Unified halt-detection layer: HaltDetectionService + pluggable IHaltDetector interface. New detectors include PrChangesRequestedNoIterationDetector, PrLabelChangedNoReviewDetector, PrLongIdleDetector, ApprovedNotMergedDetector, ReviewRecordPrUnmergedDetector, and (filed under this work) a review-without-binding-vote detector. |
| rig-conductor#610 | Re-dispatch a PR for review when the only review is COMMENTED with self-conflict-abstention language, on the next 5-min reconciler tick, targeting a different agent session. |
| rig-conductor#612 | Structured operator-override channel with grammar and append-only audit trail. Replaces ad-hoc gh pr merge --admin with a recorded, gate-bypassing path that future halt-detectors can correlate against. |
| rig-agent-runtime#181 | Tighten the Review-E CoI predicate so it anchors on PR author rather than shared session context; emit structured CoiBlocked { Reason } events for diagnostics; backfill open false-positive-blocked PRs. |
| rig-agent-runtime#183 | Architectural fix: enforce session/cache separation between Dev-E and Review-E so the CoI predicate can never see shared signals in the first place. |

Single-path safety always has this failure mode. Defense in depth, or it doesn’t actually defend.

Every one of the four safety layers above was correctly implemented for the failure shape it was designed to catch. Each one would have caught a different halt in isolation. None of them caught this one because the halt’s actual shape was new, and no detector models “rig is making no progress, why” as a first-class concept.

The lesson is not “add another detector.” It is rc#609’s framing: a unified halt-detection layer where each new failure shape becomes one more pluggable IHaltDetector rather than a one-off scanner. The rig must monitor its own forward progress as a property, not enumerate the specific ways it can stall.
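The pluggable shape can be sketched in Python as follows. This is only an illustration of the idea, not the HaltDetectionService or IHaltDetector from rc#609: `PrSnapshot`, `HaltDetector`, `ReviewWithoutBindingVote`, `run_detectors`, and the idle threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class PrSnapshot:
    number: int
    review_states: tuple        # e.g. ("COMMENTED",)
    last_progress_at: datetime  # last event that moved the PR forward


class HaltDetector(Protocol):
    """Each failure shape is one detector with the same small interface."""
    name: str

    def check(self, pr: PrSnapshot, now: datetime) -> Optional[str]: ...


class ReviewWithoutBindingVote:
    name = "review-without-binding-vote"

    def __init__(self, max_idle: timedelta) -> None:
        self.max_idle = max_idle  # assumption: threshold chosen by the operator

    def check(self, pr: PrSnapshot, now: datetime) -> Optional[str]:
        binding = {"APPROVED", "CHANGES_REQUESTED"}
        has_binding = any(s in binding for s in pr.review_states)
        idle = now - pr.last_progress_at
        if pr.review_states and not has_binding and idle > self.max_idle:
            return f"PR #{pr.number}: reviewed, no binding vote, idle {idle}"
        return None


def run_detectors(detectors: Iterable[HaltDetector],
                  prs: Iterable[PrSnapshot], now: datetime) -> list:
    """Run every detector over every PR and collect (detector, reason) pairs."""
    detectors = list(detectors)
    findings = []
    for pr in prs:
        for detector in detectors:
            reason = detector.check(pr, now)
            if reason is not None:
                findings.append((detector.name, reason))
    return findings
```

Under this design, a newly discovered stall shape means adding one class to the detector list; the scan loop and the reporting path stay unchanged.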

Trigger to re-evaluate the staging-rig epic

rig-gitops#207 — the staging-rig epic — is currently paused pending the Planner-E rollout. All three of the bugs above would have been caught by a 24-hour soak environment before they reached production.

Action for future Planner-E: once Planner-E is live, replan rig-gitops#207. The cost-of-a-staging-rig calculation now has new evidence on its side: a 6-hour production halt that only an operator admin merge could resolve. The staging soak environment is no longer a “nice to have.”