Postmortem — chicken-and-egg dispatch halt (2026-04-30)

The Planner-E rollout was blocked for ~6 hours by three independent bugs that combined to prevent the rig from self-healing. The fix PR could not merge through the rig’s own pipeline because the bug it fixed was itself preventing dispatch. An operator’s `gh pr merge --admin` broke the cycle.

TL;DR

| Item | Value |
| --- | --- |
| Outage window | ~6h, 2026-04-29 ~21:30 UTC → 2026-04-30 ~05:58 UTC |
| User-visible impact | Three `agent-ready` issues idle for 2 days; Planner-E rollout blocked; staging-rig epic (`rig-gitops#207`) replan blocked behind it |
| Independent bugs | 3 (dispatch silent no-op + Review-E CoI false-positive + COMMENTED-as-final) |
| Layers of safety that should have caught it | 4 (IssueScan, Reconciliation, OrphanScan, halt detection) |
| Layers that did | 0 |
| Break-the-cycle action | Operator `gh pr merge --admin` on `rig-conductor#608` |
| Systemic fixes filed | 5 (rc#609, rc#610, rc#612, rar#181, rar#183) |

Timeline

All timestamps UTC. Sourced from the GitHub created_at / closed_at fields on the cited issues and PRs.

| When | Event | Source |
| --- | --- | --- |
| 2026-04-29 20:09 | Operator files `rc#604`: rig-agent-runtime intake silent for 2 days, three `agent-ready` issues idle | rc#604 |
| 2026-04-29 20:32 | `rc#605` merges: `StaleAgentReadyWatcher` (1h scan) added as a recovery layer for `queued && AgentId is null > 24h` | rc#605 |
| 2026-04-29 21:03 | `rc#607` filed: the root cause behind rc#604 is identified. `SubmitEvent` returns null on the dispatch path, silently dropping fresh `agent-ready` events | rc#607 |
| 2026-04-29 21:31 | `rc#608` opened with the fix for the silent no-op | rc#608 |
| 2026-04-29 22:02 | `rar#181` filed: Review-E falsely flags Dev-E-authored PRs as conflict-of-interest, 11+ occurrences observed (rc#429, rc#432). rc#608 is in the firing line. | rar#181 |
| 2026-04-30 ~early | Review-E posts a `COMMENTED` review on rc#608 stating it is a self-conflict abstention and explicitly promising “the binding approve / request-changes vote should come from an independent Review-E run on the next cron tick.” No follow-up vote ever fires. | rc#610 body |
| 2026-04-30 05:22 | `rc#609` filed: meta-fix, a unified halt-detection layer across all halt classes (“the biggest problem is that everything stops”) | rc#609 |
| 2026-04-30 05:56 | `rc#610` filed: `COMMENTED` reviews treated as final; promised “next cron tick” re-review never fires | rc#610 |
| 2026-04-30 05:58 | Operator runs `gh pr merge --admin` on rc#608 to break the cycle. PR closes as MERGED. | rc#608 closed_at |
| 2026-04-30 06:01 | `rc#612` and `rar#183` filed in parallel: structured operator-override channel with audit trail; session/cache separation between Dev-E and Review-E | rc#612, rar#183 |
| 2026-04-30 06:32 | rc#610 closes (re-review-on-COMMENTED-with-abstention shipped) | rc#610 closed_at |
| 2026-04-30 07:47 | rc#612 closes (operator-override grammar + audit shipped) | rc#612 closed_at |

The three independent bugs

Each bug was, on its own, a recoverable single-layer failure. The interaction is what halted the rig.

1. `SubmitEvent` silent no-op — `rc#608`

`SubmitEvent` on the dispatch path returned null rather than persisting the event for fresh `agent-ready` issues. The producer believed the event was accepted; the projection never advanced. The failure was visible only as “two days of nothing happening” until rc#604 surfaced it. This was the root cause behind rc#604’s “intake silent” symptom: rc#605’s `StaleAgentReadyWatcher` patched the symptom (re-emit after 24h), and rc#608 patched the actual bug.
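The general shape of a fix for this class of bug can be sketched as follows. This is a minimal illustration, not the code from rc#608: `InMemoryEventStore`, `EventNotPersisted`, `submit_event`, and the `drop_fresh` flag are all hypothetical names invented here. The idea is only that a submit call either confirms persistence or raises, so the producer can never mistake a dropped event for an accepted one.

```python
from dataclasses import dataclass, field
from typing import Optional


class EventNotPersisted(Exception):
    """Raised when a submitted event cannot be confirmed as stored."""


@dataclass
class InMemoryEventStore:
    """Hypothetical stand-in for the real event store."""
    events: list = field(default_factory=list)
    drop_fresh: bool = False  # assumption: simulates the silent-drop path

    def append(self, event: dict) -> Optional[int]:
        if self.drop_fresh and event.get("fresh"):
            return None  # buggy path: nothing stored, nothing raised
        self.events.append(event)
        return len(self.events) - 1  # position of the stored event


def submit_event(store: InMemoryEventStore, event: dict) -> int:
    """Persist an event or raise; never hand None back to the producer."""
    position = store.append(event)
    if position is None:
        raise EventNotPersisted(f"event for issue {event['issue']} was not stored")
    return position
```

With this shape, a dropped event surfaces as an exception at the call site instead of as days of silence downstream.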

2. Review-E CoI predicate over-anchored on session context — `rar#181`

The conflict-of-interest gate is a real safety property: a single agent must not approve its own work. The implementation, however, anchored on signals shared by the Dev-E and Review-E sessions rather than on the actual PR author. Result: Dev-E → Review-E handoffs were rejected as self-review at least 11 times (rc#429, rc#432), including on rc#608 itself. The earlier rar#169 was a partial anchor fix; rar#181 is the full predicate fix; rar#183 removes the cause of the shared signal (session/cache separation).
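The difference between the two anchoring strategies can be shown with a small sketch. The field names (`reviewer_session_id`, `author_session_id`) and both predicate functions are assumptions made for illustration; the source does not say which shared signals the real predicate read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewContext:
    pr_author: str            # identity that authored the PR
    reviewer: str             # identity about to cast a vote
    reviewer_session_id: str  # hypothetical shared-session signal
    author_session_id: str    # hypothetical shared-session signal


def coi_by_session(ctx: ReviewContext) -> bool:
    """Over-anchored variant: blocks whenever session signals overlap."""
    return ctx.reviewer_session_id == ctx.author_session_id


def coi_by_author(ctx: ReviewContext) -> bool:
    """Author-anchored variant: blocks only when the reviewer wrote the PR."""
    return ctx.reviewer == ctx.pr_author


# A legitimate Dev-E -> Review-E handoff that happens to share a session.
handoff = ReviewContext("dev-e", "review-e", "session-1", "session-1")
print(coi_by_session(handoff))  # True: false positive
print(coi_by_author(handoff))   # False: handoff allowed
```

The author-anchored predicate still blocks genuine self-review, since it fires whenever `reviewer` equals `pr_author` regardless of session.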

3. Conductor treating `COMMENTED` reviews as final, no “next cron tick” — `rc#610`

When Review-E correctly recognised the CoI condition and posted a COMMENTED review labeled as a self-conflict abstention, the conductor’s reconciler treated the PR as “reviewed” and stopped re-dispatching it. The promised re-review on the next 5-minute cron tick was never wired. The PR sat in REVIEW_REQUIRED for 2+ hours with a detailed COMMENTED review and zero binding votes.
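A reconciler rule of the kind rc#610 describes might look like the sketch below. `APPROVED`, `CHANGES_REQUESTED`, and `COMMENTED` are GitHub review states; the `Review` type, the `is_abstention` flag, and `needs_rereview` are hypothetical, and how the real conductor recognises abstention language is not covered here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Review:
    state: str                   # GitHub review state, e.g. "COMMENTED"
    is_abstention: bool = False  # assumption: set when the body declares a self-conflict


def needs_rereview(reviews: Sequence[Review]) -> bool:
    """True when every review so far is a COMMENTED self-conflict abstention.

    An empty list returns False: first-time dispatch is assumed to be
    handled elsewhere. Any binding vote (APPROVED or CHANGES_REQUESTED)
    also returns False, because it is not a COMMENTED abstention.
    """
    return bool(reviews) and all(
        r.state == "COMMENTED" and r.is_abstention for r in reviews
    )
```

Under this rule the state rc#608 was stuck in, one abstention and zero binding votes, would be picked up for re-dispatch on the next reconciler tick rather than being counted as reviewed.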

Why each layer of safety failed

The rig has multiple layers explicitly designed to catch silent halts. None caught this one.

| Layer | What it should have caught | Why it didn’t |
| --- | --- | --- |
| `IssueScanService` | An `agent-ready` issue with no dispatch event | Scans for missing state, not for state where `SubmitEvent` returned null and never persisted. The projection looked correct because no event existed to disagree with it. |
| `ReconciliationService` | Live GitHub state ≠ projection state | The PR existed in GitHub and in the projection; both layers agreed the PR existed. Neither was checking “PR has only abstention reviews and zero binding votes for >Xh.” |
| `OrphanScanService` | Issues queued > threshold | Threshold tuned for normal traffic; the three idle issues had been moved past `queued` into `assigned` (then orphaned silently when `SubmitEvent` dropped the next event). They didn’t trip the orphan threshold. |
| Halt detector for `review-without-vote` | A PR with COMMENTED-only reviews and no progress | Did not exist. rc#609 is the meta-fix that adds this class. |

The pattern: each layer is a single-purpose detector that assumes one specific shape of failure. None of them models “the rig is making no progress, why.”

The break-the-cycle action

At 2026-04-30 05:58 UTC the operator ran:

gh pr merge --admin dashecorp/rig-conductor#608

This was the only action available that does not depend on the rig’s own dispatch / review / merge pipeline working. The chicken-and-egg shape was: the fix for the silent dispatch could not be dispatched, reviewed, or merged through the very pipeline it fixed.

Once rc#608 merged, the rig recovered on its own — rc#609, rc#610, rc#612 all then flowed through the normal pipeline within ~2 hours.

Systemic fixes filed

| Issue | Fixes |
| --- | --- |
| `rig-conductor#609` | Unified halt-detection layer: `HaltDetectionService` + pluggable `IHaltDetector` interface. New detectors include `PrChangesRequestedNoIterationDetector`, `PrLabelChangedNoReviewDetector`, `PrLongIdleDetector`, `ApprovedNotMergedDetector`, `ReviewRecordPrUnmergedDetector`, and (filed under this work) a review-without-binding-vote detector. |
| `rig-conductor#610` | Re-dispatch a PR for review when the only review is `COMMENTED` with self-conflict-abstention language, on the next 5-min reconciler tick, targeting a different agent session. |
| `rig-conductor#612` | Structured operator-override channel with grammar and append-only audit trail. Replaces ad-hoc `gh pr merge --admin` with a recorded, gate-bypassing path that future halt-detectors can correlate against. |
| `rig-agent-runtime#181` | Tighten the Review-E CoI predicate so it anchors on PR author rather than shared session context; emit structured `CoiBlocked { Reason }` events for diagnostics; backfill open false-positive-blocked PRs. |
| `rig-agent-runtime#183` | Architectural fix: enforce session/cache separation between Dev-E and Review-E so the CoI predicate can never see shared signals in the first place. |
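To make the append-only audit trail from `rig-conductor#612` concrete, here is a rough sketch. The class name, the entry fields, and the `"admin-merge"` action string are all assumptions; the actual override grammar is defined in rc#612 and is not reproduced here.

```python
import json
from datetime import datetime, timezone
from typing import Optional


class OverrideAuditLog:
    """Append-only record of operator overrides (hypothetical shape)."""

    def __init__(self) -> None:
        # Entries are stored serialized; the class exposes no update or delete.
        self._lines: list = []

    def record(self, operator: str, action: str, target: str, reason: str,
               at: Optional[datetime] = None) -> None:
        entry = {
            "at": (at or datetime.now(timezone.utc)).isoformat(),
            "operator": operator,
            "action": action,
            "target": target,
            "reason": reason,
        }
        self._lines.append(json.dumps(entry, sort_keys=True))

    def overrides_for(self, target: str) -> tuple:
        """Return all recorded overrides for one PR, oldest first."""
        entries = (json.loads(line) for line in self._lines)
        return tuple(e for e in entries if e["target"] == target)
```

A halt detector could call `overrides_for("rig-conductor#608")` to tell an operator-forced merge apart from one that passed the normal gates.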

Principle reinforced

Single-path safety always has this failure mode. Defense in depth, or it doesn’t actually defend.

Each of the three existing safety layers above was correctly implemented for the failure shape it was designed to catch, and each would have caught a different halt in isolation; the fourth, a halt detector for `review-without-vote`, did not exist. None of them caught this one because the halt’s actual shape was new, and no detector models “rig is making no progress, why” as a first-class concept.

The lesson is not “add another detector.” It is rc#609’s framing: a unified halt-detection layer where each new failure shape becomes one more pluggable IHaltDetector rather than a one-off scanner. The rig must monitor its own forward progress as a property, not enumerate the specific ways it can stall.
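The pluggable-detector idea can be sketched in a few lines. This is an illustration of the pattern only: the Python names below, the `PrSnapshot` fields, and the hour thresholds are assumptions and do not mirror the real `HaltDetectionService` or `IHaltDetector` signatures.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Protocol


@dataclass(frozen=True)
class PrSnapshot:
    number: int
    review_states: tuple  # GitHub review states seen so far
    idle_hours: float     # hours since the last progress event


@dataclass(frozen=True)
class Halt:
    detector: str
    pr_number: int


class HaltDetector(Protocol):
    name: str

    def detect(self, pr: PrSnapshot) -> Optional[Halt]: ...


class ReviewWithoutBindingVote:
    name = "review-without-binding-vote"

    def __init__(self, max_idle_hours: float = 1.0) -> None:  # hypothetical threshold
        self.max_idle_hours = max_idle_hours

    def detect(self, pr: PrSnapshot) -> Optional[Halt]:
        binding = {"APPROVED", "CHANGES_REQUESTED"}
        reviewed = bool(pr.review_states)
        has_vote = bool(binding & set(pr.review_states))
        if reviewed and not has_vote and pr.idle_hours > self.max_idle_hours:
            return Halt(self.name, pr.number)
        return None


class LongIdle:
    name = "pr-long-idle"

    def __init__(self, max_idle_hours: float = 24.0) -> None:  # hypothetical threshold
        self.max_idle_hours = max_idle_hours

    def detect(self, pr: PrSnapshot) -> Optional[Halt]:
        if pr.idle_hours > self.max_idle_hours:
            return Halt(self.name, pr.number)
        return None


def run_detectors(detectors: Iterable[HaltDetector],
                  prs: Iterable[PrSnapshot]) -> List[Halt]:
    """Run every detector over every PR and collect the halts found."""
    detectors = list(detectors)
    halts = []
    for pr in prs:
        for detector in detectors:
            halt = detector.detect(pr)
            if halt is not None:
                halts.append(halt)
    return halts
```

Adding a new failure shape then means writing one more class with a `detect` method and passing it to `run_detectors`, with no change to the loop itself.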

Trigger to re-evaluate the staging-rig epic

rig-gitops#207 — the staging-rig epic — is currently paused pending the Planner-E rollout. All three of the bugs above would have been caught by a 24-hour soak environment before they reached production.

Action for future Planner-E: once Planner-E is live, replan rig-gitops#207. The cost-of-a-staging-rig calculation now has new evidence on its side: a 6-hour production halt that only an admin merge could resolve. The staging soak environment is no longer a “nice to have.”

Refs

rig-conductor#604 — initial symptom
rig-conductor#605 — StaleAgentReadyWatcher (symptom patch)
rig-conductor#607 — root-cause issue
rig-conductor#608 — the admin-merged fix
rig-conductor#609 — halt-detection meta-fix
rig-conductor#610 — COMMENTED-as-final fix
rig-conductor#612 — operator-override channel
rig-agent-runtime#181 — CoI predicate fix
rig-agent-runtime#183 — session/cache separation
rig-gitops#207 — staging-rig epic (paused)
