Story: Fix a stuck reconciliation loop
It was a Tuesday morning. The conductor dashboard showed 30 open issues, all labeled agent-ready, none moving. Dev-E-node’s last heartbeat was 47 minutes old. Something had gone quiet.
This is what happened next — and why it took six minutes, not six hours.
The stuck state
Dev-E-node had been assigned rig-agent-runtime#88, a refactor of the heartbeat emission logic. Midway through a claude session, the pod lost its API connection. The session terminated without posting a CLI_COMPLETED event. From the conductor’s perspective, Dev-E had the issue claimed and was working.
Except she wasn’t. She was dead.
The conductor’s StaleHeartbeatService runs on a 10-minute cron. It scans for agents whose last HEARTBEAT event is older than the configured threshold (default: 30 minutes). When it finds one, it:
- Emits AGENT_STUCK with the agent ID, issue number, and elapsed time
- Posts a Discord notification to #dev-e with the stuck detail
- Re-queues the issue as unassigned, clearing the lock
At the 47-minute mark, the service fired. The Discord notification landed. The issue was freed.
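The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the conductor's actual StaleHeartbeatService: the names Agent, StuckEvent, and scan_for_stuck are hypothetical, and the sketch assumes only agents that currently hold an issue are worth flagging. The Discord notification is left out; the returned events stand in for the AGENT_STUCK emission.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Tuple

# Default threshold from the story: a heartbeat older than 30 minutes is stale.
STALE_THRESHOLD = timedelta(minutes=30)


@dataclass
class Agent:
    """Hypothetical view of an agent as the conductor might track it."""
    agent_id: str
    last_heartbeat: datetime
    issue: Optional[str] = None  # issue currently claimed, if any


@dataclass
class StuckEvent:
    """Stand-in for the AGENT_STUCK payload: agent ID, issue, elapsed time."""
    agent_id: str
    issue: str
    elapsed: timedelta


def scan_for_stuck(
    agents: List[Agent],
    now: datetime,
    threshold: timedelta = STALE_THRESHOLD,
) -> Tuple[List[StuckEvent], List[str]]:
    """Run one scan pass: flag stale agents and release their issues."""
    events: List[StuckEvent] = []
    requeued: List[str] = []
    for agent in agents:
        elapsed = now - agent.last_heartbeat
        # Assumption: an idle agent with no claimed issue is not "stuck".
        if agent.issue is not None and elapsed > threshold:
            events.append(StuckEvent(agent.agent_id, agent.issue, elapsed))
            requeued.append(agent.issue)
            agent.issue = None  # clear the lock so the issue is unassigned
    return events, requeued


# Arbitrary fixed timestamp so the example is deterministic.
now = datetime(2000, 1, 1, 10, 0, tzinfo=timezone.utc)
agents = [
    Agent("dev-e-node", now - timedelta(minutes=47), "rig-agent-runtime#88"),
    Agent("healthy-agent", now - timedelta(minutes=2), "example#1"),
]
events, requeued = scan_for_stuck(agents, now)
```

A scheduler would call scan_for_stuck on each tick (every 10 minutes in the story) and forward each event to whatever notification channel is configured.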
The backlog drain
With the stuck issue back in the queue, Dev-E-node restarted (Kubernetes restartPolicy: Always; the pod had already recycled once). She polled GET /api/assignments/next and got rig-agent-runtime#88 back.
But there were 29 other issues. With one agent instance handling one issue at a time, a 30-issue backlog would drain serially. The operator had three choices:
| Option | Tradeoff |
|---|---|
| Wait (serial drain) | ~30 hours at 1 hr/issue average |
| Scale Dev-E replicas temporarily | Faster drain; more concurrent API spend |
| Prioritize — close/defer low-value issues | Reduces queue without scaling |
The operator chose a hybrid: scale the node variant from replicas: 1 to replicas: 3 for the drain, then scale back. This required a one-line edit to the HelmRelease in rig-gitops and a Flux reconcile.
With three concurrent instances, each polling GET /api/assignments/next, the queue drained in parallel. The assignment API’s optimistic concurrency guarantee prevented double-assignment: each call claimed a distinct issue.

    10:14 Dev-E-node-1 ← rig-agent-runtime#88
    10:14 Dev-E-node-2 ← rig-docs#176
    10:14 Dev-E-node-3 ← infra#131
    10:15 ... 29 issues assigned in 47 seconds
    11:52 Final issue merged

The backlog that looked like a day of work drained in 98 minutes.
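To make the no-double-assignment property concrete, here is a hedged sketch of optimistic concurrency using a versioned compare-and-set. It is not the assignment API's real implementation: AssignmentQueue, claim_next, and drain are hypothetical names, the store is an in-memory dictionary, and a lock stands in for the atomic conditional write a real database would provide.

```python
import threading
from typing import Dict, List, Optional, Tuple


class AssignmentQueue:
    """In-memory stand-in for the assignment store (hypothetical)."""

    def __init__(self, issues: List[str]) -> None:
        # Each record carries a version that the compare-and-set checks.
        self._records: Dict[str, dict] = {
            issue: {"owner": None, "version": 0} for issue in issues
        }
        # Models the store's atomic conditional write, nothing more.
        self._lock = threading.Lock()

    def _compare_and_set(self, issue: str, expected: int, owner: str) -> bool:
        """Claim the issue only if nobody changed it since we read it."""
        with self._lock:
            record = self._records[issue]
            if record["version"] != expected or record["owner"] is not None:
                return False  # another caller won the race
            record["owner"] = owner
            record["version"] += 1
            return True

    def claim_next(self, owner: str) -> Optional[str]:
        """Return an issue claimed for owner, or None if the queue is empty."""
        while True:
            # Optimistic read: no lock is held while choosing a candidate.
            candidates = [
                (issue, record["version"])
                for issue, record in self._records.items()
                if record["owner"] is None
            ]
            if not candidates:
                return None
            issue, version = candidates[0]
            if self._compare_and_set(issue, version, owner):
                return issue
            # Lost the race: loop, re-read, and try the next free issue.


def drain(queue: AssignmentQueue, owner: str, claimed: List[Tuple[str, str]]) -> None:
    """Keep claiming issues until the queue reports empty."""
    while (issue := queue.claim_next(owner)) is not None:
        claimed.append((owner, issue))
```

Running three drain threads against a 30-issue queue should leave every issue with exactly one owner, which mirrors the property the story relies on: losers of a race retry instead of sharing an issue.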
What the rig learned
After the drain, Dev-E wrote two memories:

    kind: error
    title: "Heartbeat gap caused 47-min stuck state"
    content: "Dev-E lost API connection mid-session on rig-agent-runtime#88. StaleHeartbeatService caught it at 47 min (threshold: 30 min). Issue was re-queued automatically. No human action needed."
    importance: 3

    kind: decision
    title: "Scale replicas temporarily for backlog drain"
    content: "replicas: 1 → 3 via HelmRelease edit + flux reconcile. Drains 30 issues in ~90 min vs ~30 hrs serially. Scale back to 1 after queue clear."
    importance: 4

The next time a backlog piles up, any Dev-E instance reading those memories will know the pattern: check heartbeat thresholds, confirm auto-re-queue worked, consider temporary replica scale if queue > 15 issues.
Closing
The stuck state is not a failure. It’s an expected condition in a system where agents run inside pods that can be evicted, lose connectivity, or hit token limits. The failure would be a stuck state that required a human to diagnose and manually re-queue.
The rig’s answer: fixed detection (StaleHeartbeatService), automatic recovery (re-queue on threshold breach), and memory (so future agents know the playbook without reading a postmortem).
Next: what happens when you want two Dev-E-dotnet instances instead of one — and why naively doubling replicas breaks the message queue.