Story: Fix a stuck reconciliation loop

It was a Tuesday morning. The conductor dashboard showed 30 open issues, all labeled agent-ready, none moving. Dev-E-node’s last heartbeat was 47 minutes old. Something had gone quiet.

This is what happened next — and why it took six minutes, not six hours.


Dev-E-node had been assigned rig-agent-runtime#88 — a refactor of the heartbeat emission logic. Midway through a claude session, the pod lost its API connection. The session terminated without posting a CLI_COMPLETED event. From the conductor’s perspective, Dev-E had the issue claimed and was working.

Except she wasn’t. She was dead.

The conductor’s StaleHeartbeatService runs on a 10-minute cron. It scans for agents whose last HEARTBEAT event is older than the configured threshold (default: 30 minutes). When it finds one, it:

  1. Emits AGENT_STUCK with the agent ID, issue number, and elapsed time
  2. Posts a Discord notification to #dev-e with the stuck detail
  3. Re-queues the issue as unassigned — clearing the lock

At the 47-minute mark, the service fired. The Discord notification landed. The issue was freed.
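
The scan described above can be sketched in a few lines. This is a minimal illustration, not the conductor's actual code: the `Assignment` record, the `scan_for_stale` function, and the `notify` callback (standing in for the Discord post) are all hypothetical names, and the timestamps are arbitrary. Only the 30-minute default threshold, the AGENT_STUCK event, and the three steps come from the story.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

STALE_THRESHOLD = timedelta(minutes=30)  # default threshold named in the story


@dataclass
class Assignment:
    """Hypothetical record of one claimed issue."""
    agent_id: str
    issue: str
    last_heartbeat: datetime
    assigned: bool = True


def scan_for_stale(assignments, now, threshold=STALE_THRESHOLD, notify=None):
    """One scan pass: emit AGENT_STUCK, notify, and release each stale claim."""
    events = []
    for a in assignments:
        elapsed = now - a.last_heartbeat
        if a.assigned and elapsed > threshold:
            event = {
                "type": "AGENT_STUCK",
                "agent_id": a.agent_id,
                "issue": a.issue,
                "elapsed_minutes": int(elapsed.total_seconds() // 60),
            }
            events.append(event)      # step 1: emit the event
            if notify is not None:
                notify(event)         # step 2: stand-in for the Discord post
            a.assigned = False        # step 3: re-queue by clearing the lock
    return events


# Arbitrary example data: one stale claim, one healthy claim.
now = datetime(2024, 1, 2, 10, 0)
claims = [
    Assignment("Dev-E-node", "rig-agent-runtime#88", now - timedelta(minutes=47)),
    Assignment("example-agent", "example-repo#1", now - timedelta(minutes=5)),
]
notified = []
events = scan_for_stale(claims, now, notify=notified.append)
```

In this sketch only the first claim is released; the second agent's heartbeat is five minutes old, so its lock stays in place.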


With the stuck issue back in the queue, Dev-E-node restarted (Kubernetes restartPolicy: Always — the pod had already recycled once). She polled GET /api/assignments/next and got rig-agent-runtime#88 back.

But there were 29 other issues. With one agent instance handling one issue at a time, a 30-issue backlog would drain serially. The operator had three choices:

| Option | Tradeoff |
| --- | --- |
| Wait (serial drain) | ~30 hours at 1 hr/issue average |
| Scale Dev-E replicas temporarily | Faster drain; more concurrent API spend |
| Prioritize: close/defer low-value issues | Reduces queue without scaling |

The operator chose a hybrid: scale the node variant from replicas: 1 to replicas: 3 for the drain, then scale back. This required a one-line edit to the HelmRelease in rig-gitops and a Flux reconcile.

With three concurrent instances, each polling GET /api/assignments/next, the queue drained in parallel. The assignment API’s optimistic concurrency guarantee prevented double-assignment: each call claimed a distinct issue.
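
To make the double-assignment guarantee concrete, here is a small in-memory sketch of optimistic concurrency. It is an assumption about how such a guarantee is commonly built, not the assignment API's real implementation: `AssignmentQueue`, `claim_next`, and the per-issue `version` counter are hypothetical, and the `threading.Lock` only stands in for whatever atomic conditional write the real store provides.

```python
import threading


class Issue:
    def __init__(self, ref):
        self.ref = ref
        self.assignee = None
        self.version = 0  # bumped on every successful claim


class AssignmentQueue:
    """Hypothetical queue: read freely, write only if the version is unchanged."""

    def __init__(self, refs):
        self._issues = [Issue(r) for r in refs]
        self._lock = threading.Lock()  # stand-in for the store's atomic write

    def _compare_and_set(self, issue, expected_version, agent_id):
        """Conditional write: fails if someone else claimed since our read."""
        with self._lock:
            if issue.assignee is not None or issue.version != expected_version:
                return False
            issue.assignee = agent_id
            issue.version += 1
            return True

    def claim_next(self, agent_id):
        """Try each unclaimed issue in order; skip any whose claim race we lose."""
        for issue in self._issues:
            if issue.assignee is not None:
                continue
            seen_version = issue.version          # optimistic read, no lock held
            if self._compare_and_set(issue, seen_version, agent_id):
                return issue.ref
        return None  # queue is empty or every race was lost


queue = AssignmentQueue(["rig-agent-runtime#88", "rig-docs#176", "infra#131"])
claimed = []


def worker(agent_id):
    claimed.append((agent_id, queue.claim_next(agent_id)))


threads = [
    threading.Thread(target=worker, args=(f"Dev-E-node-{i}",)) for i in (1, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whichever order the threads run in, each agent ends up with a different issue, because a claim only succeeds when the issue is still unassigned at write time. Which agent gets which issue is not deterministic in this sketch.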

10:14 Dev-E-node-1 ← rig-agent-runtime#88
10:14 Dev-E-node-2 ← rig-docs#176
10:14 Dev-E-node-3 ← infra#131
10:15 ...29 issues assigned in 47 seconds
11:52 Final issue merged

The backlog that looked like a day of work drained in 98 minutes.


After the drain, Dev-E wrote two memories:

kind: error
title: "Heartbeat gap caused 47-min stuck state"
content: "Dev-E lost API connection mid-session on rig-agent-runtime#88.
StaleHeartbeatService caught it at 47 min (threshold: 30 min).
Issue was re-queued automatically. No human action needed."
importance: 3

kind: decision
title: "Scale replicas temporarily for backlog drain"
content: "replicas: 1 → 3 via HelmRelease edit + flux reconcile.
Drains 30 issues in ~90 min vs ~30 hrs serially.
Scale back to 1 after queue clear."
importance: 4

The next time a backlog piles up, any Dev-E instance reading those memories will know the pattern: check heartbeat thresholds, confirm that auto-re-queue worked, and consider a temporary replica scale-up if the queue exceeds 15 issues.
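
As a rough illustration of how an agent might use those memories, the sketch below models a memory with the four fields shown above and applies the queue-depth rule. The `Memory` class and the `recall` and `should_scale` helpers are hypothetical, as is the abbreviated memory content. The field names and the 15-issue cutoff are taken from the story.

```python
from dataclasses import dataclass

QUEUE_SCALE_THRESHOLD = 15  # "queue > 15 issues" from the story


@dataclass
class Memory:
    kind: str
    title: str
    content: str
    importance: int


def recall(memories, kind=None, min_importance=0):
    """Return matching memories, most important first (hypothetical helper)."""
    hits = [
        m for m in memories
        if (kind is None or m.kind == kind) and m.importance >= min_importance
    ]
    return sorted(hits, key=lambda m: m.importance, reverse=True)


def should_scale(queue_depth, threshold=QUEUE_SCALE_THRESHOLD):
    """Apply the playbook rule: consider scaling only above the threshold."""
    return queue_depth > threshold


memories = [
    Memory("error", "Heartbeat gap caused 47-min stuck state",
           "Issue was re-queued automatically. No human action needed.", 3),
    Memory("decision", "Scale replicas temporarily for backlog drain",
           "replicas: 1 -> 3 via HelmRelease edit + flux reconcile.", 4),
]

top = recall(memories, min_importance=4)
scale_now = should_scale(30)
```

With a 30-issue queue the rule says to consider scaling, and the highest-importance memory returned is the replica decision.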


The stuck state is not a failure. It’s an expected condition in a system where agents run inside pods that can be evicted, lose connectivity, or hit token limits. The failure would be a stuck state that required a human to diagnose and manually re-queue.

The rig’s answer: detection on a fixed schedule (StaleHeartbeatService), automatic recovery (re-queue on threshold breach), and memory (so future agents know the playbook without reading a postmortem).

Next: what happens when you want two Dev-E-dotnet instances instead of one — and why naively doubling replicas breaks the message queue.