Story: Scale an agent horizontally
The dashecorp/rig-conductor codebase had developed a queue of C# issues that Dev-E-dotnet could handle but wasn’t getting to fast enough — one instance, one issue at a time. The fix seemed obvious: replicas: 2.
It wasn’t that simple. The first attempt produced two Dev-E-dotnet pods both consuming the same issue from the Redis stream, stepping on each other’s work. Here’s why — and the fix that made horizontal scaling safe.
Background: how the conductor stream works
Rig-conductor uses Redis Streams (XREADGROUP) to distribute work to agents. Each agent variant has a consumer group on the stream. When an issue is dispatched, the conductor appends an ISSUE_ASSIGNED event to the stream. Agents poll the group with:
XREADGROUP GROUP dev-e-dotnet <consumer-id> COUNT 1 BLOCK 5000 STREAMS issues >
The <consumer-id> is the key. In Redis consumer groups, each consumer ID gets its own pending-entry list (PEL). If two pods register the same consumer ID, Redis treats them as one consumer with a single shared PEL, and both pods can receive the same pending message on reconnect.
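To make the read/ACK cycle concrete, here is a minimal sketch of one polling step. It is not the agent's actual code: the function name `consume_once`, the `handler` callback, and the choice of Python are illustrative assumptions, and `client` is assumed to expose redis-py style `xreadgroup` and `xack` methods. Only the stream name, group name, COUNT, and BLOCK values come from the command above.

```python
from typing import Callable, Optional

STREAM = "issues"        # stream name from the command above
GROUP = "dev-e-dotnet"   # consumer group from the command above


def consume_once(client, consumer_id: str,
                 handler: Callable[[dict], None]) -> Optional[str]:
    """Read at most one new entry for this consumer, handle it, then ACK it.

    `client` is assumed to be a redis-py compatible object (hypothetical
    wiring). Returns the processed entry ID, or None if the read timed out.
    """
    # ">" asks for entries never delivered to any consumer in the group.
    reply = client.xreadgroup(GROUP, consumer_id, {STREAM: ">"},
                              count=1, block=5000)
    if not reply:
        return None
    _stream, entries = reply[0]
    if not entries:
        return None
    entry_id, fields = entries[0]
    # Until the ACK below, the entry sits in this consumer ID's PEL.
    handler(fields)
    client.xack(STREAM, GROUP, entry_id)  # ACK removes it from the PEL
    return entry_id
```

Between the read and the ACK, the entry is tracked only under `consumer_id`, which is why the choice of that ID matters once more than one pod is reading.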
The first attempt: static consumer IDs
The original HelmRelease configured a static CONSUMER_ID=dev-e-dotnet-worker environment variable. With replicas: 1, this worked fine: one pod, one consumer ID, no conflict.
With replicas: 2, both pods registered as dev-e-dotnet-worker, so Redis tracked their deliveries in one shared PEL. A message that one pod had received but not yet ACKed was, as far as Redis could tell, also pending for the other pod. On its next reconnect, the second pod replayed that pending entry and started working the same issue.
The result:
- Two PRs opened against the same issue
- Two feature branches, both named feature/issue-N-... (second got a suffix collision)
- Review-E assigned to review two conflicting implementations
- Operator had to close one PR manually
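The failure can be reproduced with a toy in-memory model. This is not Redis and not the rig's code: `ToyGroup` and its methods are hypothetical stand-ins for a consumer group, and the sketch assumes a starting pod replays the pending entries recorded under its consumer name. It contrasts a shared name with two distinct names.

```python
from collections import deque


class ToyGroup:
    """In-memory stand-in for one consumer group (a model, not real Redis)."""

    def __init__(self, entries):
        self.undelivered = deque(entries)
        self.pel = {}  # consumer name -> delivered but un-ACKed entries

    def read_new(self, consumer):
        """Like reading with ID '>': hand out the next never-delivered entry."""
        if not self.undelivered:
            return None
        entry = self.undelivered.popleft()
        self.pel.setdefault(consumer, []).append(entry)
        return entry

    def read_pending(self, consumer):
        """Like reading with ID '0': replay this consumer name's PEL."""
        return list(self.pel.get(consumer, []))

    def ack(self, consumer, entry):
        """Like XACK: drop the entry from that consumer name's PEL."""
        self.pel[consumer].remove(entry)


# Shared name: pod 1 takes issue-201 and is still working on it (no ACK yet).
shared = ToyGroup(["issue-201", "issue-202"])
shared.read_new("dev-e-dotnet-worker")
# Pod 2 connects under the same name and replays "its" pending entries.
duplicate = shared.read_pending("dev-e-dotnet-worker")

# Distinct names: same sequence, but pod 2 has its own, empty PEL.
split = ToyGroup(["issue-201", "issue-202"])
split.read_new("pod-1")
no_duplicate = split.read_pending("pod-2")

print(duplicate)     # ['issue-201']
print(no_duplicate)  # []
```

With one shared name, the second pod sees work that the first pod already holds; with distinct names, its replay comes back empty.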
The fix: HOSTNAME-based consumer IDs
The root cause was a static consumer ID shared across pods. The fix: use the pod name as the consumer ID. It is unique per pod, Kubernetes sets the pod's HOSTNAME to it by default, and the Downward API exposes it as metadata.name.
env:
  - name: CONSUMER_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name  # pod name: dev-e-dotnet-6f7b9c-xk2p1
With pod-name consumer IDs:
- dev-e-dotnet-6f7b9c-xk2p1 claims issue A
- dev-e-dotnet-6f7b9c-m8r3q claims issue B
- Each pod has its own PEL; no shared pending entries
- A replaced pod gets a new pod name → new consumer ID → empty PEL (an in-place container restart keeps the pod name and its PEL)
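On the agent side, picking up the per-pod ID could look like the sketch below. The helper name `resolve_consumer_id` and the hostname fallback are assumptions for illustration; the source only states that CONSUMER_ID is injected from metadata.name.

```python
import os
import socket


def resolve_consumer_id(env=os.environ) -> str:
    """Pick a per-pod consumer name (hypothetical helper).

    Prefers CONSUMER_ID, which the manifest fills from metadata.name, and
    falls back to the hostname, which Kubernetes sets to the pod name by
    default.
    """
    consumer_id = env.get("CONSUMER_ID", "").strip()
    return consumer_id or socket.gethostname()
```

The fallback keeps the ID unique per pod even if the environment variable is missing, rather than silently reverting to a shared default.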
After the fix
With HOSTNAME-based consumer IDs and replicas: 2:
09:00 dev-e-dotnet-abc ← rig-conductor#201 (unit test coverage gap)
09:00 dev-e-dotnet-xyz ← rig-conductor#202 (missing validation on event schema)
09:47 rig-conductor#201 PR merged
09:52 rig-conductor#202 PR merged
09:53 dev-e-dotnet-abc ← rig-conductor#203
09:53 dev-e-dotnet-xyz ← rig-conductor#204
Two issues in parallel, clean assignments, no duplicates. The dotnet queue that had been running at about 1 issue/hour could now clear up to 2/hour. In practice the gain falls short of a strict doubling (issues vary in complexity), but it is a meaningful throughput improvement for the sustained C# backlog.
What the rig learned
This gotcha is now in memory with importance: 5 (the highest):
title: "XREADGROUP duplicate consumption on static consumer IDs"
content: "Naively scaling XREADGROUP consumers to replicas > 1 with a static CONSUMER_ID causes duplicate message delivery. Fix: use metadata.name (pod name) as CONSUMER_ID. The per-pod partitioning pattern is the correct approach."
tags: [redis, xreadgroup, scaling, gotcha]
importance: 5
Any agent session in the rig that reads memories with the redis tag finds this entry immediately, before attempting naive horizontal scaling.
Next: adding a brand-new agent persona to the rig, from mascot image to first PR review.