Story: Scale an agent horizontally

Hook

The dashecorp/rig-conductor codebase had developed a queue of C# issues that Dev-E-dotnet could handle but wasn’t getting to fast enough — one instance, one issue at a time. The fix seemed obvious: replicas: 2.

It wasn’t that simple. The first attempt produced two Dev-E-dotnet pods both consuming the same issue from the Redis stream, stepping on each other’s work. Here’s why — and the fix that made horizontal scaling safe.

Background: how the conductor stream works

Rig-conductor uses Redis Streams (XREADGROUP) for distributing work to agents. Each agent variant has a consumer group on the stream. When an issue is dispatched, the conductor appends an ISSUE_ASSIGNED event to the stream. Agents poll the group with:

XREADGROUP GROUP dev-e-dotnet <consumer-id> COUNT 1 BLOCK 5000 STREAMS issues >

The <consumer-id> is the key. Within a consumer group, Redis records every delivered but not yet ACKed message in a pending-entry list (PEL), keyed by the consumer ID that received it. If two pods register the same consumer ID, Redis treats them as the same consumer, and both pods can receive the same pending message on reconnect.
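As a rough illustration of the polling step, here is a minimal Python sketch. It assumes a client object with redis-py style xreadgroup and xack methods and the default reply shape (a list of [stream, entries] pairs). The function name poll_once and the handle callback are hypothetical, and the real agents may be written quite differently.

```python
from typing import Any, Callable, Optional

STREAM = "issues"        # stream name from the XREADGROUP command above
GROUP = "dev-e-dotnet"   # consumer group for this agent variant


def poll_once(client: Any, consumer_id: str,
              handle: Callable[[dict], None]) -> Optional[Any]:
    """Read at most one new entry for this consumer, handle it, then ACK it.

    `client` is assumed to expose redis-py style xreadgroup/xack methods.
    Returns the processed entry ID, or None if the blocking read timed out.
    """
    # ">" asks only for entries never delivered to any consumer in the group.
    reply = client.xreadgroup(
        groupname=GROUP,
        consumername=consumer_id,
        streams={STREAM: ">"},
        count=1,
        block=5000,  # milliseconds, matching BLOCK 5000
    )
    if not reply:
        return None  # timed out with nothing new
    _stream, entries = reply[0]
    entry_id, fields = entries[0]
    handle(fields)                        # do the work first...
    client.xack(STREAM, GROUP, entry_id)  # ...then drop it from the pending list
    return entry_id
```

With redis-py installed, client would typically be a redis.Redis(...) instance. The entry stays in the pending list for consumer_id from the moment it is delivered until the xack call, which is why the choice of consumer ID matters.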

The first attempt: static consumer IDs

The original HelmRelease configured a static CONSUMER_ID=dev-e-dotnet-worker environment variable. With replicas: 1, this worked fine — one pod, one consumer ID, no conflict.

With replicas: 2, both pods registered as dev-e-dotnet-worker, so Redis tracked their pending entries under a single consumer. A message delivered to the first pod and not yet ACKed was therefore also pending for the second pod. On its next reconnect, the second pod re-read that pending history, received the in-flight message, and started working the same issue.
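The failure can be reproduced with a toy in-memory model. This is not Redis, only a sketch of how pending entries are keyed by consumer name. ToyGroup and replay_after_reconnect are hypothetical names, and the model assumes an agent replays its pending entries when it reconnects.

```python
from collections import defaultdict


class ToyGroup:
    """Toy in-memory stand-in for one consumer group (not real Redis)."""

    def __init__(self, entries):
        self.undelivered = list(entries)   # entries no consumer has seen yet
        self.pending = defaultdict(list)   # consumer name -> delivered, un-ACKed

    def read_new(self, consumer):
        """Like reading with '>': deliver the next never-delivered entry."""
        if not self.undelivered:
            return None
        entry = self.undelivered.pop(0)
        self.pending[consumer].append(entry)
        return entry

    def read_pending(self, consumer):
        """Like reading with '0': replay this consumer name's un-ACKed entries."""
        return list(self.pending[consumer])


def replay_after_reconnect(name_pod_1, name_pod_2):
    """Pod 1 takes an issue; pod 2 then reconnects and replays its pending list."""
    group = ToyGroup(["issue-A", "issue-B"])
    group.read_new(name_pod_1)              # pod 1 is now working on issue-A
    return group.read_pending(name_pod_2)   # what pod 2 sees on reconnect


shared = replay_after_reconnect("dev-e-dotnet-worker", "dev-e-dotnet-worker")
unique = replay_after_reconnect("dev-e-dotnet-6f7b9c-xk2p1",
                                "dev-e-dotnet-6f7b9c-m8r3q")
print(shared)  # ['issue-A']  pod 2 picks up pod 1's in-flight issue
print(unique)  # []           pod 2 has no pending entries of its own
```

With a shared name, the second pod's replay returns work the first pod has not finished. With distinct names, the replay is empty.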

The result:

Two PRs opened against the same issue
Two feature branches, both named feature/issue-N-... (the second collided on the name and got a suffix)
Review-E assigned to review two conflicting implementations
Operator had to close one PR manually

The fix: HOSTNAME-based consumer IDs

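The root cause was a static consumer ID shared across pods. The fix: use the pod name as the consumer ID. The pod name is unique per pod and is also the pod's default hostname, which is what the HOSTNAME environment variable reports. The HelmRelease injects it through the Downward API field metadata.name: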

env:
  - name: CONSUMER_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name  # pod name: dev-e-dotnet-6f7b9c-xk2p1
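On the agent side, reading that value might look like the following Python sketch. The function name resolve_consumer_id and the fallback order (CONSUMER_ID, then HOSTNAME, then the OS hostname) are assumptions for illustration, not the rig's actual code.

```python
import os
import socket


def resolve_consumer_id(environ=None):
    """Pick a per-pod consumer ID: CONSUMER_ID, then HOSTNAME, then OS hostname."""
    env = os.environ if environ is None else environ
    # Empty strings are treated as unset so a blank variable cannot become the ID.
    consumer_id = (env.get("CONSUMER_ID")
                   or env.get("HOSTNAME")
                   or socket.gethostname())
    if not consumer_id:
        raise RuntimeError("no consumer ID available")
    return consumer_id
```

Passing a plain dict as environ makes the lookup easy to exercise without touching the real process environment.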

With pod-name consumer IDs:

dev-e-dotnet-6f7b9c-xk2p1 claims issue A
dev-e-dotnet-6f7b9c-m8r3q claims issue B
Each pod has its own PEL; no shared pending entries
A replaced pod gets a new pod name → new consumer ID → clean state (entries still pending under the old name stay there until they are claimed or ACKed)

After the fix

With HOSTNAME-based consumer IDs and replicas: 2:

09:00  dev-e-dotnet-abc ← rig-conductor#201 (unit test coverage gap)
09:00  dev-e-dotnet-xyz ← rig-conductor#202 (missing validation on event schema)
09:47  rig-conductor#201 PR merged
09:52  rig-conductor#202 PR merged
09:53  dev-e-dotnet-abc ← rig-conductor#203
09:53  dev-e-dotnet-xyz ← rig-conductor#204

Two issues in parallel, clean assignments, no duplicates. The dotnet queue that had been running at 1 issue/hour could now clear up to 2/hour. In practice the gain falls short of a strict doubling because issues vary in complexity, but it is a meaningful throughput improvement for the sustained C# backlog.

What the rig learned

This gotcha is now in memory with importance: 5 (the highest):

title: "XREADGROUP duplicate consumption on static consumer IDs"
content: "Naively scaling XREADGROUP consumers to replicas > 1 with
  a static CONSUMER_ID causes duplicate message delivery.
  Fix: use metadata.name (pod name) as CONSUMER_ID.
  The per-pod partitioning pattern is the correct approach."
tags: [redis, xreadgroup, scaling, gotcha]
importance: 5

Any agent session in the rig that reads memories with the redis tag finds this entry immediately — before attempting naive horizontal scaling.
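A tag lookup of this kind could be sketched as follows. The in-memory list, the recall function, and the second entry are hypothetical stand-ins. The rig's real memory store and its API are not shown in this story.

```python
memories = [
    {
        "title": "XREADGROUP duplicate consumption on static consumer IDs",
        "tags": ["redis", "xreadgroup", "scaling", "gotcha"],
        "importance": 5,
    },
    {   # hypothetical second entry, only here to show the ordering
        "title": "Example low-priority note",
        "tags": ["redis"],
        "importance": 2,
    },
]


def recall(entries, tag):
    """Return entries carrying `tag`, most important first."""
    matches = [e for e in entries if tag in e["tags"]]
    return sorted(matches, key=lambda e: e["importance"], reverse=True)


for entry in recall(memories, "redis"):
    print(entry["importance"], entry["title"])
```

Sorting by importance puts the importance-5 gotcha ahead of lower-priority notes that share the tag.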

Next: adding a brand-new agent persona to the rig, from mascot image to first PR review.