Per-pod consumer partitioning: XREADGROUP gotcha and the HOSTNAME fix

Background

Rig-conductor uses Redis Streams with consumer groups (XREADGROUP) to distribute work from the event stream to agent pods. The pattern is:

Conductor appends ISSUE_ASSIGNED event to stream rig:dispatch
Dev-E pods call XREADGROUP GROUP dev-e-node <consumer-id> COUNT 1 BLOCK 5000 STREAMS rig:dispatch >
The > means “give me only new, undelivered messages”
After processing, the pod calls XACK rig:dispatch dev-e-node <message-id>

This is a solid pattern for exactly-once delivery — with one critical caveat.

The gotcha: static consumer IDs with replicas > 1

Redis consumer groups track per-consumer pending-entry lists (PEL). Each consumer ID has its own list of messages delivered but not yet ACKed. If a consumer disconnects before ACKing, its PEL entries remain and can be reclaimed.

The problem emerges when multiple pods share the same consumer ID:

Pod A: CONSUMER_ID=dev-e-worker → receives message M1
Pod B: CONSUMER_ID=dev-e-worker → also registered as dev-e-worker

# Redis sees one consumer named dev-e-worker with M1 in its PEL
# Pod A ACKs M1 — PEL cleared
# Pod B reconnects — Redis still thinks dev-e-worker has pending work
# Redis re-delivers M1 to Pod B
# Result: duplicate processing

In practice, this manifested as two Dev-E-dotnet pods both working the same issue, producing two branches and two PRs. The second PR had a branch name collision (got a -2 suffix). Review-E was assigned both. The operator had to manually close one.

Why it’s subtle

This doesn’t show up with replicas: 1. It only manifests when:

Two pods are alive simultaneously AND sharing a consumer ID, OR
A pod restarts and the new instance inherits the old pod’s consumer ID before the PEL reclaim timeout fires

The second case is especially tricky: after a restart, the new pod may immediately pick up the previous pod’s unACKed message. Even without two concurrent pods, a restart loop can cause repeated delivery of the same message.

The fix: HOSTNAME-based consumer IDs

Kubernetes sets HOSTNAME to the pod name, which is unique and stable within a pod’s lifetime. Using it as the consumer ID gives each pod a distinct identity in the consumer group:

# In HelmRelease values or Deployment env:
env:
  - name: CONSUMER_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name

In the agent runtime:

const consumerId = process.env.CONSUMER_ID || process.env.HOSTNAME;
// Results in: dev-e-dotnet-6f7b9c-xk2p1

Each pod’s consumer ID is now:

Unique — no two live pods share an ID
Stable — the same pod always uses the same ID across message deliveries
Ephemeral — a pod restart produces a new ID, leaving the old PEL for the reclaim sweep

The reclaim sweep

Abandoned PEL entries (from crashed pods) are reclaimed by the conductor’s PEL cleanup cron:

// Find entries pending longer than 5 minutes from consumers
// that are no longer alive (not in the active pod list)
const stale = await redis.xpending(
  'rig:dispatch', 'dev-e-node',
  '-', '+', 100
);
for (const entry of stale) {
  if (entry.elapsedMs > 5 * 60 * 1000 && !activePods.has(entry.consumer)) {
    await redis.xclaim('rig:dispatch', 'dev-e-node', 'reclaimer', 0, entry.id);
    // Re-enqueue for live consumers
  }
}

This means a pod that dies mid-task has its work reclaimed and re-dispatched within 5 minutes. Combined with rig-conductor’s optimistic concurrency on the Marten event stream, this gives at-least-once delivery at the Redis layer and exactly-once claim at the Marten layer.

Summary

Approach	Single replica	Multiple replicas	Restart safety
Static consumer ID	✅ Works	❌ Duplicates	⚠️ Re-delivers PEL entries
HOSTNAME-based consumer ID	✅ Works	✅ Safe	✅ New ID on restart

Rule: Never use a static consumer ID in a horizontally-scaled XREADGROUP setup. Always use a pod-unique identifier (pod name, pod UID, or a UUID generated at pod start).