Skip to content

Per-pod consumer partitioning: XREADGROUP gotcha and the HOSTNAME fix

Per-pod consumer partitioning: XREADGROUP gotcha and the HOSTNAME fix

Section titled “Per-pod consumer partitioning: XREADGROUP gotcha and the HOSTNAME fix”

Rig-conductor uses Redis Streams with consumer groups (XREADGROUP) to distribute work from the event stream to agent pods. The pattern is:

  1. Conductor appends ISSUE_ASSIGNED event to stream rig:dispatch
  2. Dev-E pods call XREADGROUP GROUP dev-e-node <consumer-id> COUNT 1 BLOCK 5000 STREAMS rig:dispatch >
  3. The > means “give me only new, undelivered messages”
  4. After processing, the pod calls XACK rig:dispatch dev-e-node <message-id>

This is a solid pattern for exactly-once delivery — with one critical caveat.

The gotcha: static consumer IDs with replicas > 1

Section titled “The gotcha: static consumer IDs with replicas > 1”

Redis consumer groups track per-consumer pending-entry lists (PEL). Each consumer ID has its own list of messages delivered but not yet ACKed. If a consumer disconnects before ACKing, its PEL entries remain and can be reclaimed.

The problem emerges when multiple pods share the same consumer ID:

Pod A: CONSUMER_ID=dev-e-worker → receives message M1
Pod B: CONSUMER_ID=dev-e-worker → also registered as dev-e-worker
# Redis sees one consumer named dev-e-worker with M1 in its PEL
# Pod A ACKs M1 — PEL cleared
# Pod B reconnects — Redis still thinks dev-e-worker has pending work
# Redis re-delivers M1 to Pod B
# Result: duplicate processing

In practice, this manifested as two Dev-E-dotnet pods both working the same issue, producing two branches and two PRs. The second PR had a branch name collision (got a -2 suffix). Review-E was assigned both. The operator had to manually close one.

This doesn’t show up with replicas: 1. It only manifests when:

  • Two pods are alive simultaneously AND sharing a consumer ID, OR
  • A pod restarts and the new instance inherits the old pod’s consumer ID before the PEL reclaim timeout fires

The second case is especially tricky: after a restart, the new pod may immediately pick up the previous pod’s unACKed message. Even without two concurrent pods, a restart loop can cause repeated delivery of the same message.

Kubernetes sets HOSTNAME to the pod name, which is unique and stable within a pod’s lifetime. Using it as the consumer ID gives each pod a distinct identity in the consumer group:

# In HelmRelease values or Deployment env:
env:
- name: CONSUMER_ID
valueFrom:
fieldRef:
fieldPath: metadata.name

In the agent runtime:

const consumerId = process.env.CONSUMER_ID || process.env.HOSTNAME;
// Results in: dev-e-dotnet-6f7b9c-xk2p1

Each pod’s consumer ID is now:

  • Unique — no two live pods share an ID
  • Stable — the same pod always uses the same ID across message deliveries
  • Ephemeral — a pod restart produces a new ID, leaving the old PEL for the reclaim sweep

Abandoned PEL entries (from crashed pods) are reclaimed by the conductor’s PEL cleanup cron:

// Find entries pending longer than 5 minutes from consumers
// that are no longer alive (not in the active pod list)
const stale = await redis.xpending(
'rig:dispatch', 'dev-e-node',
'-', '+', 100
);
for (const entry of stale) {
if (entry.elapsedMs > 5 * 60 * 1000 && !activePods.has(entry.consumer)) {
await redis.xclaim('rig:dispatch', 'dev-e-node', 'reclaimer', 0, entry.id);
// Re-enqueue for live consumers
}
}

This means a pod that dies mid-task has its work reclaimed and re-dispatched within 5 minutes. Combined with rig-conductor’s optimistic concurrency on the Marten event stream, this gives at-least-once delivery at the Redis layer and exactly-once claim at the Marten layer.

ApproachSingle replicaMultiple replicasRestart safety
Static consumer ID✅ Works❌ Duplicates⚠️ Re-delivers PEL entries
HOSTNAME-based consumer ID✅ Works✅ Safe✅ New ID on restart

Rule: Never use a static consumer ID in a horizontally-scaled XREADGROUP setup. Always use a pod-unique identifier (pod name, pod UID, or a UUID generated at pod start).