Per-pod consumer partitioning: XREADGROUP gotcha and the HOSTNAME fix
Background
Rig-conductor uses Redis Streams with consumer groups (XREADGROUP) to distribute work from the event stream to agent pods. The pattern is:
- Conductor appends an `ISSUE_ASSIGNED` event to stream `rig:dispatch`
- Dev-E pods call `XREADGROUP GROUP dev-e-node <consumer-id> COUNT 1 BLOCK 5000 STREAMS rig:dispatch >`
- The `>` means “give me only new, undelivered messages”
- After processing, the pod calls `XACK rig:dispatch dev-e-node <message-id>`
This is a solid pattern for at-least-once delivery, with each new message handed to a single consumer in the group. It has one critical caveat.
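To make the read/ack cycle above concrete, here is a minimal sketch of the two pieces a pod needs around the XREADGROUP call: the argument list and a parser for the reply. It assumes a client that returns the raw Redis reply as nested arrays (as ioredis does). `buildReadArgs` and `parseReadReply` are hypothetical helper names, and the `type` and `issue` fields mentioned in the comments are made up for illustration.

```javascript
// Builds the argument list for the blocking read described above.
// Group and stream names mirror the article: dev-e-node and rig:dispatch.
function buildReadArgs(consumerId) {
  return ['GROUP', 'dev-e-node', consumerId,
          'COUNT', '1', 'BLOCK', '5000',
          'STREAMS', 'rig:dispatch', '>'];
}

// Flattens a raw XREADGROUP reply into { stream, id, fields } objects.
// Assumed raw shape: [[stream, [[id, [k1, v1, k2, v2, ...]], ...]], ...]
// A BLOCK timeout yields null, which becomes an empty list here.
function parseReadReply(reply) {
  const messages = [];
  for (const [stream, entries] of reply ?? []) {
    for (const [id, kv] of entries) {
      const fields = {};
      const pairs = kv ?? [];
      // Field names such as "type" or "issue" are hypothetical examples.
      for (let i = 0; i + 1 < pairs.length; i += 2) {
        fields[pairs[i]] = pairs[i + 1];
      }
      messages.push({ stream, id, fields });
    }
  }
  return messages;
}
```

With an ioredis-style client this would be used roughly as `parseReadReply(await redis.xreadgroup(...buildReadArgs(consumerId)))`, followed by an XACK for each `id` once processing succeeds.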
The gotcha: static consumer IDs with replicas > 1
Redis consumer groups track a pending-entries list (PEL) per consumer. Each consumer ID has its own list of messages delivered but not yet ACKed. If a consumer disconnects before ACKing, its PEL entries remain and can be reclaimed.
The problem emerges when multiple pods share the same consumer ID:
```
Pod A: CONSUMER_ID=dev-e-worker → receives message M1
Pod B: CONSUMER_ID=dev-e-worker → also registered as dev-e-worker

# Redis sees one consumer named dev-e-worker with M1 in its PEL
# Pod B (re)connects and replays the pending entries for dev-e-worker
# M1 is still un-ACKed by Pod A, so Redis hands it to Pod B as well
# Result: duplicate processing
```

Redis cannot tell the two pods apart: the PEL belongs to the consumer name, not to the connection that received the message.

In practice, this manifested as two Dev-E-dotnet pods both working the same issue, producing two branches and two PRs. The second branch hit a name collision and got a -2 suffix. Review-E was assigned both PRs, and the operator had to close one manually.
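The shared-PEL behavior can be reproduced without Redis. The sketch below is a toy in-memory model, not Redis itself: it only mirrors the rule that pending entries are keyed by consumer name. It assumes the second pod replays its pending history on startup (the equivalent of reading with ID `0` instead of `>`). `ToyGroup` and `simulateStartupReplay` are hypothetical names.

```javascript
// Toy model of one consumer group. It mirrors a single Redis rule:
// pending entries are tracked per consumer NAME.
class ToyGroup {
  constructor(ids) {
    this.undelivered = [...ids]; // entries never delivered to anyone
    this.pel = new Map();        // consumer name -> Set of pending ids
  }

  pendingFor(name) {
    if (!this.pel.has(name)) this.pel.set(name, new Set());
    return this.pel.get(name);
  }

  // Like XREADGROUP ... > : deliver one new entry and record it as pending.
  readNew(name) {
    const id = this.undelivered.shift();
    if (id === undefined) return [];
    this.pendingFor(name).add(id);
    return [id];
  }

  // Like XREADGROUP ... 0 : replay what is pending for this name.
  readPending(name) {
    return [...this.pendingFor(name)];
  }

  // Like XACK: remove the id from whichever PEL holds it.
  ack(id) {
    for (const ids of this.pel.values()) ids.delete(id);
  }
}

// Pod A takes M1; Pod B starts and replays its pending list before A ACKs.
function simulateStartupReplay(nameA, nameB) {
  const group = new ToyGroup(['M1']);
  const seenByA = group.readNew(nameA);
  const seenByB = group.readPending(nameB);
  group.ack('M1');
  return { seenByA, seenByB };
}
```

Calling `simulateStartupReplay('dev-e-worker', 'dev-e-worker')` gives both pods M1, while `simulateStartupReplay('pod-a', 'pod-b')` leaves Pod B with nothing to replay.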
Why it’s subtle
This doesn’t show up with `replicas: 1`. It only manifests when:
- Two pods are alive simultaneously AND sharing a consumer ID, OR
- A pod restarts and the new instance inherits the old pod’s consumer ID before the PEL reclaim timeout fires
The second case is especially tricky: after a restart, the new pod may immediately pick up the previous pod’s unACKed message. Even without two concurrent pods, a restart loop can cause repeated delivery of the same message.
The fix: HOSTNAME-based consumer IDs
Kubernetes sets HOSTNAME to the pod name, which is unique and stable within a pod’s lifetime. Using it as the consumer ID gives each pod a distinct identity in the consumer group. The manifest injects the pod name explicitly through the downward API, and the runtime falls back to HOSTNAME if that variable is missing:

```yaml
# In HelmRelease values or Deployment env:
env:
  - name: CONSUMER_ID
    valueFrom:
      fieldRef:
        fieldPath: metadata.name
```

In the agent runtime:

```javascript
const consumerId = process.env.CONSUMER_ID || process.env.HOSTNAME;
// Results in: dev-e-dotnet-6f7b9c-xk2p1
```

Each pod’s consumer ID is now:
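- Unique: no two live pods share an ID
- Stable: the same pod always uses the same ID across message deliveries
- Ephemeral: a replacement pod created by a Deployment gets a new name and therefore a new ID, leaving the old PEL for the reclaim sweep (a container restart inside the same pod keeps the name and the ID)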
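One way to keep a misconfigured pod from quietly joining the group under a shared or empty name is to resolve the ID once at startup and fail fast. This is a minimal sketch: `resolveConsumerId` is a hypothetical helper, and throwing when both variables are missing is a design choice of this example rather than something the agent runtime is known to do.

```javascript
// Resolves the consumer ID once at startup.
// Lookup order mirrors the article: explicit CONSUMER_ID first, then HOSTNAME.
// Throwing when both are missing is this sketch's choice (hypothetical).
function resolveConsumerId(env = process.env) {
  const id = (env.CONSUMER_ID || env.HOSTNAME || '').trim();
  if (id === '') {
    throw new Error(
      'No CONSUMER_ID or HOSTNAME set; refusing to join the consumer group'
    );
  }
  return id;
}
```

Passing the environment as a parameter keeps the function easy to test; in the runtime it would simply be called as `resolveConsumerId()`.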
The reclaim sweep
Abandoned PEL entries (from crashed pods) are reclaimed by the conductor’s PEL cleanup cron:

```javascript
// Find entries pending longer than 5 minutes from consumers
// that are no longer alive (not in the active pod list).
// Assumes an ioredis-style client, whose extended XPENDING reply is a list of
// [id, consumer, idleMs, deliveryCount] tuples; activePods is a Set of pod names.
const STALE_MS = 5 * 60 * 1000;
const stale = await redis.xpending('rig:dispatch', 'dev-e-node', '-', '+', 100);
for (const [id, consumer, idleMs] of stale) {
  if (idleMs > STALE_MS && !activePods.has(consumer)) {
    // Passing STALE_MS as min-idle-time makes Redis skip the entry if it was
    // claimed or re-delivered after the XPENDING call.
    await redis.xclaim('rig:dispatch', 'dev-e-node', 'reclaimer', STALE_MS, id);
    // Re-enqueue for live consumers
  }
}
```

This means a pod that dies mid-task has its work reclaimed and re-dispatched once the entry has been idle for 5 minutes and the next sweep runs. Combined with rig-conductor’s optimistic concurrency on the Marten event stream, this gives at-least-once delivery at the Redis layer and exactly-once claim at the Marten layer.
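The selection rule in the sweep (idle longer than the threshold and owned by a consumer that is no longer alive) can be isolated as a pure function, which makes it testable without a Redis instance. This is a sketch under the same assumption about the reply shape, namely `[id, consumer, idleMs, deliveryCount]` tuples; `selectStale` is a hypothetical helper name.

```javascript
// Picks the pending entries that are safe to reclaim.
// `pending` uses the tuple shape of the raw extended XPENDING reply:
//   [id, consumer, idleMs, deliveryCount]
// `activePods` is a Set of live pod names; how it is built is out of scope.
function selectStale(pending, activePods, staleMs = 5 * 60 * 1000) {
  return pending
    .filter(([, consumer, idleMs]) =>
      idleMs > staleMs && !activePods.has(consumer))
    .map(([id, consumer, idleMs]) => ({ id, consumer, idleMs }));
}
```

Entries owned by a live pod are left alone even when they are old, since a long-running task is not the same as an abandoned one.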
Summary
| Approach | Single replica | Multiple replicas | Restart safety |
|---|---|---|---|
| Static consumer ID | ✅ Works | ❌ Duplicates | ⚠️ Re-delivers PEL entries |
| HOSTNAME-based consumer ID | ✅ Works | ✅ Safe | ✅ New ID on restart |
Rule: Never use a static consumer ID in a horizontally-scaled XREADGROUP setup. Always use a pod-unique identifier (pod name, pod UID, or a UUID generated at pod start).