Cost model: token attribution, cache-read economics, and KEDA scale-to-zero

A fleet of LLM agents running 30+ issues/day generates thousands of API calls. Without attribution, you know the monthly bill but not which work drove it. The rig solves this with a three-layer attribution model:

Layer 1: Agent emits a TOKEN_USAGE event after every CLI session
    { agentId, repo, issueNumber, inputTokens, outputTokens, cacheReadTokens, model }
Layer 2: Conductor's TokenUsageProjection aggregates by issue
    { issueKey, totalInputTokens, totalOutputTokens, totalCacheRead, estimatedCost }
Layer 3: CostProjection aggregates by day/agent/repo
    { date, agentId, repo, totalCost, issueCount }

The result: you can answer “what did issue rig-conductor#162 cost?” ($0.83, 14 minutes of Review-E time) and “what did Dev-E-node cost last Tuesday?” ($4.20 across 11 issues) from the same dashboard.
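The three layers can be sketched as a pair of in-memory aggregations. This is a minimal illustration, not the Conductor's actual projection code: the `TokenUsage` class, the `date` field, the function names, and the `RATES_PER_MTOK` table are all assumptions made for the example (the rates take the $3/MTok input figure used later in this section and apply the 5x and 0.1x multipliers).

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed rates in USD per million tokens: $3 input, output at 5x,
# cache reads at 0.1x of the input baseline.
RATES_PER_MTOK = {"input": 3.00, "output": 15.00, "cache_read": 0.30}


@dataclass(frozen=True)
class TokenUsage:
    """One TOKEN_USAGE event (Layer 1). `date` is a hypothetical extra field."""
    agentId: str
    repo: str
    issueNumber: int
    inputTokens: int
    outputTokens: int
    cacheReadTokens: int
    model: str
    date: str  # ISO date of the session, needed for the per-day rollup


def estimate_cost(e: TokenUsage) -> float:
    """Price a single event from its three token counts."""
    return (
        e.inputTokens * RATES_PER_MTOK["input"]
        + e.outputTokens * RATES_PER_MTOK["output"]
        + e.cacheReadTokens * RATES_PER_MTOK["cache_read"]
    ) / 1_000_000


def project_by_issue(events):
    """Layer 2: estimated cost keyed by 'repo#issueNumber'."""
    totals = defaultdict(float)
    for e in events:
        totals[f"{e.repo}#{e.issueNumber}"] += estimate_cost(e)
    return dict(totals)


def project_by_day(events):
    """Layer 3: total cost and distinct issue count per (date, agentId, repo)."""
    cost = defaultdict(float)
    issues = defaultdict(set)
    for e in events:
        key = (e.date, e.agentId, e.repo)
        cost[key] += estimate_cost(e)
        issues[key].add(e.issueNumber)
    return {k: {"totalCost": cost[k], "issueCount": len(issues[k])} for k in cost}
```

Both dashboard questions then become lookups: `project_by_issue(events)["rig-conductor#162"]` for a single issue, and `project_by_day(events)` filtered by agent and date for a daily total.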

Claude’s pricing model has three distinct token types with different costs:

| Token type | Description | Relative cost |
| --- | --- | --- |
| Input tokens | Fresh context sent to the model | 1x baseline |
| Output tokens | Model-generated response | 5x baseline |
| Cache-read tokens | Prompt prefix served from Anthropic’s 5-min cache | ~0.1x baseline |

Cache-read tokens are the most interesting. Claude Code automatically caches large, stable context blocks (BRAIN.md, AGENTS.md, long system prompts) with a 5-minute TTL. When an agent session starts within 5 minutes of a previous session using the same context, the cached prefix is served at ~10% of the input token cost.
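To get a feel for how often the 5-minute window is actually hit, one rough approach is to count sessions that start within the TTL of the previous one. The sketch below is a simplification under stated assumptions: `count_cache_hits` is a hypothetical helper, each session is treated as a single instant, and all sessions are assumed to share the same cached prefix.

```python
CACHE_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


def count_cache_hits(session_starts, ttl=CACHE_TTL_SECONDS):
    """Count sessions that start within `ttl` seconds of the previous session.

    `session_starts` holds start times in seconds (any epoch). Each session is
    modelled as one instant, which is a simplification.
    """
    starts = sorted(session_starts)
    return sum(1 for prev, cur in zip(starts, starts[1:]) if cur - prev <= ttl)
```

For example, sessions starting at 0 s, 120 s, 400 s, and 1000 s give two hits: the second and third sessions fall inside the window, and the fourth arrives 600 s after the third and misses.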

BRAIN.md is ~27 KB ≈ 7,000 tokens. For a fleet running 30 issues/day:

Without cache: 30 issues × 7,000 tokens × $3/MTok = $0.63/day
With cache hit: 30 issues × 7,000 tokens × $0.30/MTok = $0.063/day
Savings: ~$0.57/day = ~$17/month just on BRAIN.md reads

At about $17/month, the BRAIN.md cache savings are comparable to the fleet’s fixed infrastructure cost (VM plus Postgres, roughly $18–33/month by the estimates in this section). This is why the rig keeps BRAIN.md as a single stable document rather than splitting it: cache hit rate depends on prefix stability.
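The two figures above are the 0% and 100% hit-rate endpoints. A small helper makes the in-between cases explicit. This is a sketch: `brain_md_daily_cost` and its parameters are names chosen for the example, it assumes one prefix read per issue as the arithmetic above does, and like those figures it ignores any cost of writing the cache entry.

```python
def brain_md_daily_cost(issues_per_day, prefix_tokens, hit_rate,
                        input_rate=3.00, cache_read_rate=0.30):
    """Daily USD cost of sending the prefix, for a cache hit rate in [0, 1].

    Rates are USD per million tokens. Assumes one prefix read per issue.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    mtok = issues_per_day * prefix_tokens / 1_000_000
    blended_rate = hit_rate * cache_read_rate + (1 - hit_rate) * input_rate
    return mtok * blended_rate
```

`brain_md_daily_cost(30, 7_000, 0.0)` and `brain_md_daily_cost(30, 7_000, 1.0)` reproduce the $0.63 and $0.063 figures. At a 50% hit rate the daily cost is about $0.35, so the monthly saving drops to roughly half of the $17 best case.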

LLM agents incur token costs only while running sessions. Idle agents (waiting for issues) still cost compute, because the pod keeps running, but consume no LLM tokens. The split:

| State | LLM cost | Compute cost |
| --- | --- | --- |
| Running (session active) | High | Low (pod) |
| Polling (waiting for work) | Zero | Low (pod) |
| Scaled to zero | Zero | Zero |

For low-throughput agent variants (Dev-E-python, Dev-E-dotnet), long idle periods at 1 replica waste compute. KEDA (Kubernetes Event-Driven Autoscaling) can scale them to zero when the dispatch queue is empty and back to 1 (or N) when work arrives.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-e-dotnet-scaler
spec:
  scaleTargetRef:
    name: dev-e-dotnet
  minReplicaCount: 0   # scale to zero when idle
  maxReplicaCount: 3   # burst to 3 on backlog
  triggers:
    - type: redis-streams
      metadata:
        address: redis.rig-conductor:6379
        stream: rig:dispatch
        consumerGroup: dev-e-dotnet
        pendingEntriesCount: "1"   # scale up when ≥1 pending entry

With this config:

  • No dotnet issues → Dev-E-dotnet scales to 0 (zero compute cost)
  • One dotnet issue arrives → scales to 1 within ~30 seconds
  • 3+ dotnet issues pending → scales to 3
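The three outcomes above follow from a simple steady-state rule: divide the backlog by the per-replica target, round up, and clamp to the configured bounds. The function below is an approximation written for illustration, not KEDA's implementation. `desired_replicas` is a hypothetical name, and the model ignores polling delay and any cooldown before scaling back to zero.

```python
import math


def desired_replicas(pending_entries, target_per_replica=1,
                     min_replicas=0, max_replicas=3):
    """Approximate steady-state replica count for a queue-length trigger.

    Defaults mirror the ScaledObject above: target 1, min 0, max 3.
    """
    if pending_entries <= 0:
        return min_replicas
    wanted = math.ceil(pending_entries / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

With the defaults, 0 pending entries gives 0 replicas, 1 gives 1, and anything from 3 upward is capped at 3. One point to verify before deploying: in Redis terms, pending entries are those already delivered to a consumer but not yet acknowledged, so it is worth checking against the KEDA redis-streams scaler documentation that this trigger also fires for undelivered entries when no consumer is running.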

Status: KEDA scale-to-zero is planned, not yet deployed. The current fleet runs at fixed replicas: 1 per variant. KEDA is the next cost-reduction lever after the LiteLLM proxy (which enables hard budget caps).

Rough fleet-wide spend as of 2026-04-23:

| Line item | Estimate | Notes |
| --- | --- | --- |
| LLM tokens (Dev-E fleet) | $5–10/day | Highly variable; complex epics cost more |
| LLM tokens (Review-E) | $2–4/day | Review sessions shorter than implementation |
| GCE VM (k3s host) | $0.50–1/day | n2-standard-4 or similar |
| Postgres | $0.10/day | Included in VM or managed service |
| Cloudflare Pages | $0 | Free tier for static sites |
| Total | ~$8–15/day | Order-of-magnitude only |

These are operator estimates, not figures metered through LiteLLM. Until the LiteLLM proxy is deployed, token counts come from TOKEN_USAGE events emitted by the agents themselves: trusted, but not cryptographically verified against the Anthropic API response.
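As a sanity check on the total row, the line items can be summed directly. The snippet below copies the ranges from the table, reading the VM range as $0.50 to $1 per day. `LINE_ITEMS`, `daily_range`, and `monthly_range` are names invented for this example, and the 30-day month is an assumption.

```python
# (low, high) USD per day, copied from the spend table above.
LINE_ITEMS = {
    "LLM tokens (Dev-E fleet)": (5.00, 10.00),
    "LLM tokens (Review-E)": (2.00, 4.00),
    "GCE VM (k3s host)": (0.50, 1.00),
    "Postgres": (0.10, 0.10),
    "Cloudflare Pages": (0.00, 0.00),
}


def daily_range(items):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


def monthly_range(items, days=30):
    """Scale the daily range to a month of `days` days."""
    low, high = daily_range(items)
    return low * days, high * days
```

The daily sum comes to about $7.60–15.10, which matches the rounded ~$8–15/day total, and about $228–453 over a 30-day month.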