Cost model: token attribution, cache-read economics, and KEDA scale-to-zero
The attribution problem
A fleet of LLM agents running 30+ issues/day generates thousands of API calls. Without attribution, you know the monthly bill but not which work drove it. The rig solves this with a three-layer attribution model:
```
Layer 1: Agent emits TOKEN_USAGE event after every CLI session
         { agentId, repo, issueNumber, inputTokens, outputTokens, cacheReadTokens, model }
Layer 2: Conductor's TokenUsageProjection aggregates by issue
         { issueKey, totalInputTokens, totalOutputTokens, totalCacheRead, estimatedCost }
Layer 3: CostProjection aggregates by day/agent/repo
         { date, agentId, repo, totalCost, issueCount }
```

The result: you can answer “what did issue rig-conductor#162 cost?” ($0.83, 14 minutes of Review-E time) and “what did Dev-E-node cost last Tuesday?” ($4.20 across 11 issues) from the same dashboard.
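The three layers can be sketched as two small aggregations over a list of usage events. This is an illustrative sketch, not the conductor's actual projection code: the `TokenUsage` class, the function names, and the `date` field are assumptions (the event field list above has no timestamp), and the price table assumes the $3/MTok input and $0.30/MTok cache-read rates used later in this section, with output priced at 5x input.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed USD prices per million tokens; substitute the real rates for your model.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00, "cache_read": 0.30}


@dataclass(frozen=True)
class TokenUsage:
    """Layer 1: one TOKEN_USAGE event per CLI session (hypothetical shape)."""
    agent_id: str
    repo: str
    issue_number: int
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    model: str
    date: str  # assumption: ISO day of the session, not in the listed event fields


def estimate_cost(event: TokenUsage) -> float:
    """Dollar cost of one session under the assumed price table."""
    return (
        event.input_tokens * PRICE_PER_MTOK["input"]
        + event.output_tokens * PRICE_PER_MTOK["output"]
        + event.cache_read_tokens * PRICE_PER_MTOK["cache_read"]
    ) / 1_000_000


def project_by_issue(events):
    """Layer 2: total estimated cost keyed by 'repo#issueNumber'."""
    totals = defaultdict(float)
    for event in events:
        totals[f"{event.repo}#{event.issue_number}"] += estimate_cost(event)
    return dict(totals)


def project_by_day(events):
    """Layer 3: (total cost, distinct issue count) keyed by (date, agent, repo)."""
    costs = defaultdict(float)
    issues = defaultdict(set)
    for event in events:
        key = (event.date, event.agent_id, event.repo)
        costs[key] += estimate_cost(event)
        issues[key].add(event.issue_number)
    return {key: (costs[key], len(issues[key])) for key in costs}
```

Both projections read the same event list, which is why a per-issue question and a per-agent-per-day question can be answered from one data source.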
Token types and their costs
Claude’s pricing model has three distinct token types with different costs:
| Token type | Description | Relative cost |
|---|---|---|
| Input tokens | Fresh context sent to the model | 1x baseline |
| Output tokens | Model-generated response | 5x baseline |
| Cache-read tokens | Prompt prefix served from Anthropic’s 5-min cache | ~0.1x baseline |
Cache-read tokens are the most interesting. Claude Code automatically caches large, stable context blocks (BRAIN.md, AGENTS.md, long system prompts) with a 5-minute TTL. When an agent session starts within 5 minutes of a previous session using the same context, the cached prefix is served at ~10% of the input token cost.
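To get a feel for how session spacing affects hit rate, the 5-minute rule can be modeled as a check on the gap between consecutive session starts. This is a deliberately simplified sketch: the function name and the use of start timestamps are assumptions, and the real cache behavior (for example, how long a session keeps the prefix warm) is not modeled.

```python
CACHE_TTL_SECONDS = 5 * 60  # the 5-minute TTL described above


def count_cache_hits(session_starts, ttl=CACHE_TTL_SECONDS):
    """Count sessions that start within `ttl` seconds of the previous session.

    `session_starts` is a list of start times in seconds for sessions that
    share the same context prefix. The first session is always a miss.
    """
    hits = 0
    previous = None
    for start in sorted(session_starts):
        if previous is not None and start - previous <= ttl:
            hits += 1
        previous = start
    return hits


# Five sessions: gaps of 120 s, 480 s, 200 s, and 1200 s.
starts = [0, 120, 600, 800, 2000]
hit_rate = count_cache_hits(starts) / len(starts)
```

Under this model, bursts of closely spaced sessions benefit from the cache, while sessions spread evenly across a day with gaps over five minutes would not.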
Cache economics example
BRAIN.md is ~27 KB ≈ 7,000 tokens. For a fleet running 30 issues/day:
```
Without cache:  30 issues × 7,000 tokens × $3/MTok    = $0.63/day
With cache hit: 30 issues × 7,000 tokens × $0.30/MTok = $0.063/day
Savings: ~$0.57/day = ~$17/month just on BRAIN.md reads
```

At $17/month, BRAIN.md cache savings cover a meaningful fraction of the fleet’s base operating cost. This is why the rig keeps BRAIN.md as a single stable document rather than splitting it: cache hit rate depends on prefix stability.
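The arithmetic above assumes every session hits the cache. A small sketch that makes the hit rate an explicit parameter shows how the saving shrinks when some sessions miss. The function name and the `hit_rate` parameter are illustrative assumptions; the prices are the ones used in the example.

```python
def brain_md_daily_cost(issues_per_day, prefix_tokens, hit_rate,
                        input_price=3.00, cache_read_price=0.30):
    """Daily USD cost of reading a shared prefix, given a cache hit rate in [0, 1].

    Prices are USD per million tokens. Cache-write surcharges are ignored,
    as in the example above.
    """
    hits = issues_per_day * hit_rate
    misses = issues_per_day - hits
    token_dollars = (misses * prefix_tokens * input_price
                     + hits * prefix_tokens * cache_read_price)
    return token_dollars / 1_000_000


no_cache = brain_md_daily_cost(30, 7_000, hit_rate=0.0)    # about $0.63/day
all_hits = brain_md_daily_cost(30, 7_000, hit_rate=1.0)    # about $0.063/day
monthly_savings = (no_cache - all_hits) * 30               # about $17/month
```

With a 50% hit rate the same call gives roughly half the saving, so the $17/month figure is best read as an upper bound.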
Scale-to-zero with KEDA
LLM agents incur token costs only while running sessions. Idle agents (waiting for issues) still cost compute (the pod keeps running) but consume no LLM tokens. The split:
| State | LLM cost | Compute cost |
|---|---|---|
| Running (session active) | High | Low (pod) |
| Polling (waiting for work) | Zero | Low (pod) |
| Scaled to zero | Zero | Zero |
For low-throughput agent variants (Dev-E-python, Dev-E-dotnet), long idle periods at 1 replica waste compute. KEDA (Kubernetes Event-Driven Autoscaling) can scale them to zero when the dispatch queue is empty and back to 1 (or N) when work arrives.
KEDA ScaledObject design
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-e-dotnet-scaler
spec:
  scaleTargetRef:
    name: dev-e-dotnet
  minReplicaCount: 0   # scale to zero when idle
  maxReplicaCount: 3   # burst to 3 on backlog
  triggers:
    - type: redis-streams
      metadata:
        address: redis.rig-conductor:6379
        stream: rig:dispatch
        consumerGroup: dev-e-dotnet
        pendingEntriesCount: "1"   # scale up when ≥1 pending entry
```

With this config:
- No dotnet issues → Dev-E-dotnet scales to 0 (zero compute cost)
- One dotnet issue arrives → scales to 1 within ~30 seconds
- 3+ dotnet issues pending → scales to 3
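The three outcomes above follow from a simple rule: one replica per pending entry, clamped between the minimum and maximum replica counts. The sketch below approximates that rule for illustration only. The function name is hypothetical, and real KEDA behavior also depends on polling interval, cooldown period, and the underlying autoscaler, none of which are modeled here.

```python
import math


def desired_replicas(pending, target_per_replica=1, min_replicas=0, max_replicas=3):
    """Approximate replica count for a given number of pending stream entries.

    Defaults mirror the config above: pendingEntriesCount "1",
    minReplicaCount 0, maxReplicaCount 3.
    """
    if pending <= 0:
        return min_replicas
    wanted = math.ceil(pending / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))
```

Raising `target_per_replica` would make each replica absorb more backlog before the next one starts, trading latency for lower compute cost.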
Status: KEDA scale-to-zero is planned, not yet deployed. The current fleet runs at fixed replicas: 1 per variant. KEDA is the next cost-reduction lever after the LiteLLM proxy (which enables hard budget caps).
Current cost estimates
Rough fleet-wide spend as of 2026-04-23:
| Line item | Estimate | Notes |
|---|---|---|
| LLM tokens (Dev-E fleet) | $5–10/day | Highly variable; complex epics cost more |
| LLM tokens (Review-E) | $2–4/day | Review sessions shorter than implementation |
| GCE VM (k3s host) | $0.50–1/day | n2-standard-4 or similar |
| Postgres | $0.10/day | Included in VM or managed service |
| Cloudflare Pages | $0 | Free tier for static sites |
| Total | ~$8–15/day | Order-of-magnitude only |
These are operator estimates, not figures metered by LiteLLM. Until the LiteLLM proxy is deployed, token counts come from TOKEN_USAGE events emitted by the agents themselves; they are trusted but not cryptographically verified against the Anthropic API response.