Cost model: token attribution, cache-read economics, and KEDA scale-to-zero

The attribution problem

A fleet of LLM agents running 30+ issues/day generates thousands of API calls. Without attribution, you know the monthly bill but not which work drove it. The rig solves this with a three-layer attribution model:

Layer 1: Agent emits TOKEN_USAGE event after every CLI session
  { agentId, repo, issueNumber, inputTokens, outputTokens, cacheReadTokens, model }

Layer 2: Conductor's TokenUsageProjection aggregates by issue
  { issueKey, totalInputTokens, totalOutputTokens, totalCacheRead, estimatedCost }

Layer 3: CostProjection aggregates by day/agent/repo
  { date, agentId, repo, totalCost, issueCount }

The result: you can answer “what did issue rig-conductor#162 cost?” ($0.83, 14 minutes of Review-E time) and “what did Dev-E-node cost last Tuesday?” ($4.20 across 11 issues) from the same dashboard.
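Layer 2 can be pictured as a simple fold over TOKEN_USAGE events. The following is a minimal Python sketch, not the Conductor's actual TokenUsageProjection: the `TokenUsage` dataclass, the `project_by_issue` function, and the snake_case field names are hypothetical, the `repo#number` key format is inferred from the rig-conductor#162 example, and estimatedCost is left out because it depends on pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenUsage:
    """One TOKEN_USAGE event (Layer 1); fields mirror the event shape above."""
    agent_id: str
    repo: str
    issue_number: int
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    model: str


def project_by_issue(events):
    """Layer 2 sketch: sum token counts per issue key ("repo#number")."""
    totals = defaultdict(lambda: {
        "totalInputTokens": 0,
        "totalOutputTokens": 0,
        "totalCacheRead": 0,
    })
    for e in events:
        t = totals[f"{e.repo}#{e.issue_number}"]
        t["totalInputTokens"] += e.input_tokens
        t["totalOutputTokens"] += e.output_tokens
        t["totalCacheRead"] += e.cache_read_tokens
    return dict(totals)
```

Layer 3 would follow the same pattern with a (date, agentId, repo) key, assuming each event also carries a timestamp.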

Token types and their costs

Claude’s pricing distinguishes three token types that matter here, each billed at a different rate:

Token type	Description	Relative cost
Input tokens	Fresh context sent to the model	1x baseline
Output tokens	Model-generated response	5x baseline
Cache-read tokens	Prompt prefix served from Anthropic’s 5-min cache	~0.1x baseline

Cache-read tokens are the most interesting. Claude Code automatically caches large, stable context blocks (BRAIN.md, AGENTS.md, long system prompts) with a 5-minute TTL. When an agent session starts within 5 minutes of a previous session using the same context, the cached prefix is served at ~10% of the input token cost.
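The relative costs in the table translate directly into a per-session estimate. This is a rough sketch, assuming the three token counts are disjoint (cache-read tokens are not also counted as input tokens) and that the baseline price is supplied by the caller; `RELATIVE_COST` and `estimate_cost` are hypothetical names, not part of the rig.

```python
# Relative weights taken from the table above; the cache-read weight is the
# approximate ~0.1x figure, not an exact price.
RELATIVE_COST = {"input": 1.0, "output": 5.0, "cache_read": 0.1}


def estimate_cost(input_tokens, output_tokens, cache_read_tokens, baseline_per_mtok):
    """Estimate the dollar cost of one session.

    baseline_per_mtok is the price of one million fresh input tokens.
    """
    weighted_tokens = (
        input_tokens * RELATIVE_COST["input"]
        + output_tokens * RELATIVE_COST["output"]
        + cache_read_tokens * RELATIVE_COST["cache_read"]
    )
    return weighted_tokens * baseline_per_mtok / 1_000_000
```

For example, `estimate_cost(20_000, 4_000, 7_000, baseline_per_mtok=3.0)` prices a session with 20,000 fresh input tokens, 4,000 output tokens, and a 7,000-token cached prefix; the token counts here are illustrative, not measured.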

Cache economics example

BRAIN.md is ~27 KB ≈ 7,000 tokens. For a fleet running 30 issues/day:

Without cache: 30 issues × 7,000 tokens × $3/MTok = $0.63/day
With cache hit: 30 issues × 7,000 tokens × $0.30/MTok = $0.063/day
Savings: ~$0.57/day = ~$17/month just on BRAIN.md reads

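This figure is an upper bound: it assumes every session starts within the 5-minute TTL and hits the cache. Even so, at up to ~$17/month, BRAIN.md cache savings cover a meaningful fraction of the fleet’s base operating cost. This is why the rig keeps BRAIN.md as a single stable document rather than splitting it: the cache hit rate depends on prefix stability.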
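The arithmetic in this example can be parameterized by cache hit rate to see how the savings shrink when sessions miss the 5-minute window. This is a sketch under the example's own assumptions (one BRAIN.md read per issue, $3/MTok input, $0.30/MTok cache read); the `hit_rate` parameter and the function name are hypothetical, and any surcharge for writing to the cache is ignored.

```python
def daily_prefix_cost(issues_per_day, prefix_tokens, input_per_mtok,
                      cache_read_per_mtok, hit_rate):
    """Daily cost of reading a stable prefix once per issue.

    hit_rate is the fraction of reads served from cache (0.0 to 1.0).
    """
    tokens = issues_per_day * prefix_tokens
    hit_cost = tokens * hit_rate * cache_read_per_mtok
    miss_cost = tokens * (1.0 - hit_rate) * input_per_mtok
    return (hit_cost + miss_cost) / 1_000_000


no_cache = daily_prefix_cost(30, 7_000, 3.0, 0.30, hit_rate=0.0)
all_hits = daily_prefix_cost(30, 7_000, 3.0, 0.30, hit_rate=1.0)
half_hits = daily_prefix_cost(30, 7_000, 3.0, 0.30, hit_rate=0.5)
monthly_savings = (no_cache - all_hits) * 30
```

With a 100% hit rate this reproduces the $0.63 and $0.063 daily figures; at a 50% hit rate the daily cost is about $0.35, so the savings are roughly halved.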

Scale-to-zero with KEDA

LLM agents incur token costs only while a session is running. An idle agent (waiting for issues) still costs compute, because its pod keeps running, but consumes no LLM tokens. The split:

State	LLM cost	Compute cost
Running (session active)	High	Low (pod)
Polling (waiting for work)	Zero	Low (pod)
Scaled to zero	Zero	Zero

For low-throughput agent variants (Dev-E-python, Dev-E-dotnet), long idle periods at 1 replica waste compute. KEDA (Kubernetes Event-Driven Autoscaling) can scale them to zero when the dispatch queue is empty and back to 1 (or N) when work arrives.

KEDA ScaledObject design

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dev-e-dotnet-scaler
spec:
  scaleTargetRef:
    name: dev-e-dotnet
  minReplicaCount: 0   # scale to zero when idle
  maxReplicaCount: 3   # burst to 3 on backlog
  triggers:
    - type: redis-streams
      metadata:
        address: redis.rig-conductor:6379
        stream: rig:dispatch
        consumerGroup: dev-e-dotnet
        pendingEntriesCount: "1"  # target of 1 pending entry per replica; any backlog triggers scale-up

With this config:

No dotnet issues → Dev-E-dotnet scales to 0 (zero compute cost)
One dotnet issue arrives → scales to 1 within ~30 seconds
3+ dotnet issues pending → scales to 3
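The three outcomes above follow from a target-per-replica calculation clamped to the configured bounds. The sketch below is a simplified model of that behavior, not KEDA's actual implementation: real scaling also depends on polling intervals, cooldown periods, and the autoscaler's stabilization behavior, and the function name is hypothetical.

```python
import math


def desired_replicas(pending_entries, target_per_replica=1,
                     min_replicas=0, max_replicas=3):
    """Simplified replica count for a backlog-driven scaler.

    Defaults mirror the ScaledObject above: pendingEntriesCount "1",
    minReplicaCount 0, maxReplicaCount 3.
    """
    if pending_entries <= 0:
        # Empty queue: fall back to the floor (zero means scale-to-zero).
        return min_replicas
    wanted = math.ceil(pending_entries / target_per_replica)
    return max(min_replicas, min(wanted, max_replicas))
```

Raising `target_per_replica` makes each replica absorb more backlog before another one is added, which trades latency for lower compute cost.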

Status: KEDA scale-to-zero is planned, not yet deployed. The current fleet runs at fixed replicas: 1 per variant. KEDA is the next cost-reduction lever after the LiteLLM proxy (which enables hard budget caps).

Current cost estimates

Rough fleet-wide spend as of 2026-04-23:

Line item	Estimate	Notes
LLM tokens (Dev-E fleet)	$5–10/day	Highly variable; complex epics cost more
LLM tokens (Review-E)	$2–4/day	Review sessions shorter than implementation
GCE VM (k3s host)	$0.50–1.00/day	n2-standard-4 or similar
Postgres	$0.10/day	Included in VM or managed service
Cloudflare Pages	$0	Free tier for static sites
Total	~$8–15/day	Order-of-magnitude only

These are operator estimates, not metered figures. Until the LiteLLM proxy is deployed, token counts come from TOKEN_USAGE events emitted by the agents themselves: trusted, but not cryptographically verified against the Anthropic API response.
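As a quick consistency check, the total row can be recomputed from the line items. This sketch copies the ranges from the table and reads the upper bound of the VM row as $1/day, which is an assumption; `LINE_ITEMS` and `total_range` are hypothetical names.

```python
# (low, high) estimates in dollars per day, copied from the table above.
LINE_ITEMS = {
    "LLM tokens (Dev-E fleet)": (5.00, 10.00),
    "LLM tokens (Review-E)": (2.00, 4.00),
    "GCE VM (k3s host)": (0.50, 1.00),  # assumption: upper bound read as $1/day
    "Postgres": (0.10, 0.10),
    "Cloudflare Pages": (0.00, 0.00),
}


def total_range(items):
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return round(low, 2), round(high, 2)


fleet_low, fleet_high = total_range(LINE_ITEMS)
```

The line items sum to $7.60–15.10/day, which the table rounds to ~$8–15/day.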