Skip to content

User story: hard cost ceiling — LiteLLM proxy with per-agent virtual keys

As the rig operator

I want an enforced dollar ceiling on every agent’s hourly spend — at the proxy layer, not the agent layer — with per-agent virtual keys and hard 429s before the request reaches the LLM provider

So that one looping agent cannot burn the shared rate-limit budget for every other agent, and a future tenant can be onboarded with a hard per-key cap as a config row, not an engineering project.

See whats-next whitepaper §Priority 3 and the source cost-framework.md whitepaper.

Today only the cheapest, lowest-guarantee layer exists: a TokenUsageProjection aggregates per-agent × per-repo cost after the fact. Layer 3 — proxy-level hard ceiling — is unbuilt. Without it, cost control is trust-based (“don’t let agents loop”), which is exactly the failure mode a shared infrastructure cannot accept.

Honest caveat from the source whitepaper: LiteLLM issue #12905 shows user-level budgets are not enforced inside team configurations. The proxy is the primary defense, not an absolute one. Every LiteLLM upgrade needs a synthetic budget-overrun test.

  1. LiteLLM proxy deployed to the rig cluster. Helm release under apps/litellm/. Postgres-backed config (not in-memory) so virtual keys and budgets survive restarts.
  2. Per-agent virtual keys — one key per agent (dev-e, review-e, macos-e) plus one per onboarded tenant project (Dashecorp today; a virtual key is added for each new tenant). Each key has a hard daily budget and an hourly rate limit.
  3. Anthropic as the default backend, OpenAI and Gemini configured as fallbacks via LiteLLM fallback_models (deferred per cost-framework.md — enable only when multi-provider config exists).
  4. Agent pods use virtual keys, not the raw Anthropic key — the Anthropic account key exists only in the LiteLLM config. Agent pods get their virtual key via SealedSecret.
  5. Synthetic budget-overrun test — a dedicated CronJob deliberately exceeds a test key’s daily cap. CI asserts: (a) 429 fires, (b) event emitted to rig-conductor, (c) provider not billed past the cap. Run on every LiteLLM upgrade.
  6. Weekly budget review projectionTokenUsageProjection extended with virtual_key dimension; per-key costs visible in the rig-conductor dashboard.
  7. Pre-flight cost prediction (nice-to-have, Phase 2 of cost-framework.md) — cheap model (Haiku) estimates tokens before dispatch; abort if estimated cost exceeds the task’s budget fraction.
  • Multi-tenant onboarding. A new tenant becomes a LiteLLM virtual-key config row with daily_budget: 50 and models: [sonnet-4.6, haiku-4.5]. No per-tenant engineering.
  • Priority 1 Cilium egress policy (AC 5 of safety-foundation) can narrow its allowlist to the LiteLLM proxy service only — no direct LLM egress from agent pods. Defense in depth.
  • Cost attribution by tenantTokenUsage events carry virtual_key / project dimension; billing becomes queryable rather than reconstructed.
  • Circuit breaker on 529 storms — the proxy is the natural place to implement “pause dispatch for 5 min after 3 consecutive 529s” without per-agent code.
  • Cross-provider fallback routing (deferred per cost-framework.md; adopt when multi-provider is actually configured)
  • Prompt caching verification (Claude Code does this automatically; verify via cost dashboard, not a code change)
  • flagd/OpenFeature feature flags (YAGNI — Kustomize + env vars cover today)

High. Sequenced after Priority 2 because cost dashboards need the trace store; before Priority 4 because nightly quality-gate runs need a bounded budget to be safe.

  • AC 1 (proxy deploy): ~5 days. HelmRelease + Postgres + SealedSecret for the master Anthropic key.
  • AC 2 (virtual keys + caps): ~3 days. LiteLLM admin UI or /key/generate API; document the key-rotation runbook.
  • AC 3 (Anthropic default, fallbacks deferred): ~1 day. Config only.
  • AC 4 (agents use virtual keys): ~2 days. SealedSecret rotation per agent; helm chart plumb.
  • AC 5 (synthetic overrun test): ~2 days. CronJob + assertion + alert. Load-bearing AC — the #12905 caveat requires this to validate reality.
  • AC 6 (budget-review projection): ~3 days. Marten projection + dashboard panel.
  • AC 7 (pre-flight prediction, nice-to-have): ~5 days. Defer if time pressured.

Total: ~2.5 weeks focused, ~3.5 weeks with AC 7.