User story: hard cost ceiling — LiteLLM proxy with per-agent virtual keys

User story

As the rig operator

I want an enforced dollar ceiling on every agent’s spend, applied at the proxy layer rather than the agent layer, with per-agent virtual keys and hard 429s returned before the request reaches the LLM provider

So that one looping agent cannot burn the shared rate-limit budget for every other agent, and a future tenant can be onboarded with a hard per-key cap as a config row, not an engineering project.

Context

See whats-next whitepaper §Priority 3 and the source cost-framework.md whitepaper.

Today only the cheapest, lowest-guarantee layer exists: a TokenUsageProjection aggregates per-agent × per-repo cost after the fact. Layer 3, the proxy-level hard ceiling, is unbuilt. Without it, cost control is trust-based (“don’t let agents loop”), which is exactly the failure mode shared infrastructure cannot accept.

Honest caveat from the source whitepaper: LiteLLM issue #12905 shows user-level budgets are not enforced inside team configurations. The proxy is the primary defense, not an absolute one. Every LiteLLM upgrade needs a synthetic budget-overrun test.
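
The checks a synthetic budget-overrun test has to make can be sketched as a pure function over the observed probe results. This is a minimal sketch, not the rig's test: ProbeResult, check_overrun and max_overshoot_usd are hypothetical names, and the tolerance assumes the proxy compares already-recorded spend against the cap, so the request that crosses the cap may still be billed. Event emission to rig-conductor is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One synthetic request sent through the proxy with the test key (hypothetical type)."""
    status: int      # HTTP status returned by the proxy
    cost_usd: float  # cost billed by the provider for this request (0.0 if rejected)


def check_overrun(results: list[ProbeResult], cap_usd: float, max_overshoot_usd: float) -> list[str]:
    """Return a list of failures; an empty list means the ceiling held."""
    failures = []
    billed = 0.0
    blocked_at = None
    for i, r in enumerate(results):
        if r.status == 429:
            if blocked_at is None:
                blocked_at = i
            if r.cost_usd > 0:
                failures.append(f"request {i} was rejected with 429 but still billed")
        else:
            if blocked_at is not None:
                failures.append(f"request {i} went through after the key was blocked")
            billed += r.cost_usd
    if blocked_at is None:
        failures.append("no 429 observed: the budget was never enforced")
    # Assumption: one in-flight request may cross the cap before spend is recorded.
    if billed > cap_usd + max_overshoot_usd:
        failures.append(f"billed {billed:.2f} USD exceeds cap {cap_usd:.2f} USD plus tolerance")
    return failures
```

An empty result on the pinned LiteLLM version, followed by a non-empty result after an upgrade, is the signal that the #12905 class of regression has reappeared.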

Acceptance criteria

⏳ LiteLLM proxy deployed to the rig cluster. Helm release under apps/litellm/. Postgres-backed config (not in-memory) so virtual keys and budgets survive restarts.
⏳ Per-agent virtual keys — one key per agent (dev-e, review-e, macos-e) plus one per onboarded tenant project (Dashecorp today; a virtual key is added for each new tenant). Each key has a hard daily budget and an hourly rate limit.
⏳ Anthropic as the default backend. OpenAI and Gemini fallbacks are deferred per cost-framework.md: configure them through LiteLLM fallbacks only once multi-provider config exists.
⏳ Agent pods use virtual keys, not the raw Anthropic key — the Anthropic account key exists only in the LiteLLM config. Agent pods get their virtual key via SealedSecret.
⏳ Synthetic budget-overrun test — a dedicated CronJob deliberately exceeds a test key’s daily cap. CI asserts: (a) 429 fires, (b) event emitted to rig-conductor, (c) provider not billed past the cap. Run on every LiteLLM upgrade.
⏳ Weekly budget review projection — TokenUsageProjection extended with virtual_key dimension; per-key costs visible in the rig-conductor dashboard.
⏳ Pre-flight cost prediction (nice-to-have, Phase 2 of cost-framework.md) — cheap model (Haiku) estimates tokens before dispatch; abort if estimated cost exceeds the task’s budget fraction.
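
The pre-flight check in the last criterion reduces to a small piece of arithmetic once a token estimate exists. The sketch below assumes the cheap model has already returned estimated input and output token counts, and it reads "budget fraction" as a share of the key's daily budget. Price, estimated_cost_usd and should_dispatch are hypothetical names, and the prices in the usage example are placeholders, not provider rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per million tokens, supplied by the caller (hypothetical type)."""
    input_per_mtok: float
    output_per_mtok: float


def estimated_cost_usd(input_tokens: int, output_tokens: int, price: Price) -> float:
    """Convert a token estimate into a dollar estimate."""
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000


def should_dispatch(est_cost_usd: float, daily_budget_usd: float, task_fraction: float) -> bool:
    """Allow the task only if its estimate fits inside its share of the daily budget."""
    return est_cost_usd <= daily_budget_usd * task_fraction


# Placeholder prices for illustration only.
price = Price(input_per_mtok=3.0, output_per_mtok=15.0)
estimate = estimated_cost_usd(200_000, 20_000, price)
allowed = should_dispatch(estimate, daily_budget_usd=50.0, task_fraction=0.05)
```

The estimate is advisory: the proxy's hard ceiling still applies whether or not the pre-flight check lets the task through.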

What it unblocks

Multi-tenant onboarding. A new tenant becomes a LiteLLM virtual-key config row with daily_budget: 50 and models: [sonnet-4.6, haiku-4.5]. No per-tenant engineering.
Priority 1 Cilium egress policy (AC 5 of safety-foundation) can narrow its allowlist to the LiteLLM proxy service only — no direct LLM egress from agent pods. Defense in depth.
Cost attribution by tenant — TokenUsage events carry virtual_key / project dimension; billing becomes queryable rather than reconstructed.
Circuit breaker on 529 storms — the proxy is the natural place to implement “pause dispatch for 5 min after 3 consecutive 529s” without per-agent code.
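
The circuit-breaker rule quoted in the last item ("pause dispatch for 5 min after 3 consecutive 529s") can be sketched as a small state machine. This is a standalone illustration, not a LiteLLM feature: OverloadBreaker and its methods are hypothetical names, and the clock is injectable only so the behaviour can be tested without waiting.

```python
import time


class OverloadBreaker:
    """Pause dispatch after a run of consecutive 529 responses (hypothetical helper)."""

    def __init__(self, threshold: int = 3, pause_seconds: float = 300.0, clock=time.monotonic):
        self.threshold = threshold
        self.pause_seconds = pause_seconds
        self.clock = clock
        self.consecutive_529 = 0
        self.paused_until = 0.0

    def allow(self) -> bool:
        """True when dispatch is currently permitted."""
        return self.clock() >= self.paused_until

    def record(self, status: int) -> None:
        """Feed each upstream response status into the breaker."""
        if status == 529:
            self.consecutive_529 += 1
            if self.consecutive_529 >= self.threshold:
                # Open the breaker and start counting afresh after the pause.
                self.paused_until = self.clock() + self.pause_seconds
                self.consecutive_529 = 0
        else:
            # Any non-529 response breaks the streak.
            self.consecutive_529 = 0
```

A caller would check allow() before forwarding a request and call record() with each upstream status. Keeping one breaker at the proxy, rather than one per agent, is what removes the need for per-agent code.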

Out of scope

Cross-provider fallback routing (deferred per cost-framework.md; adopt when multi-provider is actually configured)
Prompt caching verification (Claude Code does this automatically; verify via cost dashboard, not a code change)
flagd/OpenFeature feature flags (YAGNI — Kustomize + env vars cover today)

Priority

High. Sequenced after Priority 2 because cost dashboards need the trace store; before Priority 4 because nightly quality-gate runs need a bounded budget to be safe.

Estimated effort

AC 1 (proxy deploy): ~1.5 days. HelmRelease + Postgres + SealedSecret for the master Anthropic key.
AC 2 (virtual keys + caps): ~3 days. LiteLLM admin UI or /key/generate API; document the key-rotation runbook.
AC 3 (Anthropic default, fallbacks deferred): ~1 day. Config only.
AC 4 (agents use virtual keys): ~2 days. SealedSecret rotation per agent; Helm chart plumbing.
AC 5 (synthetic overrun test): ~2 days. CronJob + assertion + alert. Load-bearing AC — the #12905 caveat requires this to validate reality.
AC 6 (budget-review projection): ~3 days. Marten projection + dashboard panel.
AC 7 (pre-flight prediction, nice-to-have): ~5 days. Defer if time is tight.

Total: ~2.5 weeks focused, ~3.5 weeks with AC 7.