LiteLLM passthrough spike for AC 5 redesign — findings
TL;DR. A 30-minute live spike on the rig cluster confirms that LiteLLM’s `/anthropic/{endpoint}` passthrough (present in `main-stable`, not in `main-v1.52.0-stable`) does reach api.anthropic.com: we got a real Anthropic `request_id` back on a fake-key 401, proving the request was forwarded upstream rather than rejected by LiteLLM. However, error responses are wrapped in LiteLLM’s `{"error": {"message": "<escaped upstream JSON>", ...}}` envelope rather than returned as native Anthropic error JSON. That is a real compatibility risk for Claude Code’s error handling (retries, rate-limit backoff, 529 detection). Success-path fidelity (prompt caching, SSE streaming, tool_use) could not be verified without a real API key, so the next spike needs one. Recommendation: before committing the AC 5 redesign to the LiteLLM path, run a second spike with a real key and a real Claude Code session; if error wrapping turns out to break retries, fall back to an Envoy egress gateway (less featureful but truly transparent).
What this spike answered and what it didn’t
| Question | Answer from spike |
|---|---|
| Does LiteLLM have a passthrough route for Anthropic in 2026? | ✅ /anthropic/{endpoint} in main-stable. Not in older v1.52.0-stable. |
| Does that route actually forward to api.anthropic.com? | ✅ Real Anthropic request_id on fake-key 401 |
| Does it preserve cache_control in the forwarded body? | ⏳ Likely yes (the request was forwarded, but upstream rejected on auth before validating the body); requires a real-key round-trip to check cache_read_input_tokens in the response |
| Does it preserve tools and native tool-use schema? | ⏳ Same: request forwarded, upstream reached, can’t verify response-side without real key |
| Does it preserve SSE streaming frame-for-frame? | ⏳ Not testable with auth-failed request; need real-key stream |
| Does it return Anthropic’s native error JSON on error? | ❌ No — errors are wrapped in LiteLLM’s {"error": {"message": "<escaped>", ...}} envelope |
Spike setup
- `litellm-spike` namespace on the rig k3s cluster (torn down at end)
- Two images tested:
  - `ghcr.io/berriai/litellm:main-v1.52.0-stable`: only exposes `/v1/messages`, which is the Anthropic-compatible endpoint (it goes through LiteLLM’s `llms/anthropic/chat/handler.py`, i.e. the adapter, not passthrough). Rejected `claude-sonnet-4-20250514` with `Invalid model name`. Not viable for Claude Code.
  - `ghcr.io/berriai/litellm:main-stable`: exposes `/anthropic/{endpoint}`, `/vertex_ai/{endpoint}`, `/bedrock/{endpoint}`, `/gemini/{endpoint}`, `/openai_passthrough/{endpoint}`. This is the real passthrough surface.
- Config: minimal `model_list` with `"anthropic/*"` and `"claude-*"` wildcard entries plus `master_key: sk-spike-dummy`. `ANTHROPIC_API_KEY` set to a dummy value.
- Tested via an in-cluster `curlimages/curl` pod.
Concrete evidence
1. Request reaches Anthropic
POST /anthropic/v1/messages
body: { model: claude-sonnet-4-20250514, max_tokens: 100, system: [{type: text, text: ..., cache_control: {type: ephemeral}}], messages: [...], tools: [...] }
response: 401
body: {"error":{"message":"{\"type\":\"error\",\"error\":{\"type\":\"authentication_error\", \"message\":\"invalid x-api-key\"},\"request_id\":\"req_011CaJd13qvypQ2wop9kCqrn\"}", "type":"None","param":"None","code":"401"}}
Two things to note:
- `request_id: req_011CaJd13qvypQ2wop9kCqrn`: that prefix is Anthropic’s, not LiteLLM’s. The request was forwarded upstream. This eliminates the concern that LiteLLM was rejecting `cache_control` or `tools` before forwarding.
- The outer envelope is LiteLLM’s. Anthropic’s native error JSON is escaped as a string inside LiteLLM’s wrapper. A client that parses `response.json().error.type === "rate_limit_error"` to trigger backoff will miss the signal: it sees `error.type === "None"` instead, with the real type buried in `error.message` as an escaped string.
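The double encoding can be undone mechanically. The following is a minimal Python sketch, assuming the envelope always has the shape captured in this spike; `unwrap_litellm_error` is a hypothetical helper name, not a LiteLLM API, and bodies that do not match are returned unchanged.

```python
import json

# Response body captured in the spike (fake-key 401 via /anthropic/v1/messages).
WRAPPED = (
    r'{"error":{"message":"{\"type\":\"error\",\"error\":'
    r'{\"type\":\"authentication_error\", \"message\":\"invalid x-api-key\"},'
    r'\"request_id\":\"req_011CaJd13qvypQ2wop9kCqrn\"}", '
    r'"type":"None","param":"None","code":"401"}}'
)


def unwrap_litellm_error(body: str) -> dict:
    """Return the native Anthropic error if the body is a LiteLLM envelope
    around one; otherwise return the parsed body unchanged."""
    outer = json.loads(body)
    error = outer.get("error") if isinstance(outer, dict) else None
    if not isinstance(error, dict):
        return outer
    message = error.get("message")
    if not isinstance(message, str):
        return outer
    try:
        inner = json.loads(message)  # second decode: the escaped upstream JSON
    except json.JSONDecodeError:
        return outer  # plain-text message, e.g. an already-native error
    if isinstance(inner, dict) and inner.get("type") == "error":
        return inner
    return outer


native = unwrap_litellm_error(WRAPPED)
print(native["error"]["type"], native["request_id"])
```

On the captured body, the outer `error.type` is the string `"None"`, while the unwrapped object carries `authentication_error` and the Anthropic `request_id`.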
2. Older LiteLLM versions do not have passthrough
v1.52.0-stable only has /v1/messages (compatible mode). Any version-pinning policy must pin to a release at or after the one that first shipped /anthropic/{endpoint}; the route is confirmed present in main-stable as of the spike date.
3. Wildcard model config is mandatory
Without "anthropic/*" in model_list, LiteLLM rejects claude-sonnet-4-20250514 at /v1/messages (and the Anthropic-adapter path it routes to). On the passthrough route itself the model-list check does not block, but you still need auth to succeed.
Concerns still open after the spike
A. Error-response wrapping breaks retry logic
The Anthropic SDK and Claude Code both look at the response’s `error.type` to distinguish:
- `overloaded_error` → 5-minute backoff (see memory `Rate Limit Rules`)
- `rate_limit_error` → shorter retry
- `authentication_error` → fatal, don’t retry
With wrapping, all three look identical at the JSON-path level. Possible mitigations:
- LiteLLM config flag to disable error wrapping on the passthrough route (check whether one exists; it is not obvious from the OpenAPI schema).
- A tiny shim in front of LiteLLM (nginx or Envoy) that unwraps on 4xx/5xx — ugly but works.
- Accept degraded retry behavior; agent runtime catches all errors anyway and restarts from the stream.
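To make the failure mode concrete, here is a sketch of a retry policy keyed on the top-level `error.type` path, as this note describes the clients doing. It is illustrative only: the 30-second delay and the fallback branch are assumptions, the helper names are hypothetical, and the wrapped shape is trimmed to the two fields that matter.

```python
import json

# Action per native Anthropic error type. Only the 5-minute overloaded
# backoff and the fatal auth case come from this note; the 30 s delay and
# the fallback branch are assumptions for illustration.
RETRY_POLICY = {
    "overloaded_error": ("backoff", 300),
    "rate_limit_error": ("retry", 30),
    "authentication_error": ("fatal", 0),
}
UNRECOGNIZED = ("unrecognized", 0)


def retry_action(body: dict) -> tuple[str, int]:
    """Look up the action using only the top-level error.type JSON path."""
    error = body.get("error")
    error_type = error.get("type") if isinstance(error, dict) else None
    return RETRY_POLICY.get(error_type, UNRECOGNIZED)


def native_error(error_type: str) -> dict:
    """Native Anthropic error shape."""
    return {"type": "error", "error": {"type": error_type, "message": "..."}}


def wrapped_error(error_type: str) -> dict:
    """The same error inside the envelope shape observed in the spike."""
    return {"error": {"message": json.dumps(native_error(error_type)), "type": "None"}}


for name in RETRY_POLICY:
    print(name, retry_action(native_error(name)), retry_action(wrapped_error(name)))
```

Each native body maps to its own action, while every wrapped body falls through to the same fallback, which is the "all three look identical" problem in executable form.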
B. Prompt cache preservation (untested)
Anthropic charges ~10% of the full input-token price for cache reads. A long Claude Code session without cache preservation can therefore cost up to ~10× more on input tokens. We do not yet have direct evidence that cache_control in the forwarded body produces cache_read_input_tokens in the response through LiteLLM. The forwarding path looks right, but response-side transformation could strip fields.
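As a rough check on the ~10× figure, the sketch below compares the input cost of re-sending a fixed prompt prefix on every turn with and without cache hits. It assumes cache reads bill at 0.1× the base input price (the ~10% above) and that the turn which writes the cache bills at 1.25×. The write multiplier and the turn counts are assumptions for illustration. Output tokens and uncached suffix tokens are ignored, and the cache is assumed to stay warm between turns.

```python
CACHE_READ_MULT = 0.10   # the ~10% figure quoted above
CACHE_WRITE_MULT = 1.25  # assumption: premium on the turn that writes the cache


def prefix_cost_ratio(turns: int) -> float:
    """Uncached cost divided by cached cost for a prefix sent on every turn,
    measured in units of one full-price pass over the prefix."""
    if turns < 1:
        raise ValueError("turns must be >= 1")
    uncached = float(turns)  # full price on every turn
    cached = CACHE_WRITE_MULT + (turns - 1) * CACHE_READ_MULT
    return uncached / cached


for n in (2, 10, 50, 500):
    print(f"{n:>4} turns: {prefix_cost_ratio(n):.2f}x")
```

Under these assumptions the ratio is about 8× at 50 turns and approaches 10× only for very long sessions, so 10× is best read as a ceiling.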
C. SSE streaming (untested)
Claude Code uses SSE. If LiteLLM buffers, re-chunks, or drops keep-alive frames, long-running tool-use streams may hang or break. Needs a real stream.
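One way to check a recorded stream in the follow-up spike is to parse the SSE frames and validate the event order. The sketch below assumes the stream body was captured to text and uses Anthropic’s streaming event names; the sample frames are synthetic and trimmed to the fields the check needs. It covers ordering only, not buffering or timing, which would need timestamps at capture time.

```python
import json


def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Split a recorded SSE body into (event name, decoded data) pairs."""
    events = []
    for frame in raw.replace("\r\n", "\n").split("\n\n"):
        name, data_lines = None, []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                name = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if name is not None and data_lines:
            events.append((name, json.loads("\n".join(data_lines))))
    return events


def stream_is_well_ordered(events: list[tuple[str, dict]]) -> bool:
    """message_start first, message_stop last, and every content block
    opened before its deltas and closed before the stream ends."""
    names = [name for name, _ in events if name != "ping"]
    if not names or names[0] != "message_start" or names[-1] != "message_stop":
        return False
    open_blocks = set()
    for name, data in events:
        if name == "content_block_start":
            if data["index"] in open_blocks:
                return False
            open_blocks.add(data["index"])
        elif name in ("content_block_delta", "content_block_stop"):
            if data["index"] not in open_blocks:
                return False
            if name == "content_block_stop":
                open_blocks.remove(data["index"])
    return not open_blocks


# Synthetic sample stream (not captured from the spike).
SAMPLE = "\n\n".join([
    'event: message_start\ndata: {"type": "message_start"}',
    'event: content_block_start\ndata: {"type": "content_block_start", "index": 0}',
    'event: ping\ndata: {"type": "ping"}',
    'event: content_block_delta\ndata: {"type": "content_block_delta", "index": 0}',
    'event: content_block_stop\ndata: {"type": "content_block_stop", "index": 0}',
    'event: message_delta\ndata: {"type": "message_delta"}',
    'event: message_stop\ndata: {"type": "message_stop"}',
]) + "\n\n"

print(stream_is_well_ordered(parse_sse(SAMPLE)))
```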
D. Session continuation (untested, low risk)
--resume in Claude Code replays the local transcript. Each resulting API call passes through independently; session state is client-side. Low risk unless (B) or (C) breaks individual calls.
Decision framework
| Path | AC 5 unblocks? | Cost ceiling? | Provider portability? | Observability hook? | Risk |
|---|---|---|---|---|---|
| LiteLLM passthrough | ✅ if (A)(B)(C) pass | ✅ | ✅ | ✅ | Medium — spike incomplete |
| Envoy egress gateway | ✅ | ❌ | ❌ | Partial (network-level) | Low — bytes-transparent |
| Hybrid (LiteLLM for LLMs, Envoy for GitHub/npm) | ✅ | ✅ | ✅ | ✅ | Medium — two components |
Recommendation
Do one more spike before committing to the LiteLLM path:
- Use the existing dev-e Anthropic key (the OAuth token works; so does a real API key if one is set).
- Start LiteLLM with the same `main-stable` image + wildcard model config, real key in `ANTHROPIC_API_KEY`.
- Run a single Claude Code session through it (`ANTHROPIC_BASE_URL=http://litellm.../anthropic claude-code ...`), at least two turns with an explicit `cache_control`.
- Verify from the response JSON: `cache_read_input_tokens > 0` on turn 2 or later, stream chunks arrive in natural order, `tool_use` blocks survive round-trip.
- If any of these fail, default to the Envoy egress gateway path: simpler, less featureful, but transparent.
Budget: ~1.5h. One research PR either way with the verdict.
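The response-side checks in the verification step could be scripted roughly as follows. This is a sketch under stated assumptions: it operates on a fully assembled Messages API response (for a streamed turn, the message reconstructed from the stream), `check_turn` is a hypothetical helper, and the sample token counts are made up.

```python
def check_turn(message: dict) -> dict:
    """Extract the pass/fail signals the follow-up spike cares about
    from one assembled Messages API response."""
    usage = message.get("usage") or {}
    content = message.get("content") or []
    return {
        "cache_written": (usage.get("cache_creation_input_tokens") or 0) > 0,
        "cache_hit": (usage.get("cache_read_input_tokens") or 0) > 0,
        "has_tool_use": any(block.get("type") == "tool_use" for block in content),
    }


# Hypothetical responses for two consecutive turns (token counts invented).
turn_1 = {
    "content": [{"type": "text", "text": "..."}],
    "usage": {"input_tokens": 12, "cache_creation_input_tokens": 2048,
              "cache_read_input_tokens": 0, "output_tokens": 40},
}
turn_2 = {
    "content": [{"type": "tool_use", "id": "...", "name": "...", "input": {}}],
    "usage": {"input_tokens": 15, "cache_creation_input_tokens": 0,
              "cache_read_input_tokens": 2048, "output_tokens": 55},
}

for label, turn in (("turn 1", turn_1), ("turn 2", turn_2)):
    print(label, check_turn(turn))
```

If the cache fields are missing on turn 2 the check fails, but that alone does not say whether LiteLLM stripped them on the way back or the cache was never hit upstream.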
What I would not do
- Do not ship LiteLLM as Anthropic-adapter mode (i.e., `/v1/messages` with a named model in `model_list`). The model-name churn is real (every new Claude model requires a config PR), and the adapter path involves more transformation surface than passthrough.
- Do not attempt a Cloudflare-CIDR allowlist: 1500 rotating prefixes. This was already ruled out in the April 21 research; today’s AC 5 revert confirmed why.
- Do not self-host Cilium on k3s just for `toFQDNs`. The cluster is a single-node k3s on one GCE VM; the operational cost of a CNI swap is enormous relative to the agent count.
Sources
- LiteLLM Anthropic passthrough docs: https://docs.litellm.ai/docs/pass_through/anthropic_completion
- Anthropic prompt caching: https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- Earlier egress research (retracted): `/research/2026-04-21-egress-policy-options/`
- Cloudflare-fronted pitfall (today’s retrospective): `/research/2026-04-22-egress-policy-pitfall-cloudflare-fronted-apis/`
- AC 5 reverts: dashecorp/rig-gitops#143, #144