
LiteLLM passthrough spike for AC 5 redesign — findings

TL;DR. A 30-minute live spike on the rig cluster confirms that LiteLLM’s /anthropic/{endpoint} passthrough (present in main-stable, not in main-v1.52.0-stable) does reach api.anthropic.com: a fake-key request came back as a 401 carrying a real Anthropic request_id, so the request was forwarded upstream rather than rejected inside LiteLLM. However, error responses are wrapped in LiteLLM’s {"error": {"message": "<escaped upstream JSON>", ...}} envelope rather than returned as native Anthropic error JSON. That is a real compatibility risk for Claude Code’s error handling (retries, rate-limit backoff, 529 detection). Success-path fidelity (prompt caching, SSE streaming, tool_use) could not be verified without a real API key, so the next spike needs one. Recommendation: before committing the AC 5 redesign to the LiteLLM path, run a second spike with a real key and a real Claude Code session; if error wrapping turns out to break retries, fall back to an Envoy egress gateway (less featureful but truly transparent).

What this spike answered and what it didn’t

| Question | Answer from spike |
| --- | --- |
| Does LiteLLM have a passthrough route for Anthropic in 2026? | /anthropic/{endpoint} in main-stable. Not in older v1.52.0-stable. |
| Does that route actually forward to api.anthropic.com? | ✅ Real Anthropic request_id on fake-key 401 |
| Does it preserve cache_control in the forwarded body? | ⏳ Likely yes (body was forwarded; upstream validated auth before body), but requires real-key round-trip to check cache_read_input_tokens in the response |
| Does it preserve tools and native tool-use schema? | ⏳ Same: body forwarded, upstream reached, can’t verify response-side without real key |
| Does it preserve SSE streaming frame-for-frame? | ⏳ Not testable with auth-failed request; need real-key stream |
| Does it return Anthropic’s native error JSON on error? | No. Errors are wrapped in LiteLLM’s {"error": {"message": "<escaped>", ...}} envelope |

  • litellm-spike namespace on the rig k3s cluster (torn down at end)
  • Two images tested:
    • ghcr.io/berriai/litellm:main-v1.52.0-stable — only exposes /v1/messages, which is the Anthropic-compatible endpoint (goes through LiteLLM’s llms/anthropic/chat/handler.py, i.e. adapter, not passthrough). Rejected claude-sonnet-4-20250514 with Invalid model name. Not viable for Claude Code.
    • ghcr.io/berriai/litellm:main-stable — exposes /anthropic/{endpoint}, /vertex_ai/{endpoint}, /bedrock/{endpoint}, /gemini/{endpoint}, /openai_passthrough/{endpoint}. This is the real passthrough surface.
  • Config: minimal model_list with "anthropic/*" and "claude-*" wildcard entries plus master_key: sk-spike-dummy. ANTHROPIC_API_KEY set to a dummy value.
  • Tested via an in-cluster curlimages/curl pod.
POST /anthropic/v1/messages
body: { model: claude-sonnet-4-20250514, max_tokens: 100,
system: [{type: text, text: ..., cache_control: {type: ephemeral}}],
messages: [...], tools: [...] }
response: 401
body: {"error":{"message":"{\"type\":\"error\",\"error\":{\"type\":\"authentication_error\",
\"message\":\"invalid x-api-key\"},\"request_id\":\"req_011CaJd13qvypQ2wop9kCqrn\"}",
"type":"None","param":"None","code":"401"}}

Two things to note:

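  • request_id: req_011CaJd13qvypQ2wop9kCqrn. That req_ prefix is Anthropic’s, not LiteLLM’s, so the request was forwarded upstream. This eliminates the concern that LiteLLM was rejecting cache_control or tools before forwarding. It does not yet show that those fields arrive intact, because upstream rejected the key before validating the body.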
  • The outer envelope is LiteLLM’s. Anthropic’s native error JSON is escaped as a string inside LiteLLM’s wrapper. A client that parses response.json().error.type === "rate_limit_error" to trigger backoff will miss the signal — it sees error.type === "None" instead, with the real type buried in error.message as an escaped string.
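
The wrapping described above can be reproduced offline. The following is a minimal sketch, assuming the envelope always has the shape captured in this spike (native error JSON escaped as a string in error.message); the helper name native_error is hypothetical and is not part of LiteLLM or the Anthropic SDK.

```python
import json


def native_error(body: str) -> dict:
    """Return the native Anthropic error object, unwrapping the LiteLLM
    envelope when the body matches the shape seen in the spike."""
    outer = json.loads(body)
    if not isinstance(outer, dict):
        return {}
    if outer.get("type") == "error":
        return outer  # already native Anthropic error JSON
    error = outer.get("error")
    message = error.get("message") if isinstance(error, dict) else None
    if isinstance(message, str):
        try:
            inner = json.loads(message)
        except json.JSONDecodeError:
            return outer  # message is plain text, nothing to unwrap
        if isinstance(inner, dict) and inner.get("type") == "error":
            return inner
    return outer


# The 401 body captured in the spike, rebuilt so the escaping is exact.
wrapped = json.dumps({
    "error": {
        "message": json.dumps({
            "type": "error",
            "error": {"type": "authentication_error",
                      "message": "invalid x-api-key"},
            "request_id": "req_011CaJd13qvypQ2wop9kCqrn",
        }),
        "type": "None",
        "param": "None",
        "code": "401",
    }
})

print(json.loads(wrapped)["error"]["type"])   # what a naive client sees
print(native_error(wrapped)["error"]["type"])  # the type buried in the message
```

A naive client reads the outer type, which is the string "None"; only a second json.loads on error.message recovers authentication_error and the request_id.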

2. Older LiteLLM versions do not have passthrough


v1.52.0-stable only has /v1/messages (compatible mode). Any version pin must be at or above the release that first shipped /anthropic/{endpoint}; the spike only confirmed that the route is present in main-stable as of the spike date.

Without "anthropic/*" in model_list, LiteLLM rejects claude-sonnet-4-20250514 at /v1/messages (and the Anthropic-adapter path it routes to). On the passthrough route itself the model-list check does not block, but you still need auth to succeed.

A. Error-response wrapping breaks retry logic


The Anthropic SDK and Claude Code both look at response error.type to distinguish:

  • overloaded_error → 5-minute backoff (see memory Rate Limit Rules)
  • rate_limit_error → shorter retry
  • authentication_error → fatal, don’t retry

With wrapping, all three look identical at the JSON-path level. Possible mitigations:

  1. LiteLLM config flag to disable error wrapping on the passthrough route (check if one exists; not obvious from the openapi schema).
  2. A tiny shim in front of LiteLLM (nginx or Envoy) that unwraps on 4xx/5xx — ugly but works.
  3. Accept degraded retry behavior; agent runtime catches all errors anyway and restarts from the stream.
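
To make the failure mode concrete, here is a sketch of a retry policy keyed on error.type. It is an illustration only, not Claude Code’s or the Anthropic SDK’s actual logic: the 300-second value comes from the 5-minute backoff listed above, while the 30-second rate-limit delay and the "do not retry unknown types" default are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryDecision:
    retry: bool
    delay_seconds: Optional[float]


OVERLOADED_BACKOFF_S = 300.0  # 5-minute backoff, per the list above
RATE_LIMIT_BACKOFF_S = 30.0   # assumption: the source only says "shorter retry"


def decide(error_type: str) -> RetryDecision:
    """Map an error.type string to a retry decision (hypothetical policy)."""
    if error_type == "overloaded_error":
        return RetryDecision(True, OVERLOADED_BACKOFF_S)
    if error_type == "rate_limit_error":
        return RetryDecision(True, RATE_LIMIT_BACKOFF_S)
    if error_type == "authentication_error":
        return RetryDecision(False, None)
    # Unknown types, including the wrapper's literal "None": assumed fatal here.
    return RetryDecision(False, None)


# Native responses: three distinct outcomes.
for native in ("overloaded_error", "rate_limit_error", "authentication_error"):
    print(native, decide(native))

# Wrapped responses: the outer error.type is always "None", so all three
# upstream errors collapse into the same decision.
print("None", decide("None"))
```

Under this policy a wrapped overloaded_error is treated exactly like a bad key, which is the degradation mitigation 3 would be accepting.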

Anthropic bills cache reads at ~10% of the normal input-token price. A long Claude Code session without cache preservation therefore pays up to ~10× more for the repeated prompt prefix; the overall multiplier is lower, because new tokens and cache writes are not discounted. We do not yet have direct evidence that cache_control in the forwarded body produces cache_read_input_tokens in the response through LiteLLM. The forwarding path looks right but response-side transformation could strip fields.
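
A rough worked example of the cost exposure, as a sketch under stated assumptions: cache reads at 0.10× the base input price (the ~10% figure above), cache writes at 1.25× (an assumption for the short-lived cache tier), a fixed prompt prefix that is re-sent every turn, and output tokens ignored. The session sizes are invented for illustration, not measured.

```python
CACHE_READ_MULT = 0.10   # cache reads billed at ~10% of base input price
CACHE_WRITE_MULT = 1.25  # assumption: premium paid once when the prefix is cached


def input_cost_units(prefix_tokens: int, new_tokens_per_turn: int,
                     turns: int, cached: bool) -> float:
    """Total input cost in units of (base price per input token)."""
    total = 0.0
    for turn in range(turns):
        if not cached:
            total += prefix_tokens + new_tokens_per_turn
        elif turn == 0:
            # First turn writes the prefix into the cache.
            total += prefix_tokens * CACHE_WRITE_MULT + new_tokens_per_turn
        else:
            # Later turns read the prefix from the cache.
            total += prefix_tokens * CACHE_READ_MULT + new_tokens_per_turn
    return total


# Hypothetical session: 50k-token prefix, 500 new tokens per turn, 40 turns.
uncached = input_cost_units(50_000, 500, 40, cached=False)
cached = input_cost_units(50_000, 500, 40, cached=True)
ratio = uncached / cached
print(f"uncached={uncached:.0f} cached={cached:.0f} ratio={ratio:.2f}")
```

With these made-up numbers the ratio comes out a little above 7× rather than a full 10×, because the cache write and the uncached new tokens dilute the discount. The multiplier approaches 10× only as the prefix dominates and the session gets long.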

Claude Code uses SSE. If LiteLLM buffers, re-chunks, or drops keep-alive frames, long-running tool-use streams may hang or break. Needs a real stream.
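
One way to check a captured stream in the follow-up spike is to parse the raw SSE text and confirm the event sequence is intact. This is a minimal sketch: the parser handles only the event and data fields, the sample payloads are simplified placeholders rather than real Anthropic frames, and the "starts with message_start, ends with message_stop" check is an assumed minimum for a complete stream, not a full fidelity test.

```python
def parse_sse(raw: str) -> list[tuple[str, str]]:
    """Split raw SSE text into (event, data) pairs. Comment lines are skipped."""
    events = []
    for frame in raw.replace("\r\n", "\n").split("\n\n"):
        event, data_lines = "message", []
        for line in frame.split("\n"):
            if not line or line.startswith(":"):
                continue  # blank line or SSE comment (often used as keep-alive)
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # SSE strips one leading space after the colon
            if field == "event":
                event = value
            elif field == "data":
                data_lines.append(value)
        if data_lines:
            events.append((event, "\n".join(data_lines)))
    return events


def stream_looks_complete(events: list[tuple[str, str]]) -> bool:
    """True if the non-ping events start with message_start and end with message_stop."""
    names = [name for name, _ in events if name != "ping"]
    return bool(names) and names[0] == "message_start" and names[-1] == "message_stop"


# Simplified placeholder stream; real frames carry much larger JSON payloads.
sample = (
    'event: message_start\ndata: {"type": "message_start"}\n\n'
    'event: ping\ndata: {"type": "ping"}\n\n'
    'event: content_block_delta\ndata: {"type": "content_block_delta"}\n\n'
    'event: message_stop\ndata: {"type": "message_stop"}\n\n'
)
truncated = sample.rsplit("event: message_stop", 1)[0]

print(stream_looks_complete(parse_sse(sample)))
print(stream_looks_complete(parse_sse(truncated)))
```

Comparing the parsed event list from a direct api.anthropic.com call against the same request through the passthrough would show dropped or re-chunked frames; buffering would need timestamps on arrival, which this sketch does not capture.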

D. Session continuation (untested, low risk)


--resume in Claude Code replays the local transcript. Each resulting API call passes through independently; session state is client-side. Low risk unless (B) or (C) breaks individual calls.

PathAC 5 unblocks?Cost ceiling?Provider portability?Observability hook?Risk
LiteLLM passthrough✅ if (A)(B)(C) passMedium — spike incomplete
Envoy egress gatewayPartial (network-level)Low — bytes-transparent
Hybrid (LiteLLM for LLMs, Envoy for GitHub/npm)Medium — two components

Do one more spike before committing to the LiteLLM path:

  1. Use the existing dev-e Anthropic key (the OAuth token works; so does a real API key if one is set).
  2. Start LiteLLM with the same main-stable image + wildcard model config, real key in ANTHROPIC_API_KEY.
  3. Run a single Claude Code session through it (ANTHROPIC_BASE_URL=http://litellm.../anthropic claude ...), at least two turns with an explicit cache_control block.
  4. Verify from the response JSON: cache_read_input_tokens > 0 on turn 2 or later, stream chunks arrive in natural order, tool_use blocks survive round-trip.
  5. If any of these fail, default to the Envoy egress gateway path — simpler, less featureful, but transparent.
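
The checks in step 4 could be scripted roughly as follows. This is a sketch that assumes each turn’s response has been captured as a complete Messages API JSON object (for a streamed turn, reassembled from the events first); check_turn is a hypothetical helper and the sample dictionaries are invented, not spike output.

```python
def check_turn(response: dict, turn: int) -> list[str]:
    """Return human-readable failures for one captured response (1-based turn)."""
    failures = []
    usage = response.get("usage") or {}
    cache_read = usage.get("cache_read_input_tokens") or 0
    if turn >= 2 and cache_read <= 0:
        failures.append(f"turn {turn}: cache_read_input_tokens is {cache_read}")
    for block in response.get("content") or []:
        if block.get("type") == "tool_use":
            # A tool_use block should keep its id, name and input after the round-trip.
            missing = sorted({"id", "name", "input"} - block.keys())
            if missing:
                failures.append(f"turn {turn}: tool_use block missing {missing}")
    return failures


# Invented examples: a healthy second turn and a second turn with caching lost.
good = {
    "usage": {"input_tokens": 120, "cache_read_input_tokens": 48_000},
    "content": [{"type": "tool_use", "id": "toolu_x", "name": "bash",
                 "input": {"command": "ls"}}],
}
bad = {
    "usage": {"input_tokens": 48_120, "cache_read_input_tokens": 0},
    "content": [{"type": "tool_use", "name": "bash"}],
}

print(check_turn(good, turn=2))
print(check_turn(bad, turn=2))
```

An empty list for every turn from 2 onward would cover the caching and tool_use parts of step 4; stream ordering still needs a separate check on the raw SSE capture.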

Budget: ~1.5h. One research PR either way with the verdict.

  • Do not ship LiteLLM as Anthropic-adapter mode (i.e., /v1/messages with a named model in model_list). The model-name churn is real (every new Claude model requires a config PR), and the adapter path involves more transformation surface than passthrough.
  • Do not attempt a Cloudflare-CIDR allowlist. It would mean tracking roughly 1500 rotating prefixes. This was already ruled out in the April 21 research; today’s AC 5 revert confirmed why.
  • Do not self-host Cilium on k3s just for toFQDNs. The cluster is a single-node k3s on one GCE VM; the operational cost of a CNI swap is enormous relative to the agent count.