Envoy SNI egress gateway — verified working for AC 5 Phase 1
✅ Final resolution 2026-04-22 evening: pod-scoped DNS path shipped for review-e. After the two integration paths sketched in the v2 correction (HTTPS_PROXY + cluster-wide CoreDNS) were both tried and reverted, the winning design is pod-scoped DNS (option c, new):

- `dashecorp/rig-agent-runtime` chart `1.1.0` (#115) adds `.Values.dnsPolicy` / `.Values.dnsConfig` pass-through on every pod spec.
- Dedicated CoreDNS in the `egress-gw` namespace (2 replicas, rig-gitops #161 + parser fix #162). The Corefile uses `rewrite name exact <host> <envoy-egress-gateway-cluster-internal-name> answer auto` per allowlisted hostname; the target name is resolved through the real kube-dns. No Envoy ClusterIP is hardcoded, so the setup survives Service recreation.
- review-e wired via `dnsPolicy: None` + a `dnsConfig.nameservers` pointing at the dedicated CoreDNS, with kube-dns kept as secondary for availability fallback.

Live verification from a manually-scaled review-e pod: Discord gateway WSS connected; Anthropic API reached; GitHub App token minted; 3 MCP servers (advisor + github + memory) connected; Valkey stream consumer attached. Fresh heartbeat visible in rig-conductor `/api/agents`.

Third trap surfaced during this ship, worth remembering: the FROM field of CoreDNS’s `forward` plugin is a single zone. `forward cluster.local in-addr.arpa ip6.arpa 10.43.0.10` is misparsed (the parser treats the zones after the first as upstream IPs), so pods crashloop with `plugin/forward: not an IP address or file: "in-addr.arpa"`. Use one `forward` directive per zone.

Second open trap: the egress-dns Corefile and the Envoy SNI `filter_chain_match.server_names` are two places holding the same allowlist. If they drift, the pod’s DNS lookup succeeds but Envoy resets the TLS connection. A CI check is the next follow-up before opening up more traffic.

Still pending on the AC 5 Phase 1 checklist: dev-e (same values-only pattern, 3 HelmReleases) after 24h burn-in on review-e, then the default-deny NetworkPolicy that closes out Phase 1. That policy must allow Postgres 5432 to rig-conductor (the first-spike gap), kube-dns + egress-dns (53 UDP/TCP), and Envoy (443 + 8443).
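The review-e wiring described above comes down to a few lines of Helm values. The following is a minimal sketch, assuming the chart passes `dnsPolicy` and `dnsConfig` into the pod spec unchanged, as the chart note says. The egress-dns address `10.43.200.53`, the `review-e` namespace in the search list, and the `ndots` value are placeholders, not values read from the cluster. The kube-dns address reuses the `10.43.0.10` from the `forward` example.

```yaml
# Hypothetical values fragment for the review-e HelmRelease (not the shipped file).
dnsPolicy: "None"              # take resolver settings only from dnsConfig below
dnsConfig:
  nameservers:
    - 10.43.200.53             # placeholder: ClusterIP of the dedicated CoreDNS in egress-gw
    - 10.43.0.10               # kube-dns, kept as secondary for availability fallback
  searches:                    # assumption: restated because dnsPolicy None injects no search list
    - review-e.svc.cluster.local   # hypothetical pod namespace
    - svc.cluster.local
    - cluster.local
  options:
    - name: ndots
      value: "5"               # assumption: mirrors the usual cluster default
```

With `dnsPolicy: None`, Kubernetes builds the pod's resolver configuration entirely from `dnsConfig`. Short Service names therefore only resolve if a search list is supplied; whether review-e needs one depends on how it addresses rig-conductor.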
⚠️ Correction 2026-04-22 v2 (kept for the record). The TL;DR below claims `HTTPS_PROXY` is the “simpler integration option.” That was wrong and caused a rollback. The Envoy config in this doc is a raw-TCP + TLS-SNI-inspector listener. It does NOT speak HTTP `CONNECT`, which is what clients using `HTTPS_PROXY` send first. Result: all requests through `HTTPS_PROXY` fail at the CONNECT step, both allowlisted and non-allowlisted. Rolled back live: rig-gitops #155. Two real integration paths were sketched at that point: (a) reconfigure Envoy with `http_connection_manager` + `connect_matcher` + `dynamic_forward_proxy` to handle CONNECT properly; (b) CoreDNS rewrite so the allowlisted hostnames resolve directly to the Envoy service IP (no client env var needed). Gateway manifests (rig-gitops #153) stay deployed; only the client-wiring needed redesign. Path (b) was attempted cluster-wide first (#156 → #158) and caught Flux’s own github fetch; the pod-scoped variant (c) in the v3 note above is what shipped.

The SNI-matching + TCP-proxying behavior in the spike IS correct and byte-transparent. It was verified via `curl --connect-to`, which opens a direct TCP connection and skips the CONNECT step. That’s a valid integration pattern for dedicated per-host proxies (used e.g. for Istio egress), but not for a cluster-wide forward proxy.

TL;DR. A ~50-line Envoy config deployed as a single pod on the rig’s k3s cluster correctly enforces a hostname-based egress allowlist via SNI inspection. Verified end-to-end: `api.anthropic.com`, `api.github.com`, `ghcr.io` all reach their real upstream (HTTP 404 / 200 / 301 respectively, with a normal TLS session: the client sees the real server’s cert, because Envoy only proxies TCP bytes). `google.com` and `example.com` get connection reset at the filter-chain-match stage. This is the AC 5 Phase 1 short-term path recommended by the LiteLLM spike #2 research: ship this first, keep LiteLLM for the longer-term cost-ceiling story.
What was spiked
Deployed to the rig cluster in a throwaway `egress-gw-spike` namespace (torn down at end). One Envoy pod, one service, one ConfigMap with the Envoy YAML. One probe pod with `curlimages/curl`.
Envoy config (verbatim, working)
```yaml
admin:
  address: { socket_address: { address: 0.0.0.0, port_value: 9901 } }
static_resources:
  listeners:
    - name: https_listener
      address: { socket_address: { address: 0.0.0.0, port_value: 8443 } }
      listener_filters:
        - name: envoy.filters.listener.tls_inspector
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
      filter_chains:
        - filter_chain_match:
            server_names:
              - "api.anthropic.com"
              - "api.github.com"
              - "github.com"
              - "codeload.github.com"
              - "objects.githubusercontent.com"
              - "raw.githubusercontent.com"
              - "ghcr.io"
              - "registry.npmjs.org"
          filters:
            - name: envoy.filters.network.sni_dynamic_forward_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.sni_dynamic_forward_proxy.v3.FilterConfig
                port_value: 443
                dns_cache_config:
                  name: egress_cache
                  dns_lookup_family: V4_ONLY
            - name: envoy.filters.network.tcp_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
                stat_prefix: tcp
                cluster: dynamic_forward_proxy_cluster
  clusters:
    - name: dynamic_forward_proxy_cluster
      lb_policy: CLUSTER_PROVIDED
      cluster_type:
        name: envoy.clusters.dynamic_forward_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
          dns_cache_config:
            name: egress_cache
            dns_lookup_family: V4_ONLY
      connect_timeout: 5s
```

Two critical filters to get right:

- `tls_inspector` listener filter: extracts SNI from the ClientHello (which is sent in plaintext before the TLS handshake completes).
- `sni_dynamic_forward_proxy` network filter: takes the SNI that `tls_inspector` pulled out, asks the DNS cache for the IP, sets the upstream host on the connection. Without this, the TCP proxy has no upstream target and resets (which is what the first spike iteration did; this was the one-line fix).
How clients use it (revised)
The config above is a raw-TCP listener with TLS SNI inspection. It assumes clients open a TCP connection directly to `envoy:8443` and immediately start a TLS handshake; the ClientHello’s SNI extension is what the listener matches on. Three ways to get that flow in practice:

- DNS override (recommended for cluster-wide use): add a CoreDNS rewrite so allowlisted hostnames resolve to the Envoy service IP. Expose Envoy on port 443 (not 8443) so default TLS ports work. Clients use `https://api.anthropic.com/...` transparently; DNS sends them to the gateway. No env var, no client awareness. Blast radius is cluster-wide DNS changes.
- Reconfigure Envoy as an HTTP CONNECT proxy: add `http_connection_manager` + `connect_matcher` + an HTTP filter chain that calls `dynamic_forward_proxy`. Then `HTTPS_PROXY=http://envoy:8443` works as expected. More Envoy YAML, but no cluster DNS changes.
- Direct per-host rewriting in the client library: set `--connect-to api.anthropic.com:443:envoy:8443` per-request (the spike used this). Only usable for one-offs, not for general-purpose agent code.

⚠️ `HTTPS_PROXY` env var against the raw-TCP-SNI config in this doc does NOT work. This was verified live in rig-gitops #154 and rolled back in #155. The CONNECT step fails before SNI inspection happens.
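The DNS-override option can be sketched as a Corefile for a CoreDNS instance, wrapped in a ConfigMap. This is an illustration under assumptions, not the deployed manifest: the ConfigMap name `egress-dns` and the Service name `envoy.egress-gw.svc.cluster.local` are hypothetical, and only two allowlisted hosts are shown. The `rewrite ... answer auto` form and the one-`forward`-per-zone layout follow the resolution note at the top of this page. `10.43.0.10` is the kube-dns address from that note's `forward` example.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: egress-dns                 # hypothetical name
  namespace: egress-gw
data:
  Corefile: |
    .:53 {
        errors
        # One rewrite per allowlisted hostname. The query is renamed to the
        # gateway's cluster-internal name (hypothetical here); "answer auto"
        # restores the original name in the response sent to the client.
        rewrite name exact api.anthropic.com envoy.egress-gw.svc.cluster.local answer auto
        rewrite name exact api.github.com envoy.egress-gw.svc.cluster.local answer auto
        # One forward directive per zone; listing several zones on one line
        # makes the parser read the extra zones as upstream addresses.
        forward cluster.local 10.43.0.10
        forward in-addr.arpa 10.43.0.10
        forward ip6.arpa 10.43.0.10
    }
```

Because the rewritten name falls under `cluster.local`, it is answered by kube-dns, so the gateway's current ClusterIP is returned without being written into the Corefile. The hostnames listed here have to stay in step with `server_names` in the Envoy config.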
Verification results
From the probe pod with `curl --resolve ... --connect-to ...:$ENVOY_IP:8443`:
| Host | HTTP | Notes |
|---|---|---|
| `api.anthropic.com` | 404 | Reached Anthropic. 404 is what Anthropic serves at `/`; the TLS handshake and HTTP request round-tripped intact. |
| `api.github.com` | 200 | Reached GitHub API. Normal API root response. |
| `ghcr.io` | 301 | Redirect to github.com/... . Registry reachable. |
| `google.com` | 000 / exit 35 | Connection reset. No SNI match. |
| `example.com` | 000 / exit 35 | Connection reset. No SNI match. |
The allowed hosts return their real upstream response codes. The blocked hosts fail at TLS connect (SSL_connect: Connection reset by peer), never reaching HTTP. This is exactly the semantics AC 5 calls for.
Why this works where ipBlock did not
- No Cloudflare CIDR problem. The gateway matches on SNI (hostname in the TLS ClientHello), not on IP. It doesn’t care that `api.anthropic.com` resolves to 162.159.x.x today and 104.x.x.x tomorrow; the SNI value is stable.
- No OAuth-vs-API-key translation. Envoy proxies TCP bytes after SNI inspection. The TLS session is end-to-end between client and the real upstream. The client’s credentials (OAuth or API key) never touch Envoy. This is the critical difference from the LiteLLM path (which terminates the session and reauthenticates, breaking OAuth; see spike #2).
- No error wrapping. Envoy doesn’t parse HTTP; Anthropic’s 429/529/rate-limit responses reach the client verbatim, so retry logic works normally.
- No model-list churn. Hostnames don’t rev with every Claude model release.
What this does NOT give you
- No cost ceiling. Envoy sees encrypted TCP. It cannot count tokens or enforce a budget. Cost-accounting stays at the agent level (the current `TokenUsageProjection` path). Central cost control is LiteLLM’s reason to exist; this is why the recommendation is to ship Envoy now and add LiteLLM later for Priority 3, not replace one with the other.
- No provider portability. No model-name translation.
- No LLM-trace observability. For Langfuse/OTel spans you still need in-process instrumentation (which the rig already has in `claude-cli.js`) or a future LiteLLM layer.
Shipping plan
One GitOps PR adds:

- `apps/egress-gw/` directory with namespace, deployment (Envoy v1.32), ConfigMap, service.
- `apps/egress-gw/networkpolicy.yaml`: restricts the egress pod itself to DNS + Anthropic CIDR as well (defense in depth).
- Inclusion of the egress-gw kustomization in Flux’s app root.

Separate PRs to switch dev-e + review-e over. Pick ONE of the two integration paths first (see “How clients use it, revised” above):

- Path A, CoreDNS rewrite: add CoreDNS Corefile entries redirecting `api.anthropic.com`, `api.github.com`, etc. to the Envoy egress gateway’s cluster-internal name. Expose Envoy on port 443. No HelmRelease env-var changes. ~1h PR on the cluster DNS config.
- Path B, Envoy CONNECT mode: replace the current listener with `http_connection_manager` + `connect_matcher` + `dynamic_forward_proxy` HTTP filter. Then set `HTTPS_PROXY` on dev-e + review-e HelmReleases pointing at the Envoy egress gateway’s cluster-internal name + port.
After burn-in (~24h, one cycle of real agent runs), add back the default-deny egress NetworkPolicy on dev-e + review-e — this time allowing only (a) kube-dns, (b) rig-conductor namespace (api 8080 + valkey 6379 + postgres 5432), (c) the Envoy service. Because the proxy is internal, the allowlist becomes a pod selector, not an IP list — no CIDR churn, no Cloudflare problem.
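The default-deny policy described above could look roughly like the sketch below. It is an assumption-laden illustration rather than the planned manifest. The policy name, the `agents` namespace, and the `app.kubernetes.io/name: review-e` pod label are hypothetical. The `k8s-app: kube-dns` selector assumes the usual CoreDNS labelling. The egress-gw rule also opens port 53, because the resolution note at the top of this page requires egress-dns alongside Envoy on 443 + 8443.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-default-deny-egress        # hypothetical name
  namespace: agents                      # hypothetical: namespace running dev-e / review-e
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: review-e   # hypothetical label
  policyTypes:
    - Egress                             # anything not matched below is dropped
  egress:
    # (a) kube-dns
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns          # assumption: standard CoreDNS pod label
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    # (b) rig-conductor: api, valkey, postgres
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: rig-conductor
      ports:
        - { protocol: TCP, port: 8080 }
        - { protocol: TCP, port: 6379 }
        - { protocol: TCP, port: 5432 }
    # (c) egress-gw: dedicated CoreDNS plus the Envoy gateway
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: egress-gw
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
        - { protocol: TCP, port: 443 }
        - { protocol: TCP, port: 8443 }
```

Every destination is a namespace or pod selector, so the policy has no `ipBlock` entries to keep current when upstream addresses change.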
Phase 1 of AC 5 is shippable this week on this path.
- Live manifests: `apps/egress-gw`
Estimated effort
- Envoy manifests + GitOps PR: ~2h
- HelmRelease proxy-env changes + burn-in: ~2h (plus a monitored agent dispatch)
- Default-deny NetworkPolicy (revised for the proxy model): ~1h
Total ~5h of real work; ~24h calendar for the burn-in gate.
Sources
- Envoy `sni_dynamic_forward_proxy` filter docs: https://www.envoyproxy.io/docs/envoy/latest/configuration/listeners/network_filters/sni_dynamic_forward_proxy_filter
- Spike #1 (LiteLLM): `/research/2026-04-22-litellm-passthrough-spike/`
- Spike #2 (LiteLLM + OAuth): `/research/2026-04-22-litellm-passthrough-spike2-oauth/`
- AC 5 retrospective: `/research/2026-04-22-egress-policy-pitfall-cloudflare-fronted-apis/`