Envoy SNI egress gateway — verified working for AC 5 Phase 1

✅ Final resolution 2026-04-22 evening — pod-scoped DNS path shipped for review-e. After the two integration paths sketched in the v2 correction (HTTPS_PROXY + cluster-wide CoreDNS) were both tried and reverted, the winning design is pod-scoped DNS (option c, new):

  1. dashecorp/rig-agent-runtime chart 1.1.0 (#115) adds .Values.dnsPolicy / .Values.dnsConfig pass-through on every pod spec.
  2. Dedicated CoreDNS in the egress-gw namespace (2 replicas, rig-gitops #161 + parser fix #162). The Corefile carries one rewrite name exact <host> <envoy-egress-gateway-cluster-internal-name> answer auto rule per allowlisted hostname; the target name is resolved through the real kube-dns. No Envoy ClusterIP is hardcoded, so the setup survives Service recreation.
  3. review-e wired via dnsPolicy: None + a dnsConfig.nameservers pointing at the dedicated CoreDNS, with kube-dns kept as secondary for availability fallback.
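
The wiring in step 3 could look roughly like the values fragment below. This is a sketch, not the shipped file: the dnsPolicy / dnsConfig keys come from the chart change in step 1, 10.43.0.10 is the kube-dns address quoted later in this note, and 10.43.200.53 is a hypothetical ClusterIP for the dedicated CoreDNS Service.

```yaml
# Hypothetical values fragment for the review-e HelmRelease
# (assumes rig-agent-runtime chart >= 1.1.0, which passes these through to the pod spec).
dnsPolicy: "None"        # ignore the cluster default resolver settings entirely
dnsConfig:
  nameservers:
    - 10.43.200.53       # ASSUMPTION: ClusterIP of the dedicated CoreDNS in egress-gw
    - 10.43.0.10         # kube-dns, kept as secondary for availability fallback
```

With dnsPolicy set to None the pod only gets what dnsConfig lists, so any search domains or ndots options the workload relies on would also have to be supplied here.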

Live verification from a manually-scaled review-e pod: Discord gateway WSS connected; Anthropic API reached; GitHub App token minted; 3 MCP servers (advisor + github + memory) connected; Valkey stream consumer attached. Fresh heartbeat visible in rig-conductor /api/agents.

Third trap surfaced during this ship, worth remembering: the FROM field of CoreDNS’s forward plugin takes a single zone. forward cluster.local in-addr.arpa ip6.arpa 10.43.0.10 does not forward three zones: the parser treats everything after the first zone as an upstream address, and the pods crashloop with plugin/forward: not an IP address or file: "in-addr.arpa". Use one forward directive per zone. Second open trap: the egress-dns Corefile and the Envoy SNI filter_chain_match.server_names are two places holding the same allowlist. If they drift, a pod’s DNS lookup succeeds but Envoy resets the TLS connection. A CI check is the next follow-up before opening up more traffic.
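
A minimal sketch of the Corefile shape this implies, wrapped in a ConfigMap. It only illustrates the two points above (one rewrite per allowlisted hostname, one forward per zone) and is not the shipped file. The ConfigMap name and the Envoy Service name envoy.egress-gw.svc.cluster.local are hypothetical, and it assumes a CoreDNS version that accepts more than one forward directive in a server block.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: egress-dns                 # hypothetical name
  namespace: egress-gw
data:
  Corefile: |
    .:53 {
        errors
        # One rewrite per allowlisted hostname. Every name listed here must also
        # appear in Envoy's filter_chain_match.server_names, or Envoy resets the TLS.
        rewrite name exact api.anthropic.com envoy.egress-gw.svc.cluster.local answer auto
        rewrite name exact api.github.com envoy.egress-gw.svc.cluster.local answer auto
        # FROM is a single zone, so each zone gets its own forward directive.
        forward cluster.local 10.43.0.10
        forward in-addr.arpa 10.43.0.10
        forward ip6.arpa 10.43.0.10
    }
```

Only two hostnames are shown; the remaining allowlist entries would follow the same pattern.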

Still pending on the AC 5 Phase 1 checklist: dev-e (same values-only pattern, 3 HelmReleases) after 24h burn-in on review-e, then the default-deny NetworkPolicy that terminates Phase 1 — must allow Postgres 5432 to rig-conductor (the first-spike gap), kube-dns + egress-dns (53 UDP/TCP), and Envoy (443 + 8443).

⚠️ Correction 2026-04-22 v2 (kept for the record). The TL;DR below claims HTTPS_PROXY is the “simpler integration option.” That was wrong and caused a rollback. The Envoy config in this doc is a raw-TCP + TLS-SNI-inspector listener. It does NOT speak HTTP CONNECT, which is what clients using HTTPS_PROXY send first. Result: all requests through HTTPS_PROXY fail at the CONNECT step, both allowlisted and non-allowlisted. Rolled back live: rig-gitops #155. Two real integration paths were sketched at that point: (a) reconfigure Envoy with http_connection_manager + connect_matcher + dynamic_forward_proxy to handle CONNECT properly; (b) CoreDNS rewrite so the allowlisted hostnames resolve directly to the Envoy service IP (no client env var needed). Gateway manifests (rig-gitops #153) stay deployed; only the client wiring needed redesign. Path (b) was attempted cluster-wide first (#156 → #158) and caught Flux’s own github fetch; the pod-scoped variant (c) in the final-resolution note above is what shipped.

The SNI-matching + TCP-proxying behavior in the spike IS correct and byte-transparent — it was verified via curl --connect-to, which opens a direct TCP connection and skips the CONNECT step. That’s a valid integration pattern for dedicated per-host proxies (used e.g. for Istio egress), but not for a cluster-wide forward proxy.

TL;DR. A ~50-line Envoy config deployed as a single pod on the rig’s k3s cluster correctly enforces a hostname-based egress allowlist via SNI inspection. Verified end-to-end: api.anthropic.com, api.github.com, ghcr.io all reach their real upstream (HTTP 404 / 200 / 301 respectively, with a normal TLS session — the client sees the real server’s cert, because Envoy only proxies TCP bytes). google.com and example.com get connection reset at the filter-chain-match stage. This is the AC 5 Phase 1 short-term path recommended by the LiteLLM spike #2 research — ship this first, keep LiteLLM for the longer-term cost-ceiling story.

Deployed to the rig cluster in a throwaway egress-gw-spike namespace (torn down at end). One Envoy pod, one service, one ConfigMap with the Envoy YAML. One probe pod with curlimages/curl.

admin:
  address: { socket_address: { address: 0.0.0.0, port_value: 9901 } }
static_resources:
  listeners:
    - name: https_listener
      address: { socket_address: { address: 0.0.0.0, port_value: 8443 } }
      listener_filters:
        - name: envoy.filters.listener.tls_inspector
          typed_config:
            "@type": type.googleapis.com/envoy.extensions.filters.listener.tls_inspector.v3.TlsInspector
      filter_chains:
        - filter_chain_match:
            server_names:
              - "api.anthropic.com"
              - "api.github.com"
              - "github.com"
              - "codeload.github.com"
              - "objects.githubusercontent.com"
              - "raw.githubusercontent.com"
              - "ghcr.io"
              - "registry.npmjs.org"
          filters:
            - name: envoy.filters.network.sni_dynamic_forward_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.sni_dynamic_forward_proxy.v3.FilterConfig
                port_value: 443
                dns_cache_config:
                  name: egress_cache
                  dns_lookup_family: V4_ONLY
            - name: envoy.filters.network.tcp_proxy
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
                stat_prefix: tcp
                cluster: dynamic_forward_proxy_cluster
  clusters:
    - name: dynamic_forward_proxy_cluster
      lb_policy: CLUSTER_PROVIDED
      cluster_type:
        name: envoy.clusters.dynamic_forward_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
          dns_cache_config:
            name: egress_cache
            dns_lookup_family: V4_ONLY
      connect_timeout: 5s

Two critical filters to get right:

  1. tls_inspector listener filter — extracts SNI from the ClientHello (which is sent in plaintext before the TLS handshake completes).
  2. sni_dynamic_forward_proxy network filter — takes the SNI that tls_inspector pulled out, asks the DNS cache for the IP, sets the upstream host on the connection. Without this, the TCP proxy has no upstream target and resets (which is what the first spike iteration did — this was the one-line fix).

The config above is a raw-TCP listener with TLS SNI inspection. It assumes clients open a TCP connection directly to envoy:8443 and immediately start a TLS handshake — the ClientHello’s SNI extension is what the listener matches on. Three ways to get that flow in practice:

  • DNS override (recommended for cluster-wide use): add a CoreDNS rewrite so allowlisted hostnames resolve to the Envoy service IP. Expose Envoy on port 443 (not 8443) so default TLS ports work. Clients use https://api.anthropic.com/... transparently; DNS sends them to the gateway. No env var, no client awareness. Blast radius is cluster-wide DNS changes.
  • Reconfigure Envoy as an HTTP CONNECT proxy: add http_connection_manager + connect_matcher + an HTTP filter chain that calls dynamic_forward_proxy. Then HTTPS_PROXY=http://envoy:8443 works as expected. More Envoy YAML, but no cluster DNS changes.
  • Direct per-host rewriting on the client: pass curl’s --connect-to api.anthropic.com:443:envoy:8443 on each request (the spike used this). Only usable for one-offs, not for general-purpose agent code.
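
For the DNS-override option, "expose Envoy on port 443" does not require changing the listener: a Service can map 443 to the listener's 8443. A minimal sketch, assuming a Service named envoy and a pod label app: envoy (both hypothetical, not taken from the spike manifests):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: envoy              # hypothetical Service name
  namespace: egress-gw
spec:
  selector:
    app: envoy             # hypothetical pod label
  ports:
    - name: tls
      protocol: TCP
      port: 443            # what clients dial after the DNS rewrite
      targetPort: 8443     # the https_listener port in the Envoy config above
```

Clients then keep using https://api.anthropic.com/... on the default port, and the listener config stays as it is.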

⚠️ HTTPS_PROXY env var against the raw-TCP-SNI config in this doc does NOT work — verified live in rig-gitops #154 and rolled back in #155. The CONNECT step fails before SNI inspection happens.

From the probe pod with curl --resolve ... --connect-to ...:$ENVOY_IP:8443:

| Host | HTTP | Notes |
|---|---|---|
| api.anthropic.com | 404 | Reached Anthropic. 404 is what Anthropic serves at /; the TLS handshake and HTTP request round-tripped intact. |
| api.github.com | 200 | Reached GitHub API. Normal API root response. |
| ghcr.io | 301 | Redirect to github.com/... . Registry reachable. |
| google.com | 000 / exit 35 | Connection reset. No SNI match. |
| example.com | 000 / exit 35 | Connection reset. No SNI match. |

The allowed hosts return their real upstream response codes. The blocked hosts fail at TLS connect (SSL_connect: Connection reset by peer), never reaching HTTP. This is exactly the semantics AC 5 calls for.

  • No Cloudflare CIDR problem. The gateway matches on SNI (hostname in the TLS ClientHello), not on IP. It doesn’t care that api.anthropic.com resolves to 162.159.x.x today and 104.x.x.x tomorrow — the SNI value is stable.
  • No OAuth-vs-API-key translation. Envoy proxies TCP bytes after SNI inspection. The TLS session is end-to-end between client and the real upstream. The client’s credentials (OAuth or API key) never touch Envoy. This is the critical difference from the LiteLLM path (which terminates the session and reauthenticates, breaking OAuth — see spike #2).
  • No error wrapping. Envoy doesn’t parse HTTP; Anthropic’s 429/529/rate-limit responses reach the client verbatim, so retry logic works normally.
  • No model-list churn. Hostnames don’t rev with every Claude model release.
  • No cost ceiling. Envoy sees encrypted TCP. It cannot count tokens or enforce a budget. Cost accounting stays at the agent level (the current TokenUsageProjection path). Central cost control is LiteLLM’s reason to exist; this is why the recommendation is to ship Envoy now and add LiteLLM later for Priority 3, not replace one with the other.
  • No provider portability. No model-name translation.
  • No LLM-trace observability. For Langfuse/OTel spans you still need in-process instrumentation (which the rig already has in claude-cli.js) or a future LiteLLM layer.

One GitOps PR adds:

  • apps/egress-gw/ directory with namespace, deployment (Envoy v1.32), ConfigMap, service.
  • apps/egress-gw/networkpolicy.yaml — restricts the egress pod itself to DNS + Anthropic CIDR as well (defense in depth).
  • Inclusion of the egress-gw kustomization in Flux’s app root.

Separate PRs to switch dev-e + review-e over. Pick ONE of the two integration paths first (see “How clients use it, revised” above):

  • Path A (CoreDNS rewrite): add CoreDNS Corefile entries redirecting api.anthropic.com, api.github.com, etc. to the Envoy egress gateway’s cluster-internal name. Expose Envoy on port 443. No HelmRelease env-var changes. ~1h PR on the cluster DNS config.
  • Path B (Envoy CONNECT mode): replace the current listener with http_connection_manager + connect_matcher + dynamic_forward_proxy HTTP filter. Then set HTTPS_PROXY on dev-e + review-e HelmReleases pointing at the Envoy egress gateway’s cluster-internal name + port.

After burn-in (~24h, one cycle of real agent runs), add back the default-deny egress NetworkPolicy on dev-e + review-e — this time allowing only (a) kube-dns, (b) rig-conductor namespace (api 8080 + valkey 6379 + postgres 5432), (c) the Envoy service. Because the proxy is internal, the allowlist becomes a pod selector, not an IP list — no CIDR churn, no Cloudflare problem.
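
A sketch of what such a default-deny egress policy might look like, assuming the agent pods run in a namespace called review-e and the Envoy and egress-dns pods carry app: envoy and app: egress-dns labels (all three are hypothetical). The egress-dns rule is only needed with the pod-scoped DNS design from the resolution note at the top.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress          # hypothetical name
  namespace: review-e                # ASSUMPTION: namespace of the agent pods
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: [Egress]
  egress:
    # (a) DNS: kube-dns plus the dedicated egress-dns, 53 over UDP and TCP
    - to:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: kube-system }
          podSelector:
            matchLabels: { k8s-app: kube-dns }
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: egress-gw }
          podSelector:
            matchLabels: { app: egress-dns }   # hypothetical label
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
    # (b) rig-conductor namespace: api, valkey, postgres
    - to:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: rig-conductor }
      ports:
        - { protocol: TCP, port: 8080 }
        - { protocol: TCP, port: 6379 }
        - { protocol: TCP, port: 5432 }
    # (c) the Envoy egress gateway, selected by pod label rather than by IP
    - to:
        - namespaceSelector:
            matchLabels: { kubernetes.io/metadata.name: egress-gw }
          podSelector:
            matchLabels: { app: envoy }        # hypothetical label
      ports:
        - { protocol: TCP, port: 443 }
        - { protocol: TCP, port: 8443 }
```

Every destination is a namespace or pod selector, which is the point made above: nothing in the policy needs updating when an upstream provider changes its IP ranges.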

Phase 1 of AC 5 is shippable this week on this path.

  • Envoy manifests + GitOps PR: ~2h
  • HelmRelease proxy-env changes + burn-in: ~2h (plus a monitored agent dispatch)
  • Default-deny NetworkPolicy (revised for the proxy model): ~1h

Total ~5h of real work; ~24h calendar for the burn-in gate.