Default-deny egress on GKE Dataplane V2 — seven options, one layered recommendation
⚠️ RETRACTED 2026-04-22. This document assumed the rig runs on GKE Dataplane V2. It does not: the rig runs on k3s v1.34.6 on a single GCE VM (invotek-k3s in invotek-github-infra), with flannel CNI and NetworkPolicy enforced by the k3s-embedded kube-router. GKE-specific options (FQDNNetworkPolicy, Dataplane V2, self-Cilium-alongside-DPv2) do not apply. The k8s-native parts of the analysis (default-deny NetworkPolicy + CIDR allowlist, egress gateway, sidecar proxy) still stand. Phase 1 shipped 2026-04-22 as a plain NetworkPolicy with default-deny + CIDR allowlist (Anthropic 160.79.104.0/21 + GitHub /meta snapshot). Phase 2 hostname allowlisting is deferred pending fresh research: either a CNI swap (Calico/Cilium) or a plain-Envoy cluster egress gateway. Sections below are kept for historical context; the recommendation is superseded.
Original TL;DR (retracted). The AC 5 spec in the safety-foundation user story names “Cilium L7 NetworkPolicy.” The rig runs on GKE Dataplane V2, which does not expose full Cilium CRDs. Seven options were evaluated; three are viable. Layered recommendation: default-deny + CIDR, then FQDNNetworkPolicy, then egress gateway.
Why this research exists
AC 5 of the safety-foundation user story reads “Cilium L7 policy per agent namespace — everything else denied.” Good policy, wrong primitive for the cluster we actually run on. GKE Dataplane V2 supports standard Kubernetes NetworkPolicy (L3/L4, IP blocks, pod selectors) but not the Cilium CRDs that would give us toFQDNs. Shipping YAML blindly without this distinction would either fail to apply or enforce nothing useful.
This doc is the capability check that should have been the first step of AC 5.
The seven options
1. FQDNNetworkPolicy (GKE-native)
Google’s own CRD (networking.gke.io/v1alpha1) that layers FQDN rules on top of Dataplane V2. Still Preview in April 2026. Requires Dataplane V2 + GKE 1.26.4-gke.500 / 1.27.1-gke.400+ + kube-dns or Cloud DNS (no custom DNS). Enable via gcloud container clusters update --enable-fqdn-network-policy; on Standard you must restart anetd.
Limits: 50 IPs per FQDN resolution, 100 IP/hostname quota, wildcards match only one label deep (*.company.com matches eu.company.com, not eu.api.company.com), incompatible with inter-node transparent encryption, cannot target ClusterIP/Headless services.
Low effort, zero extra infra, exactly the right primitive — but Preview means no SLA.
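The one-label wildcard limit is easy to misread, so here is a minimal Python sketch of the rule as stated above. It only illustrates the described matching behaviour; it is not GKE's implementation, and the function name `fqdn_pattern_matches` is hypothetical.

```python
def fqdn_pattern_matches(pattern: str, hostname: str) -> bool:
    """Return True if hostname matches pattern, where '*' stands for exactly one label.

    Illustrative only: mirrors the documented "one label deep" rule,
    not the actual Dataplane V2 matching code.
    """
    pattern_labels = pattern.lower().rstrip(".").split(".")
    host_labels = hostname.lower().rstrip(".").split(".")

    # A wildcard never spans several labels, so the label counts must agree.
    if len(pattern_labels) != len(host_labels):
        return False

    for pattern_label, host_label in zip(pattern_labels, host_labels):
        if pattern_label == "*":
            # The wildcard must consume one non-empty label.
            if not host_label:
                return False
            continue
        if pattern_label != host_label:
            return False
    return True


if __name__ == "__main__":
    for host in ("eu.company.com", "eu.api.company.com", "company.com"):
        print(host, fqdn_pattern_matches("*.company.com", host))
```

Under this rule a vendor that serves traffic from nested subdomains needs one pattern per depth, which is worth checking before relying on a single wildcard entry.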
2. Cilium CNI alongside Dataplane V2
Not supported. Dataplane V2 is Google’s managed Cilium fork; you cannot run upstream Cilium on top. To get CiliumNetworkPolicy + toFQDNs you’d need a new cluster with Dataplane V2 disabled and self-managed Cilium, which means a DNS-intercept proxy, upgrade churn, and zero Google support. Not worth it for a 2-agent team.
3. Proxy sidecar per pod
Mature pattern (Netflix, Lyft, Airbnb variants via Envoy). NetworkPolicy allows egress only to localhost:<sidecar-port>; sidecar does SNI/HTTP host allowlisting.
Overhead: ~30–80 MB RAM + ~5–10m CPU (millicores) per pod, plus sidecar lifecycle coupling. Strong defense-in-depth (TLS SNI inspection, per-request logs), but config drift multiplies with pod count. Too much overhead at current scale.
4. Cluster-level egress gateway
A single Envoy Deployment (2 replicas) as HTTP/HTTPS forward proxy. Agent pods get HTTPS_PROXY=egress-gw:3128 + NetworkPolicy egress allows only egress-gw + kube-dns. Gateway enforces FQDN allowlist via SNI.
Less per-pod overhead than sidecars, one config file, single audit log. Downside: single point of failure (mitigated with 2 replicas + PDB), and apps must honor HTTPS_PROXY. curl and Python’s requests/urllib do; Go’s default transport does, but clients built with a custom http.Transport may not; Node’s built-in HTTP clients have typically needed explicit proxy configuration. Istio egress gateway works but drags in the whole mesh; a plain Envoy Deployment is ~80 lines of YAML.
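To make the gateway's allowlist decision concrete, the following Python sketch shows the kind of check a forward proxy could apply to each request. It assumes the gateway sees the requested host and port (for example from the CONNECT target, or from TLS SNI); the names `ALLOWED_HOSTS`, `parse_connect_target`, and `is_egress_allowed` are hypothetical, and a real deployment would express this in Envoy configuration rather than Python.

```python
# Hostnames taken from the AC 5 allowlist; the port set is an assumption.
ALLOWED_HOSTS = frozenset({"api.github.com", "api.anthropic.com"})
ALLOWED_PORTS = frozenset({443})


def parse_connect_target(target: str) -> tuple[str, int]:
    """Split a 'host:port' target into a normalised host and an integer port."""
    host, sep, port = target.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"malformed target: {target!r}")
    return host.lower().rstrip("."), int(port)


def is_egress_allowed(
    target: str,
    allowed_hosts: frozenset = ALLOWED_HOSTS,
    allowed_ports: frozenset = ALLOWED_PORTS,
) -> bool:
    """Deny by default: only exact allowlisted host and port pairs pass."""
    try:
        host, port = parse_connect_target(target)
    except ValueError:
        return False
    return host in allowed_hosts and port in allowed_ports


if __name__ == "__main__":
    for target in ("api.github.com:443", "evil.example.com:443", "api.github.com:22"):
        print(target, is_egress_allowed(target))
```

Exact-match hostnames keep the decision auditable: every allowed destination appears literally in one list, which is the "single reasoning point" benefit of a gateway.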
5. Pure ipBlock CIDR allowlist
Standard Kubernetes NetworkPolicy with ipBlock entries. GitHub publishes api.github.com/meta (verified; returns stable CIDRs). Anthropic publishes 160.79.104.0/21 outbound at platform.claude.com/docs/en/api/ip-addresses, stated stable.
npmjs.org is Cloudflare-fronted — CIDRs are not stable (~1500 prefixes, rotate). Works for GitHub + Anthropic, fails for npm.
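As a rough illustration of turning a /meta snapshot into ipBlock entries, the Python sketch below collapses adjacent CIDRs with the standard-library `ipaddress` module. The helper name `meta_to_ip_blocks`, the choice of the `api`, `git`, and `web` keys, and the sample ranges (RFC 5737 / RFC 3849 documentation prefixes, not real GitHub addresses) are all assumptions for the example.

```python
import ipaddress


def meta_to_ip_blocks(meta: dict, keys=("api", "git", "web")) -> list:
    """Convert selected CIDR lists from a /meta-style dict into NetworkPolicy peers.

    IPv4 and IPv6 are collapsed separately because
    ipaddress.collapse_addresses rejects mixed address families.
    """
    v4, v6 = [], []
    for key in keys:
        for cidr in meta.get(key, []):
            network = ipaddress.ip_network(cidr, strict=False)
            (v4 if network.version == 4 else v6).append(network)

    collapsed = list(ipaddress.collapse_addresses(v4))
    collapsed += list(ipaddress.collapse_addresses(v6))
    return [{"ipBlock": {"cidr": str(network)}} for network in collapsed]


if __name__ == "__main__":
    # Placeholder snapshot using documentation ranges only.
    sample_meta = {
        "api": ["192.0.2.0/25", "192.0.2.128/25"],
        "git": ["198.51.100.0/24", "2001:db8::/32"],
    }
    for peer in meta_to_ip_blocks(sample_meta):
        print(peer)
```

Collapsing keeps the generated policy short and makes diffs between snapshots easier to review when the list is refreshed.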
6. Anthropic-specific CIDR
160.79.104.0/21 outbound, 160.79.104.0/23 inbound. Published at the Anthropic IP addresses page, explicitly promised “will not change without notice.” No structured meta endpoint, but docs are canonical.
7. LiteLLM + separate egress gateway
Once Priority 3’s LiteLLM proxy ships: LiteLLM becomes the only pod allowed to reach Anthropic; agents reach LiteLLM via ClusterIP. A thin egress gateway only allowlists GitHub + npm. Clean separation, but two components to operate; defer until LiteLLM is actually landing.
Comparison
| Option | Impl effort | Ops cost | Defense strength | Lock-in | Fit for DPv2 |
|---|---|---|---|---|---|
| 1 FQDNNetworkPolicy | Low | Low | Medium (DNS-race, Preview) | GKE | Native |
| 2 Self-Cilium | Very high | High | High | None | Poor (rebuild cluster) |
| 3 Sidecar proxy | Medium | Medium-high | High | None | Good |
| 4 Egress gateway | Medium | Low | High | None | Good |
| 5 ipBlock only | Low | Medium (drift) | Low–Medium | None | Native |
| 6 Anthropic CIDR | Trivial | Trivial | Medium | None | Native |
| 7 LiteLLM + gateway | High | Medium | High | None | Good |
Recommendation for Invotek
Layered, starting minimal.
Phase 1 — this week (~1 hour)
Default-deny egress NetworkPolicy on the agent namespace. Allow:
- kube-dns (UDP/TCP 53)
- Internal rig-memory-mcp service (via podSelector)
- Anthropic 160.79.104.0/21 (via ipBlock)
- GitHub /meta CIDRs (snapshot; refreshed manually quarterly, or via a CronJob that updates a ConfigMap)
Block npm at the network layer entirely at this stage. Agents shouldn’t npm install at runtime; bake dependencies into agent images instead. This aligns with the pretool-guard blocklist convention for package installers.
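A minimal sketch of the Phase 1 policy, generated as a JSON manifest from Python (kubectl accepts JSON as well as YAML). Everything not named in the text is an assumption made for illustration: the namespace `agents`, the policy name, the `app: rig-memory-mcp` and `k8s-app: kube-dns` labels, the restriction to TCP 443, and the placeholder GitHub CIDR. It also assumes rig-memory-mcp runs in the same namespace as the agents.

```python
import json


def build_phase1_policy(namespace: str, external_cidrs: list) -> dict:
    """Build a default-deny egress NetworkPolicy with a small allowlist."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-egress-allowlist", "namespace": namespace},
        "spec": {
            # Empty selector: the policy applies to every pod in the namespace.
            "podSelector": {},
            # Listing Egress makes all egress denied unless a rule below allows it.
            "policyTypes": ["Egress"],
            "egress": [
                {   # DNS to kube-dns (label and namespace are assumptions)
                    "to": [{
                        "namespaceSelector": {
                            "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}
                        },
                        "podSelector": {"matchLabels": {"k8s-app": "kube-dns"}},
                    }],
                    "ports": [
                        {"protocol": "UDP", "port": 53},
                        {"protocol": "TCP", "port": 53},
                    ],
                },
                {   # Internal rig-memory-mcp service (label is an assumption)
                    "to": [{"podSelector": {"matchLabels": {"app": "rig-memory-mcp"}}}],
                },
                {   # External vendors by CIDR, HTTPS only (port is an assumption)
                    "to": [{"ipBlock": {"cidr": cidr}} for cidr in external_cidrs],
                    "ports": [{"protocol": "TCP", "port": 443}],
                },
            ],
        },
    }


if __name__ == "__main__":
    # 192.0.2.0/24 is a documentation range standing in for the GitHub snapshot.
    policy = build_phase1_policy("agents", ["160.79.104.0/21", "192.0.2.0/24"])
    print(json.dumps(policy, indent=2))
```

Generating the manifest from a list of CIDRs keeps the quarterly GitHub refresh to a data change rather than a hand edit of the policy.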
Phase 2 — next sprint (~½ day)
Add FQDNNetworkPolicy (Option 1) for api.github.com, api.anthropic.com, and (if truly needed at runtime) registry.npmjs.org. Preview status is acceptable for 2 agents; worst case the Phase-1 CIDR policy keeps the floor. Monitor anetd logs.
Phase 3 — when agent count ≥ 4 or a non-CIDR-stable vendor appears (~2 days)
Deploy a cluster egress gateway (plain Envoy, 2 replicas, SNI allowlist). Single reasoning point for outbound. Keep FQDNNetworkPolicy as belt-and-braces. Merge with Priority 3’s LiteLLM proxy when it ships: LiteLLM handles Anthropic, Envoy handles the rest.
What NOT to do
- Don’t run self-managed Cilium alongside DPv2. Unsupported, cluster rebuild required, zero benefit at our scale.
- Don’t ipBlock Cloudflare ranges for npm. 1500+ prefixes that rotate; you’ll be debugging npm install failures monthly. Bake deps into images or use FQDN instead.
- Don’t use Istio just for an egress gateway. Mesh tax is enormous for a 2-pod use case. Plain Envoy Deployment.
- Don’t sidecar every pod at current scale. Revisit only if per-agent policy differentiation becomes a real requirement.
- Don’t skip the default-deny NetworkPolicy while waiting for the “right” long-term solution. Ship the CIDR version this week; FQDN upgrade is additive.
AC 5 scoping impact
The user story AC 5 originally read:
Cilium L7 policy per agent namespace. Allowlist: api.github.com, api.anthropic.com (or the LiteLLM proxy once Priority 3 lands), registry.npmjs.org (or pinned registry mirror), the rig-memory-mcp service. Everything else denied.
This research refines the scope:
- Phase 1 satisfies the default-deny + essential allowlist intent without Cilium CRDs — ship this first and claim partial credit.
- Phase 2 adds hostname-based filtering via the GKE-native FQDNNetworkPolicy — closer to the “L7-ish” layer the AC intended.
- Phase 3 adds the true-L7 enforcement (SNI at an egress gateway) when scale justifies the operational cost.
Updating the user story AC wording to reflect this phased interpretation is a small follow-up — the original AC isn’t achievable as literally specified on GKE Dataplane V2 in 2026.