Tool Choices — An ADR for Every Pick

Capabilities

🟢 k3s-gcp-vm — k3s on single GCP VM (8 GB). invotek-k3s live.
🟢 fluxcd-gitops — FluxCD GitOps. rig-gitops → cluster reconciliation active.
🟢 keda-autoscaling — KEDA event-driven autoscaling. ScaledObject per agent.

!!! abstract "TL;DR"
    Every tool named in the whitepaper gets a defensible answer to: what problem, what alternatives, why this, license and backing, pricing, lock-in risk, migration path. The exercise changed several of the original picks after honest re-evaluation. Notably: drop Vault (overkill); SOPS + age is the deployed secrets pick (corrected through three rounds, see the Secrets section; the rig was always on SOPS, earlier retractions assumed otherwise); add Phoenix as an alternative to Langfuse (8 GB VM reality); defer feature flags (YAGNI at our scale); hedge pgroll with an inspectable SQL trail (single-vendor bus factor, correctly framed).

This document is the reasoning the other whitepaper docs just assert. Every line of the form “We use X” elsewhere has a row here that explains why X and not Y.

How to read each entry

Every pick is evaluated against the same rubric:

Column	What it captures
License	OSI-approved? Copyleft? Source-available-but-restricted? Specific license string (MIT, Apache-2.0, MPL-2.0, BSL, ELv2, AGPL, GPL, proprietary).
Owner / governance	Single company? Foundation? BDFL? Community-elected?
Pricing	Free for our use? Tier structure? Where does the pricing curve bite?
Bus factor	If the primary maintainer disappears, who keeps this alive?
Lock-in risk	If we need to leave, how bad is the migration?
Escape hatch	Concrete alternative we’d adopt if we had to move.
Re-evaluate when	The signal that tells us this pick is no longer right.

The goal is not to minimize every axis (impossible) but to be explicit about each, so that future us, or a future maintainer, can argue with our choices from a base of evidence rather than vibes.
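One way to keep the rubric honest is to hold each row as a record and flag any axis left blank. The sketch below is illustrative only: the `ToolEvaluation` class and its field names are assumptions made for this example, mapped one-to-one onto the rubric columns above, and are not something the rig ships.

```python
from dataclasses import dataclass, fields


@dataclass
class ToolEvaluation:
    """One ADR row. Field names are hypothetical; they mirror the rubric columns."""
    tool: str
    license: str = ""
    owner_governance: str = ""
    pricing: str = ""
    bus_factor: str = ""
    lock_in_risk: str = ""
    escape_hatch: str = ""
    re_evaluate_when: str = ""

    def missing_axes(self) -> list[str]:
        """Return the rubric columns that are still blank for this tool."""
        return [
            f.name
            for f in fields(self)
            if f.name != "tool" and not getattr(self, f.name).strip()
        ]


# Values copied from the secrets matrix in this document; the remaining
# axes are left blank on purpose to show what the check reports.
sops = ToolEvaluation(
    tool="SOPS + age",
    license="MPL-2.0",
    owner_governance="CNCF (getsops org)",
)
print(sops.missing_axes())
```

A row that returns a non-empty list is one where a pick is being asserted without the evidence the rubric asks for.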

Headline changes from original whitepaper

!!! warning "Where the honest re-evaluation changed the pick"
    - Secrets: drop Vault. SOPS + age + Flux is what’s actually deployed (verified live in `apps/*/*.sops.yaml`). External Secrets Operator + GCP Secret Manager is deferred until needed. GitHub App installation tokens are minted on-demand. OpenBao is the correct choice if and when we ever need Vault-class dynamic-secret capability, but not now. Earlier drafts claimed SealedSecrets was our current state; that was wrong (never deployed). Third-order correction recorded in the retraction log.
    - LLM observability: add Phoenix. Langfuse v3 wants 16 GB RAM min and a separate ClickHouse cluster. On our 8 GB VM, Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick.
    - Feature flags: defer. flagd + OpenFeature is defensible eventually, but for 1-2 humans and few services with no A/B testing need, env-vars-via-Kustomize is sufficient. Adopt a flag system when there’s a concrete targeting / experimentation requirement.
    - Unleash: explicitly reject. OSS edition deprecated and reached EOL 2025-12-31. Was previously a reasonable alternative; no longer is.
    - Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs. This is the single highest-leverage lock-in defense in the stack.

Secrets management

This is the section readers called out most. The original whitepaper promised Vault for short-lived credentials; honest re-evaluation says we don’t need it.

The Vault-vs-SOPS question (directly)

They solve overlapping but distinct problems:

Dimension	SOPS + age (or SealedSecrets)	Vault / OpenBao
What it encrypts	Files at rest in git	Secret values fetched at runtime
Dynamic secrets	No	Yes — mint short-lived DB users, cloud creds, GitHub App tokens
Ops footprint	Zero runtime service	3+ node HA cluster, unsealing, upgrades
Reviewable in PRs	Yes (encrypted blob diffs cleanly)	No (secrets never in git)
Revoke on compromise	Git commit + rotate everywhere	One API call, cluster-wide
Audit log	Git history	Vault audit log
Disaster recovery	Git repo + decryption key	Vault snapshot + unseal keys

For a high-traffic production system serving paying customers: Vault (or OpenBao) wins clearly — dynamic secrets + centralized revocation + audit log are irreplaceable.

For a 1-2 person rig on one 8GB VM: SOPS-style encryption + ESO shim + cloud-KMS-backed secret manager is simpler, cheaper, and covers the real threat model.

!!! danger "Retracted (third-order correction), 2026-04-17: we were never on SealedSecrets"
    Two previous retractions in this ADR (log below) framed a migration from SealedSecrets to SOPS. That framing was wrong about the ground-truth deployed state. Verified by grep-ing the repo: zero `kind: SealedSecret` references, zero sealed-secrets-controller HelmRelease, zero bitnami-labs image pulls. Every secret in the rig (Dev-E, Review-E, rig-conductor, Cloudflared) is already SOPS-encrypted (`*.sops.yaml` files in `apps/`). SOPS + age + Flux was always the deployed pick; there was never any SealedSecrets to migrate from.

    The earlier narratives (*"SealedSecrets keep"*, then *"SealedSecrets legacy migrating"*) were built on an earlier research-agent summary that asserted SealedSecrets was our current deployment. I accepted that without running `grep -r SealedSecret apps/`. I ran that grep today and it returned nothing. The Broadcom-paywall concern is real for anyone using SealedSecrets but was theoretical for us: we avoided the risk by already being on SOPS, not by deliberately migrating off. The "GHCR hedge" I proposed is unnecessary because we don't pull the image at all.

    **Meta-lesson added to the fresh-start evaluation log** (below): *verify ground-truth deployed state, not research-agent summaries, before writing retraction narratives.* A 10-second grep would have prevented two rounds of wrong framing.

Current pick (verified live in apps/ as of 2026-04-17):

SOPS + age + Flux kustomize-controller (deployed primary, has been all along)
  + .sops.yaml at repo root with creation_rules covering apps/*/*.sops.yaml
  + Cluster-scoped age key in flux-system/sops-age Secret
  + Per-app encrypted manifests: apps/dev-e/dev-e-secrets.sops.yaml,
    apps/review-e/review-e-secrets.sops.yaml,
    apps/rig-conductor/rig-conductor-secrets.sops.yaml,
    apps/cloudflared/tunnel-token.sops.yaml
  + Each kustomization sets decryption.provider: sops + secretRef.name: sops-age
  + GitHub App installation tokens minted on-demand at pod startup (1h TTL)
  + Static narrow-grant Postgres service accounts
  + External Secrets Operator deferred (not yet needed — git-at-rest scales to our inventory)
  + Vault / OpenBao deferred (no dynamic-secret requirement yet)

See docs/sops.md for the operational reference (how to bootstrap an encrypted secret, rotation procedure, key management).
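The layout listed above lends itself to a cheap guard against committing a plaintext secret under a `.sops.yaml` name. The following is a minimal sketch, not the rig's actual tooling: both function names are hypothetical, and the check is a text heuristic that assumes SOPS-encrypted YAML carries a top-level `sops:` metadata key and `ENC[` ciphertext markers. It does not parse YAML or attempt decryption.

```python
from pathlib import Path


def looks_sops_encrypted(text: str) -> bool:
    """Heuristic: a top-level `sops:` key plus at least one `ENC[` marker."""
    has_metadata = any(line.startswith("sops:") for line in text.splitlines())
    return has_metadata and "ENC[" in text


def plaintext_secret_files(repo_root: str) -> list[str]:
    """Return apps/*/*.sops.yaml paths (relative to repo_root) that do not look encrypted."""
    root = Path(repo_root)
    return [
        str(path.relative_to(root))
        for path in sorted(root.glob("apps/*/*.sops.yaml"))
        if not looks_sops_encrypted(path.read_text(encoding="utf-8"))
    ]
```

Calling `plaintext_secret_files(".")` from the repo root in CI and failing on a non-empty result would be one way to wire this in.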

Secrets tooling matrix

Tool	License	Owner	Our pick?	Why / why not
HashiCorp Vault	BSL 1.1	IBM (acq. Feb 2025)	No	BSL is tolerable (we’re non-competing) but HCP Vault Secrets EOL July 2026, IBM pricing plays, velocity concerns. Operationally expensive (3+ node HA, unsealing). Dynamic secrets are genuinely excellent — we just don’t need them yet.
OpenBao	MPL-2.0	Linux Foundation	Deferred	The correct answer if we ever need Vault-class capability. API-compatible with Vault; ESO works unchanged. Same ops burden as Vault. Adopt when we have a concrete unmet need for dynamic secrets.
SOPS + age	MPL-2.0	CNCF (getsops org)	Yes (deployed primary)	The actual deployed pattern. Verified live in `apps/*/*.sops.yaml`: all four active app namespaces use it. Flux’s kustomize-controller decrypts inline when a Kustomization sets `decryption.provider: sops` (no additional controller); age keys are simpler than GPG. MPL-2.0 forever, CNCF governance.
SealedSecrets	Apache-2.0	bitnami-labs (Broadcom-owned)	Not in use	Not deployed. Not used. Not a migration target. Broadcom’s Bitnami catalog paywall (verified real — `bitnami/postgresql:17.5.0` returns 404, same namespace as sealed-secrets-controller image) is a risk for shops that use it; we avoided it by default, not by design. Leaving the row here for the ADR audit trail.
External Secrets Operator	Apache-2.0	CNCF (incubating)	Deferred (add when needed)	The reversibility insurance once secrets outgrow git-at-rest. Backend-agnostic: swap GCP SM → OpenBao → Infisical by changing a CRD, workloads untouched.
GCP Secret Manager	Proprietary (GCP)	Google	Deferred (with ESO)	We’re already on GCP. Free tier covers our inventory. No dynamic secrets, but we don’t need them yet. Access via ESO = low lock-in.
Infisical	MIT (core) + SaaS	Infisical Inc. (YC)	No (for now)	Strong middle-ground between Bitwarden and Vault. Reasonable alternative if we outgrow GCP SM before we need Vault.
Doppler	Proprietary SaaS	Doppler Inc.	Reject	SaaS-only; no self-host. Makes Doppler outage = our deploy outage. Strongest lock-in on the list.
1Password Connect	Proprietary	1Password	Partial	We already use Bitwarden for human vault. 1Password Connect is fine if we switched, but no reason to.
CSI Secrets Store	Apache-2.0	Kubernetes	No	DaemonSet footprint too heavy on single 8 GB VM. Right choice for regulated workloads avoiding etcd.
cert-manager + trust-manager	Apache-2.0	CNCF (graduated)	Yes (add)	Table stakes. Non-controversial.

Re-evaluate secrets when

We add a second K8s cluster or second Postgres instance (static narrow grants stop scaling)
We take a compliance requirement that mandates audit log on secret access
A secret actually gets compromised (rotation scope pain becomes real)
Our team grows past ~5 operators (human secret-handling becomes the bottleneck)
The getsops.io project stalls or is archived (then fork or migrate to an alternative; currently healthy)

Retraction log — secrets picks (three rounds)

Honest disclosure of where the first, second, and third drafts of this ADR got it wrong about secrets.

Round	What the draft said	What changed	Why it was wrong
1	“SealedSecrets: Yes (keep) + ESO + GCP SM + GitHub App tokens. Governance risk post-Broadcom/Bitnami is real but no migration pressure yet.”	Promoted SealedSecrets to the declared primary; treated SOPS as redundant.	Accepted a research-agent summary that claimed SealedSecrets was our deployed state. Never grep-verified. Built a whole defense around an incorrect premise.
2	“SOPS + age is now primary. SealedSecrets is Legacy (migrating). Interim hedge: switch image source to `ghcr.io/bitnami-labs/sealed-secrets-controller`.”	SOPS promoted, SealedSecrets relabeled legacy-migrating, elaborate Broadcom-paywall hedge proposed.	Still wrong about ground truth. Corrected the right-and-wrong framing of the tools, but kept asserting SealedSecrets was our deployment. The Broadcom paywall research was real but the “migration path” was literature for a migration that didn’t need to happen.
3 (this entry)	“SOPS + age + Flux was always the deployed pick. There was never any SealedSecrets. Earlier retractions were based on an unverified premise.”	Corrected: zero SealedSecrets in the repo (`grep -r SealedSecret apps/` returns nothing). `.sops.yaml` at repo root covers all apps. Every deployed app uses `*.sops.yaml` with `decryption.provider: sops`.	The meta-lesson: verify ground-truth deployed state, not research-agent summaries, before writing retractions. A 10-second `grep -r` would have prevented two rounds of wrong framing. The second-order lesson — also named in my Fresh-start log’s “three patterns” — was “hedge narratives need re-verification.” This round adds: “assertions about current deployed state need re-verification too,” which is the stronger version of the same principle.

The incumbent-bias lesson still stands

Round 1’s incumbent-bias problem (documented in the Round 1 retraction above) was a real pattern: I skipped the fresh-start test on secrets while applying it elsewhere. Round 2’s correction — “SOPS wins on governance, operational cost, license permanence” — was the right conclusion reached via the right reasoning. What was wrong in Round 2 was the framing (claimed migration from incumbent) not the verdict (SOPS over SealedSecrets). Round 3 leaves the verdict intact and fixes only the factually-inaccurate framing.

The ground-truth-verification lesson

Add to the fresh-start evaluation meta-rules (below): before asserting what’s deployed, run grep against the repo. Research-agent summaries can be wrong or stale. I had the tools to verify in the first round and didn’t use them. Don’t skip verification of ground truth; it’s cheaper than two rounds of retraction.
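The grep described above can also be expressed as a small script, which is convenient if the check should run in CI rather than from memory. This is a sketch under assumptions: the function names are hypothetical, and the pattern is narrower than `grep -r SealedSecret apps/` because it matches only `kind: SealedSecret` lines in YAML files rather than every occurrence of the string.

```python
import re
from pathlib import Path

# Matches a YAML line such as "kind: SealedSecret". Commented-out lines do not match.
SEALED_KIND = re.compile(r"^\s*kind:\s*SealedSecret\s*$", re.MULTILINE)


def mentions_sealed_secret(manifest_text: str) -> bool:
    """True if the text declares a resource of kind SealedSecret."""
    return SEALED_KIND.search(manifest_text) is not None


def sealed_secret_manifests(root: str) -> list[str]:
    """Walk root recursively and return YAML files that declare a SealedSecret."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in {".yaml", ".yml"}:
            text = path.read_text(encoding="utf-8", errors="replace")
            if mentions_sealed_secret(text):
                hits.append(str(path))
    return hits
```

An empty list from `sealed_secret_manifests("apps")` corresponds to the "grep returned nothing" result recorded in Round 3.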

Fresh-start evaluation log (April 2026)

Honest application of the fresh-start test — “if we were picking this from scratch today with no prior context, what would we pick?” — to every tool category in this ADR. This log replaces the earlier “Broader incumbent-bias check” which was a checklist, not an evaluation. Two picks were verified in depth this round (LiteLLM and pgroll); the rest were audited against 2026-current alternatives.

Summary table

Category	Current pick	Fresh-start (2026) verdict	Rigor of this round
Secrets (git-at-rest)	SOPS + age (deployed; always was)	Corrected — SOPS is the deployed pick; earlier retractions #1 and #2 framed a migration that didn’t need to happen because SealedSecrets was never deployed	Deep (three rounds of retraction; see log)
Policy engine	Kyverno	Keep — unchanged	Shallow
Supply-chain signing	Sigstore (cosign, gitsign, rekor, slsa-github-generator)	Keep — no credible OSS competitor	Shallow
Networking / L7 egress	Cilium	Keep — no equivalent at L7 in OSS	Shallow
Metrics / logs / traces	Prometheus (local) + Grafana Cloud Free (managed) + OTel Collector	Keep — name VictoriaMetrics as lighter-weight alternative for multi-node future	Medium
LLM observability	Phoenix (8 GB) / Langfuse (16 GB+)	Keep — already retracted in earlier PR	Medium
LLM gateway	LiteLLM	Keep — verified. Portkey’s “fully OSS” March 2026 announcement kept per-key budget enforcement Enterprise-only; original pick rationale holds	Deep (verified this PR)
Progressive delivery	Flagger	Keep — Flux-native, Argo Rollouts fights Flux	Shallow
Feature flags	Deferred (flagd when needed)	Keep — no change; PostHog named as bundled alternative	Shallow
DB migration safety	pgroll	Keep — verified. Atlas has closed most gaps but does not implement expand/contract multi-version schemas — pgroll’s core differentiator stands	Deep (verified this PR) + hedge-narrative fixed
Supply chain (deps)	Dependabot + Socket.dev + Syft+Grype + package-age policy	Keep — name trivy as Grype alternative	Shallow
Container + CI	GitHub Actions + GHCR	Keep — incumbent-and-defensible (SCM + CI + registry bundle dominates)	Shallow
GitOps	Flux	Keep — incumbent-and-defensible (Flagger picks Flux-native; switching cascades)	Shallow
Cluster runtime	k3s	Keep — name Talos Linux as multi-node future consideration	Medium
Event-driven autoscale	KEDA	Keep — no credible competitor	Shallow
Cloud compute	GCP Compute	Keep — incumbent-and-defensible (Workload Identity + DNS already wired)	Shallow
Human vault	Bitwarden	Keep — name Vaultwarden (Rust self-host port) for future	Shallow
Docs site	MkDocs Material	Keep — Docusaurus/VitePress/Astro Starlight are reasonable if we want more customization	Shallow
Evaluation harness	Inspect AI (candidate)	Already flagged candidate — validate in Era 2	Already done

Verified deep this round: LiteLLM (stays)

Portkey Gateway went fully open source in March 2026 (Apache-2.0, 1T+ tokens/day). The original reason for picking LiteLLM was that it was the “only OSS option with per-virtual-key budget enforcement.” We re-verified that claim against Portkey’s 2026 documentation:

Portkey Budget Limits docs: “Budget Limit is currently only available to Portkey Enterprise Plan customers.”
Portkey Rate Limits docs: “Rate Limits are available exclusively to Portkey Enterprise customers and select Pro users.”

The 2026 “fully OSS” announcement was a governance + observability + MCP-registry open-sourcing, not a cost-controls open-sourcing. The original LiteLLM differentiator (per-virtual-key budget envelopes with duration windows returning 429 on exceed, free) still holds.

LiteLLM’s known bugs (#12905, #10750, #12977, #25386) don’t touch our specific config pattern (we have ~5 explicitly-configured keys, no team-scoped nesting, no pass-through routes, no AzureOpenAI direct client, no auto-created end users). Verdict: stay on LiteLLM. Revisit if (a) Portkey moves budget-limits to OSS, or (b) we scale past ~500 RPS where LiteLLM’s documented memory issues at 2k RPS start to bite.
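To make the differentiator concrete, here is a toy model of a per-key budget envelope with a duration window that answers 429 once the budget is spent. It illustrates the concept only and is not LiteLLM's implementation or API: the `BudgetEnvelope` class, its fields, and the choice to check the cost before accepting the request are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class BudgetEnvelope:
    """Hypothetical per-virtual-key budget with a resetting time window."""
    max_budget_usd: float
    window_seconds: float
    window_start: float = 0.0
    spent_usd: float = 0.0

    def charge(self, cost_usd: float, now: float) -> int:
        """Return an HTTP-style status: 200 if accepted, 429 if over budget."""
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window_seconds:
            self.window_start = now
            self.spent_usd = 0.0
        if self.spent_usd + cost_usd > self.max_budget_usd:
            return 429
        self.spent_usd += cost_usd
        return 200


# One key with a 1.00 USD budget per hour (illustrative numbers).
key = BudgetEnvelope(max_budget_usd=1.0, window_seconds=3600)
first = key.charge(0.6, now=0)         # accepted
second = key.charge(0.6, now=10)       # rejected, would exceed the budget
after_reset = key.charge(0.6, now=3600)  # accepted, window has rolled over
```

The property that matters for the rig is the hard stop: a runaway agent hits 429 instead of an unbounded bill.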

Verified deep this round: pgroll (stays) + hedge narrative corrected

Atlas (Ariga) has shipped rapidly in 2025–2026 — v1.2.0 on 2026-04-10, Kubernetes operator (Apache-2.0 with some EULA image layers), 50+ migration safety analyzers, weekly-to-biweekly release cadence. Feature gap against pgroll narrowed significantly. But Atlas does NOT implement real expand/contract with multi-version schema views + triggered backfill — it lints for unsafe DDL, emits concurrent-index DDL, and rolls out carefully, but it executes a single migration against a single schema.

For our specific workload (one Postgres, ~10–30 tables, expand/contract required for zero-downtime), pgroll is still the only tool that keeps v1 and v2 of a table simultaneously queryable. Verdict: stay on pgroll. Revisit if Xata misses another release quarter (no v0.17 by end of Q3 2026), announces a shutdown/acquisition, or Atlas ships native expand/contract.

Corrected the hedge narrative: earlier drafts implied pgroll migrations are plain SQL. They are not; they are pgroll-specific operation YAML. The correct hedge is to commit generated SQL alongside each operation YAML (via pgroll SQL emission) so schema history stays reconstructible. See the pgroll entry in the DB migration safety section below for the corrected wording.
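The hedge only works if the SQL trail is actually kept in step with the operation files, so a check along the following lines may be worth running in CI. It is a sketch under assumptions: the convention that `NN_name.yaml` sits next to `NN_name.sql` in one directory is hypothetical, as are the function names, and the script does not invoke pgroll or validate the SQL itself.

```python
from pathlib import Path
from typing import Iterable


def missing_sql_trail(filenames: Iterable[str]) -> list[str]:
    """Return operation YAML files that have no sibling .sql file."""
    names = set(filenames)
    return sorted(
        name
        for name in names
        if name.endswith(".yaml") and name[: -len(".yaml")] + ".sql" not in names
    )


def check_migrations_dir(migrations_dir: str) -> list[str]:
    """Apply the check to a directory on disk (assumed flat layout)."""
    return missing_sql_trail(p.name for p in Path(migrations_dir).iterdir())
```

A non-empty result means some migration's schema change can only be reconstructed by re-running pgroll, which is the situation the hedge is meant to avoid.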

Shallow-audited: what “fresh-start keep” actually means

For the shallow-audited picks (Kyverno, Sigstore, Cilium, Flagger, k3s, KEDA, Dependabot/Socket, GitHub/GHCR/Flux, Bitwarden, MkDocs), “fresh-start keep” means: I considered the current 2026 alternatives to each and none clearly beat the incumbent for our scale on license, governance, operational cost, and feature coverage. They are the picks I would make today if starting from scratch.

A stronger level of rigor would be individual per-category research agents (like I did for LiteLLM and pgroll). That’s worth doing when a specific concern surfaces (as with Portkey-announcement, Xata-release-cadence, Broadcom-Bitnami). Applying it to every pick every month is over-engineering.

What the deeper-audited rounds taught us

Four patterns emerged from the SOPS (three rounds), LiteLLM, and pgroll deep audits:

1. Announcements lie about feature scope. Portkey’s March 2026 “fully OSS” announcement was marketing; the feature we care about stayed paywalled. Always verify against current docs, not the press release.
2. Release cadence is a signal. pgroll’s decelerating release cadence (v0.16.1 in February, nothing since) is consistent with Xata being in maintenance-mode for pgroll as an internal-product-first tool. Not alarming on its own, but worth tracking.
3. Hedge narratives need re-verification. The “keep migrations as plain SQL” hedge in the earlier pgroll writeup turned out to be wrong: pgroll operation files are YAML. When we write a hedge, we should confirm it’s actually realisable, not just aspirational.
4. Ground-truth deployed state needs re-verification too. The SealedSecrets retraction had to happen three times because the first two rounds accepted a research-agent summary about “what’s currently deployed” instead of grep-verifying the repo. A 10-second grep (`grep -r SealedSecret apps/`) would have prevented it. This is the stronger version of pattern 3: “before asserting what’s deployed, verify.”
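The release-cadence signal is easy to track mechanically. The sketch below computes the gap since a tool's last release using the dates this document records for pgroll (v0.16.1 on 2026-02-17, verified 2026-04-17). The function names and the 90-day threshold are assumptions chosen for illustration, not a policy the ADR states.

```python
from datetime import date


def days_since(last_release: date, today: date) -> int:
    """Whole days between the last release and the day of the audit."""
    return (today - last_release).days


def is_stale(last_release: date, today: date, threshold_days: int) -> bool:
    """True when the gap exceeds the (hypothetical) staleness threshold."""
    return days_since(last_release, today) > threshold_days


pgroll_last = date(2026, 2, 17)   # v0.16.1, per the DB migration safety table
audit_day = date(2026, 4, 17)     # verification date used in this ADR
gap = days_since(pgroll_last, audit_day)
stale = is_stale(pgroll_last, audit_day, threshold_days=90)
```

On the audit date the gap is 59 days, which matches the "not alarming on its own, but worth tracking" reading above.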

Categories that warrant re-examination eventually

Not actionable today, flagged for future attention:

Bitwarden — picked because humans already use it. 1Password has better team-grant ergonomics; Vaultwarden is an unofficial Rust self-host port if we want more control; Infisical covers human+automation in one product (at the cost of a YC-company dependency). Re-evaluate if team grows past 3 operators or if we start needing per-project secret segregation.
MkDocs Material — Python-docs gold standard today, but Docusaurus (Meta-backed, React), VitePress (Vue/Vite), and Astro Starlight (Astro) are reasonable alternatives with better customization. Low priority — the docs site works.
k3s — ideal for single-VM. If we go multi-node for any reason, Talos Linux (immutable, API-only, no SSH, no shell) is a stronger security baseline. Not a k3s replacement — runs K8s, including k3s — but changes the host OS story.

Meta-rule, reaffirmed

When an ADR row reads “already deployed — keep” without a license/governance/operational comparison against the best fresh-start alternative, that’s a flag for re-examination. Path dependence is a cost, not a reason. Every pick in this ADR has now had the fresh-start test applied at least shallowly; two picks got deep verification this round; the retraction log above grows whenever a pick turns out to have been defended on sunk-cost reasoning.

Next scheduled re-audit: monthly for deep-picks (LiteLLM, pgroll, SOPS health at getsops.io, Langfuse/Phoenix VM sizing). Quarterly for shallow-picks. Immediate whenever a tool’s governance / license / owner changes (Broadcom-Bitnami style events). Always verify ground-truth deployed state with a grep before framing a retraction.

Policy engine

Tool	License	Owner	Our pick?	Why
Kyverno	Apache-2.0	CNCF (incubating)	Yes	YAML CRDs (no Rego language). Native Sigstore verification first-class. Reports in OpenReports format. Operational cost is lower for a 2-person team.
OPA Gatekeeper	Apache-2.0	CNCF (graduated)	No	Rego language requires learning + maintenance. Cosign verification exists but not as polished. Better for large orgs that already run OPA for non-K8s policy.
jsPolicy	Apache-2.0	Loft Labs	No	Single-vendor. JavaScript-based policies. Niche.
Open Policy Agent (OPA) core	Apache-2.0	CNCF	No	Lower-level; Gatekeeper is the K8s-admission wrapper.

Re-evaluate policy engine when

We need to write non-K8s policies (API gateway, CI/CD gates) — OPA’s broader reach becomes attractive
We have a Rego-fluent engineer — learning cost drops
Kyverno governance shifts unfavorably (currently healthy as CNCF incubating)

Supply chain / signing

All picks are Sigstore ecosystem; there’s no real competitor in 2026 open-source territory.

Tool	License	Owner	Our pick?	Why
Cosign	Apache-2.0	Sigstore (Linux Foundation)	Yes	Industry default. Keyless via Fulcio.
Gitsign	Apache-2.0	Sigstore	Yes	Agent commit signing with ephemeral Fulcio certs. Known gotcha: GitHub UI doesn’t display “Verified” — workaround is a CI-side `gitsign verify` check.
Rekor	Apache-2.0	Sigstore (public instance)	Yes (public)	We use the public Rekor. Private Rekor is possible but over-engineered.
slsa-github-generator	Apache-2.0	SLSA framework	Yes	Isolated-builder reusable workflow produces SLSA v1.0 L3 provenance. ~5 lines in any GitHub Actions file.
Notary v2 / ORAS	Apache-2.0	CNCF	No	OCI-artifact-focused; Sigstore covers our image case more simply.
Docker Content Trust	Proprietary-ish	Docker	No	Deprecated in favor of Sigstore.
HSM-backed PGP	varies	—	Reject	Long-lived keys to rotate. Worse threat model than Sigstore for our case.

Re-evaluate signing when

Sigstore public goods service (Fulcio / Rekor free tier) changes its commitment
Compliance requires private transparency log (run private Rekor)
Multi-tenant signing needs emerge (HSM delegation tooling)

Networking / service mesh

Tool	License	Owner	Our pick?	Why
Cilium (L7 via CNPs)	Apache-2.0	CNCF (graduated)	Yes	eBPF CNI with L7 HTTP/DNS filtering. The single biggest ROI defense against prompt-injection exfiltration. Gets 80% of service-mesh value at 10% of the cost.
Istio	Apache-2.0	CNCF (graduated)	No	Full service mesh — mTLS, traffic mgmt, etc. Overkill for our traffic volume + single-cluster setup. Revisit if we go multi-cluster or need mTLS to external services.
Linkerd	Apache-2.0	CNCF (graduated)	No	Simpler than Istio; still more than we need. Good alternative if Istio’s complexity is the only objection.
Calico (OSS)	Apache-2.0	Tigera	No	Solid CNI but L7 filtering requires Calico Enterprise (commercial). Cilium’s OSS L7 wins.
Native NetworkPolicy only	—	Kubernetes	No	L3/L4 only. Cannot filter by FQDN or HTTP method. Insufficient for our egress-allowlist goal.

Re-evaluate networking when

Multi-cluster deployment emerges — a service mesh becomes more compelling
Cilium L7 Envoy proxy memory overhead becomes the bottleneck on our VM
External mTLS requirement (e.g., to a customer-facing API) — then Istio or Linkerd

Observability — metrics, logs, traces

Split picks: local for SLO-decisive data, managed for everything else.

Tool	License	Owner	Our pick?	Why
Prometheus (local)	Apache-2.0	CNCF (graduated)	Yes	Source of truth for Flagger canary analysis. Must be local so SLO gates work when external egress blips. ~1 GB RAM.
Grafana Cloud Free	Proprietary SaaS	Grafana Labs	Yes	10k series metrics, 50 GB logs, 50 GB traces, 14-day retention. Fits a 1-2 person rig. Predictable paid scale.
Mimir / Thanos / VictoriaMetrics	AGPL-3.0 (Mimir) / Apache-2.0 (Thanos, VictoriaMetrics)	Grafana Labs / CNCF / VictoriaMetrics	No	Large-scale Prometheus backends. Overkill: Grafana Cloud Free covers us.
Datadog / New Relic	Proprietary	—	Reject	Vendor lock + pricing curves bite at scale.
Self-hosted LGTM stack	AGPL-3.0	Grafana Labs	No	Would memory-starve our 8 GB VM. Hybrid with Grafana Cloud is the right answer.
OpenTelemetry Collector	Apache-2.0	CNCF (incubating)	Yes	One exporter that forwards to both Prometheus (local) and Grafana Cloud (managed). Standard plumbing.

Re-evaluate metrics/logs when

Grafana Cloud Free limits bite — 10k series isn’t enough
Cost-visible scaling past ~$50/mo on Grafana Cloud makes self-hosted LGTM attractive on a bigger VM
Regulatory requirement forces log residency — self-host becomes mandatory

LLM observability

Tool	License	Owner	Pricing	Our pick?	Why
Langfuse (self-host)	MIT core + EE license-key for a few features	Langfuse GmbH (YC)	Free	Conditional	Official min 4 CPU / 16 GB RAM for app alone, plus ClickHouse cluster. Too heavy for 8 GB VM. Pick only if we size up.
Arize Phoenix (self-host)	ELv2 (source-available, non-OSI)	Arize AI	Free	Yes (for our scale)	OTel-native. SQLite or Postgres, no ClickHouse. Runs fine on our VM. ELv2 is non-concern for internal self-host (restricts SaaS resale, which we don’t do).
Langfuse Cloud Hobby	N/A (SaaS)	Langfuse GmbH	Free 50k units/mo	Backup	50k billable units sounds like a lot but complex agent traces = 15–20 units each. Hits cap fast; hard-stop at cap.
Helicone (self-host)	Apache-2.0	Helicone Inc.	Free self-host, 10k req/mo SaaS free	Alternative	Gateway + observability combined. Reasonable plan B if Phoenix unsuitable.
LangSmith	Proprietary SaaS	LangChain	Paid tiers	Reject	SaaS-only, paid-gated features, LangChain-native assumptions.
Braintrust	Proprietary SaaS	Braintrust	Free → $249/mo	Reject for cost	Strong on prompt regression but pricing bites for our scale.
Arize AI enterprise	Proprietary	Arize	Contact sales	Reject	Phoenix OSS covers us.
W&B Traces	Proprietary SaaS	CoreWeave	Paid	Reject	Broader ML observability overkill.
Cloudflare AI Gateway	Proprietary SaaS	Cloudflare	Free passive analytics	Yes (secondary)	Free since we’re already on Cloudflare. Passive cost tracking. Not sufficient primary — no virtual-key budgets.
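The Langfuse Cloud Hobby row is easier to judge with the arithmetic written out. The figures (50k units per month, 15 to 20 units per complex agent trace) come from the table above; the 30-day month and the function name are assumptions for the example.

```python
MONTHLY_UNITS = 50_000   # Langfuse Cloud Hobby cap, from the table above
DAYS_PER_MONTH = 30      # assumption for a rough daily figure


def traces_per_month(units_per_trace: int) -> int:
    """Whole traces that fit under the monthly cap."""
    return MONTHLY_UNITS // units_per_trace


best_case = traces_per_month(15)    # lighter traces
worst_case = traces_per_month(20)   # heavier traces
per_day_low = worst_case // DAYS_PER_MONTH
per_day_high = best_case // DAYS_PER_MONTH
print(f"{worst_case}-{best_case} traces/month, about {per_day_low}-{per_day_high} per day")
```

That is roughly 2,500 to 3,333 traces a month, or on the order of 83 to 111 a day, before the hard stop applies.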

The Langfuse vs Phoenix call

Langfuse is the de facto OSS LLM observability tool in 2026 and its feature set (prompt versioning, evaluations, dataset management, team workflows) is the richest. But its resource floor is genuine: v3 officially requires 4 CPU / 16 GB for the app, plus a ClickHouse cluster (3 nodes × 2 cores × 8 GB recommended). On our 8 GB single-VM k3s, it will boot and demo but will not survive sustained load.
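Adding up the figures in the paragraph above shows how far the recommended footprint sits from the current VM. This is plain arithmetic on the numbers quoted there; the variable names are illustrative.

```python
APP_CPU, APP_RAM_GB = 4, 16                  # Langfuse v3 app floor
CH_NODES, CH_CORES, CH_RAM_GB = 3, 2, 8      # recommended ClickHouse cluster
VM_RAM_GB = 8                                # our single k3s VM

total_cpu = APP_CPU + CH_NODES * CH_CORES
total_ram_gb = APP_RAM_GB + CH_NODES * CH_RAM_GB
ram_ratio = total_ram_gb / VM_RAM_GB

print(f"{total_cpu} cores, {total_ram_gb} GB RAM, {ram_ratio:.0f}x the current VM")
```

The recommended setup comes to 10 cores and 40 GB of RAM, about five times the memory the rig has.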

Phoenix is the honest pick at our scale. Lighter infra, OTel-native (so traces are portable), solid eval story. We lose Langfuse’s prompt-management UI and team features — acceptable for a 1-2 person team.

The lock-in defense that trumps both: instrument our code with OpenTelemetry GenAI semantic conventions, not Langfuse/Phoenix SDKs. Both platforms accept OTLP. The choice of observability backend becomes swappable.
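In practice the lock-in defense comes down to which attribute names end up on each span. The helper below is a dependency-free sketch that builds a span attribute dictionary using names from the OpenTelemetry GenAI semantic conventions as I understand them at the time of writing; the conventions are still evolving, so the keys should be checked against the current spec. The function name and the example model string are hypothetical.

```python
def genai_span_attributes(
    operation: str,
    request_model: str,
    input_tokens: int,
    output_tokens: int,
) -> dict[str, object]:
    """Build vendor-neutral attributes for one LLM call."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": request_model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


attrs = genai_span_attributes(
    operation="chat",
    request_model="example-model",  # placeholder, not a real model name
    input_tokens=1200,
    output_tokens=350,
)
# With the OpenTelemetry SDK installed, these would typically be attached
# to the active span, e.g. via span.set_attributes(attrs).
```

Because nothing here imports a Langfuse or Phoenix SDK, the same spans can be sent over OTLP to either backend.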

Re-evaluate LLM observability when

We scale the VM to 16 GB+ (Langfuse becomes viable)
Team grows and prompt-versioning / collaborative eval becomes load-bearing
Phoenix ELv2 changes (unlikely but license watch)

LLM gateway / proxy

Tool	License	Owner	Pricing	Our pick?	Why
LiteLLM	MIT (core) + commercial Enterprise	BerriAI (YC W23)	Free	Yes	Only OSS option with proper per-virtual-key budget enforcement + duration resets. OpenAI-format passthrough. Anthropic provider first-class.
Portkey Gateway	Apache-2.0 (fully OSS since March 2026)	Portkey	Free self-host; $9 per 100k logs managed	Documented fallback	Fully OSS escape hatch if LiteLLM stumbles. Processing 1T+ tokens/day across users.
Cloudflare AI Gateway	Proprietary SaaS	Cloudflare	Free	Secondary	Passive observability already in our stack. No virtual-key budgets — not sufficient as primary.
OpenRouter	Proprietary SaaS	OpenRouter	5.5% markup	No	Adds hop, no self-host, no per-key budgets like LiteLLM.
Kong AI Gateway	Proprietary (Enterprise plugin)	Kong	Enterprise contract	Reject	Enterprise pricing not justified.
TrueFoundry	Proprietary SaaS	TrueFoundry	Paid	Reject	Platform-level opinion.

The LiteLLM SPoF concern

LiteLLM is a single point of failure: if it’s down, all agents block.

Mitigations:

Run ≥2 replicas behind a k3s Service with Postgres + Redis shared state.
Client-side fallback to direct api.anthropic.com after N seconds of 5xx from proxy — but this bypasses budget enforcement by design; document as acceptable degraded mode.
Monitor proxy health as a first-class SLO in Prometheus.
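The client-side fallback in the mitigation list above can be sketched as a small state tracker: remember when the current run of 5xx responses started, and switch to the direct endpoint once that run has lasted N seconds. The `ProxyFallback` class and its method names are hypothetical, the clock is injectable so the logic can be tested without sleeping, and the sketch deliberately leaves out the HTTP calls themselves.

```python
import time
from typing import Callable, Optional


class ProxyFallback:
    """Tracks consecutive proxy 5xx responses and decides when to go direct."""

    def __init__(self, grace_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.first_failure_at: Optional[float] = None

    def record(self, status: int) -> None:
        """Feed in the status code of each response from the proxy."""
        if 500 <= status <= 599:
            if self.first_failure_at is None:
                self.first_failure_at = self.clock()
        else:
            # Any non-5xx response ends the failure run.
            self.first_failure_at = None

    def use_direct(self) -> bool:
        """True once the proxy has been failing for at least grace_seconds."""
        if self.first_failure_at is None:
            return False
        return self.clock() - self.first_failure_at >= self.grace_seconds
```

While `use_direct()` is true, requests bypass the gateway and therefore its budget enforcement, which is the degraded mode the list says to document as acceptable.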

LiteLLM funding / license risk

License: MIT intact. No BSL drift as of Q2 2026.
Owner: BerriAI, YC W23, publicly reported ~$2.1M seed. No Series A disclosed. Venture-stage risk is real.
12–24 month watch: (a) BSL-style moves on enterprise features only would be fine for us (we’re on OSS), (b) aggressive monetization could lock some OSS features behind keys, (c) a quiet under-maintenance period is the likeliest failure mode.

Re-evaluate gateway when

LiteLLM license changes or project health deteriorates
Our traffic exceeds what LiteLLM’s Postgres-backed rate limiter can handle (~10k RPS)
Portkey Gateway momentum surpasses LiteLLM’s

Progressive delivery / canary

Tool	License	Owner	Our pick?	Why
Flagger	Apache-2.0	CNCF (graduated via Flux)	Yes	FluxCD-native. Owns its own Canary CRD that shadows the Deployment, so there are no field-level fights with Flux. Webhooks at every phase for rig-conductor integration. ~100 MB controller footprint.
Argo Rollouts	Apache-2.0	CNCF (graduated via Argo)	No	Mutates fields Flux also reconciles — recurring drift fights. Pair with ArgoCD, not Flux.
Keptn	—	CNCF archived 2025-09-03	Reject (dead)	Dynatrace team pulled back. Do not adopt.
OpenKruise Rollout	Apache-2.0	OpenKruise (CNCF sandbox)	No	Mostly Alibaba ecosystem. Right only if we need StatefulSet canary.

Re-evaluate canary when

We migrate from Flux to ArgoCD (Argo Rollouts becomes natural)
Flagger project health deteriorates (currently active)

Feature flags

!!! danger "Honest YAGNI"
    Feature flags at our scale (1-2 humans, few services, no A/B testing need) are overkill today. Env vars + Kustomize overlays per environment cover the actual use case (compile/deploy-time toggles) at zero operational cost. We should adopt a flag system when there’s a concrete targeting, experimentation, or kill-switch need, not before.

Tool	License	Owner	Our pick?	Why
env vars + Kustomize overlays	—	—	Yes (now)	Zero ops cost. Covers 100% of actual current need.
OpenFeature + flagd	Apache-2.0	CNCF (incubating)	Deferred	Right pick when we need runtime toggles. Sidecar ~30-60 MB. JSON flag config.
Flipt	GPL-3.0 (server) + MIT (clients)	flipt-io	Alternative	GitOps-native YAML flags. Single Go binary. GPL server is sticky but fine for internal use.
GrowthBook	MIT	GrowthBook	Alternative	If we need A/B experimentation with stats out of the box. OpenFeature SDK.
PostHog feature flags	MIT (self-host) + SaaS	PostHog	Consider if we adopt PostHog	Bundled with analytics. Zero marginal cost if already using PostHog.
Unleash	Apache-2.0 core (EOL 2025-12-31)	Unleash	Reject (dying OSS)	Enterprise-only going forward. Avoid for new adoption.
LaunchDarkly	Proprietary SaaS	LaunchDarkly	Reject	$12/seat/mo + MAU overages. Overkill by an order of magnitude.
Statsig	Proprietary SaaS	Statsig	Reject for lock-in	Generous free tier (1M MTUs) but SaaS-only.
ConfigCat	Proprietary SaaS	ConfigCat	Alternative	Free tier forever, simple, Hungarian SaaS. If we want SaaS and not LaunchDarkly.

Re-evaluate feature flags when

We need per-user or per-tenant targeting that env vars can’t express
A/B experimentation with real statistical significance becomes a product need
We adopt PostHog for analytics (flags come bundled)
Any T1 incident where a kill switch faster than kubectl rollout undo would have saved us

DB migration safety

Tool	License	Owner	Our pick?	Why
pgroll	Apache-2.0	Xata	Yes	Automates expand/contract safely for Postgres. It is the only tool in this category that keeps v1 and v2 of a schema simultaneously queryable via views, with trigger-based backfill. Atlas does not implement this; it lints for unsafe DDL and rolls out carefully but executes a single migration against a single schema. Moderate single-vendor bus factor (Xata is ~27 employees, still operating; pivoted mid-2025 to serverless Postgres with Simplyblock). Release cadence has slowed: the latest release, v0.16.1, shipped 2026-02-17. Verified April 2026.
Atlas (Community Edition)	Apache-2.0 + EULA on official binaries	Ariga	Alternative / hedge	Declarative schema-as-code + linting. Source Apache-2.0; official binaries under Atlas EULA. Build from source if EULA matters.
Flyway Community	Apache-2.0 core (Redgate-owned)	Redgate	Alternative	Classic versioned SQL migrations. Not zero-downtime-automated. License creep concern (Redgate moving features out of OSS).
gh-ost	MIT	GitHub	Irrelevant	MySQL only. We’re on Postgres.
Reshape	Apache-2.0	fabianlindfors	Reject (bus factor 1)	Single-author, author’s focus shifted. Don’t adopt for production.
Bytebase	Apache-2.0 (5-source limit)	Bytebase	No	UI-heavy workflow tool. Overkill for 1-2 person rig.

The pgroll bus factor hedge

!!! warning “Corrected: pgroll files are YAML, not SQL” An earlier draft claimed we could “keep migrations inspectable SQL… runnable by plain psql.” That’s wrong — pgroll migration files are pgroll-specific operation YAML (e.g., add_column, drop_column, set_not_null), not raw SQL. The actual hedge: keep a parallel SQL trail. For every pgroll operation that runs, commit the generated SQL (pgroll migrate --dry-run --json | pgroll generate-sql) alongside the operation YAML. If Xata folds, the SQL trail lets us reconstruct schema state; we then pick up with plain Flyway or Atlas going forward. This does not make individual operations portable — it keeps the history reconstructible.
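
The parallel SQL trail only works as a hedge if it is complete, so a CI check is a natural companion. The sketch below is one possible check, assuming a hypothetical layout in which each operation file such as `01_add_column.yaml` has a sibling `01_add_column.sql`; the function name and the naming convention are assumptions, and the check does not invoke pgroll itself.

```python
from pathlib import PurePath
from typing import Iterable, List


def missing_sql_trail(operation_files: Iterable[str],
                      sql_files: Iterable[str]) -> List[str]:
    """Return operation files that have no SQL file with the same stem."""
    # Stems of all committed SQL files, e.g. "01_add_column".
    have_sql = {PurePath(name).stem for name in sql_files if name.endswith(".sql")}
    # Any operation YAML whose stem is absent from the trail is a gap.
    return sorted(
        name
        for name in operation_files
        if name.endswith((".yaml", ".yml")) and PurePath(name).stem not in have_sql
    )
```

In CI, the two arguments would come from listing the migrations directory and the SQL trail directory; a non-empty result fails the build, so a migration cannot merge without its generated SQL.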

Re-evaluate DB migrations when

Xata pivots or folds
A migration we need is outside pgroll’s expand/contract model (e.g., type changes with data loss implications)

Supply chain for dependencies

Tool	Our pick?	Why
GitHub Dependabot (malware mode)	Yes	Free with GitHub. Detects npm malware against GitHub Advisory Database malware feed.
Socket.dev	Yes	Per-dependency security score. PR check fails below threshold.
Package-age policy (14d minimum)	Yes (via CI gate)	Datadog’s pattern. Gives typosquats and account-takeover releases time to be caught before we install them.
Syft (SBOM) + Grype (CVE scan)	Yes	Apache-2.0, Anchore, widely adopted.
Snyk	Reject for cost	Dependabot + Socket covers it cheaper.
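
The package-age gate in the table reduces to a date comparison. The sketch below shows that core check under stated assumptions: the publish timestamp is already available as a timezone-aware `datetime` (how it is fetched from a registry is out of scope here), and the names `MIN_AGE` and `is_too_young` are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Policy from the table above: a release must be at least 14 days old.
MIN_AGE = timedelta(days=14)


def is_too_young(published_at: datetime,
                 now: Optional[datetime] = None,
                 min_age: timedelta = MIN_AGE) -> bool:
    """Return True if a release is younger than the minimum allowed age."""
    if published_at.tzinfo is None:
        raise ValueError("published_at must be timezone-aware")
    current = now if now is not None else datetime.now(timezone.utc)
    return current - published_at < min_age
```

A CI gate would run this over every newly added or bumped dependency and fail the pull request if any release is too young, leaving an explicit override for urgent security patches as a policy decision.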

Container and CI

Tool	License	Owner	Our pick?	Why
GitHub Actions	Proprietary	GitHub	Yes	Already in use. OIDC to cosign and Sigstore. Moderate vendor lock-in, acceptable given GitHub is also our SCM.
GHCR	Proprietary	GitHub	Yes	Already in use. Paired with Actions. Lock-in acceptable.
Flux CD	Apache-2.0	CNCF (graduated)	Yes	Already our GitOps. Stable.
Argo CD	Apache-2.0	CNCF (graduated)	No	Alternative to Flux; switching costs exceed benefit for us.

Cluster and runtime

Tool	License	Owner	Our pick?	Why
k3s	Apache-2.0	CNCF (sandbox, maintained by SUSE)	Yes	Lightweight K8s, single-binary install, fits 8 GB VM.
KEDA	Apache-2.0	CNCF (graduated)	Yes	Event-driven autoscaling + scale-to-zero. Already deployed.
GCP Compute (one VM)	Proprietary	Google	Yes	Small bill, predictable, good enough.

Re-evaluate cluster when

We outgrow a single VM (multi-node Kubernetes warranted)
GCP pricing shifts unfavorably
k3s project health deteriorates (currently healthy under SUSE)

Human vault and docs

Tool	License	Owner	Our pick?	Why
Bitwarden	GPL-3.0 (self-hostable) + SaaS	Bitwarden Inc.	Yes	Already in use. Self-host option if SaaS changes unfavorably.
MkDocs Material	MIT (community) + commercial Insiders	Martin Donath	Yes	Our docs-site. Community edition is sufficient.

Evaluation

Tool	License	Owner	Our pick?	Why
Inspect AI	MIT	UK AISI	Candidate — validate in Era 2	Released March 2026. Adopted by METR, Apollo, major labs. OSS, agent-aware, production-shaped. Too new to call chosen; revisit once we have a nightly run with 60 days of data comparing it against raw `pytest`-style harnesses.
SWE-bench Pro	MIT	Scale AI	Yes (benchmark)	Replacement for Verified (contaminated). 1,865 multi-language tasks.
lm-eval-harness	MIT	EleutherAI	No (benchmark-only)	Raw model quality, not agent-scaffolding quality.
OpenAI Evals	MIT	OpenAI	Reject (abandoned)	Historical.
Hypothesis	MPL-2.0	Community	Yes	Property-based testing for Python code agents write.

Lock-in exposure summary

The rig’s total lock-in exposure, honestly:

Vendor	Lock-in level	Criticality	Why
Anthropic (as default LLM provider)	High	Critical	LLM is the engine. LiteLLM + OTel GenAI conventions make runtime and backend swappable (see provider-portability.md); prompts are the sticky layer — migrating to OpenAI or Gemini needs per-prompt re-authoring and a re-run of the eval suite. Concrete, not unbounded.
GitHub	High	Critical	Source, CI, OIDC, artifact registry, Issues — deeply wired.
GCP	Medium	High	One VM — replaceable with any VPS vendor, but DNS/network moves cost ~1 week.
Cloudflare	Medium	Medium	DNS, tunnels, Pages — replaceable, 1-2 days of work.
Sigstore public infra	Low	Medium	Public good service. Private Rekor is the escape if the service model changes.
All CNCF-hosted tools (Flux, k3s, KEDA, Kyverno, Cilium, Flagger, cert-manager)	Very low	High	Portable; foundation-governed and actively maintained.
LiteLLM	Low	High	MIT + Portkey as fallback.
Langfuse/Phoenix	Low	Medium	OTel GenAI conventions make swap trivial.
SOPS (getsops)	Very low	Medium	MPL-2.0, CNCF governance, active maintainers; SOPS files are portable ciphertext — any decrypter reads them.
pgroll	Medium	Medium	Single-vendor bus factor. The parallel SQL trail preserves schema history.

The ones that would hurt to lose: Anthropic (prompt portability is hard) and GitHub (everything is wired there). Every other pick has a concrete escape hatch.

When any pick is re-evaluated

The whitepaper’s picks are living decisions. Trigger a re-evaluation when:

License change on a critical tool (BSL drift is the modern pattern)
Ownership change — acquisitions, foundations handing off, single maintainers disappearing
Material scale change — we grow past 5 operators, add a second cluster, serve customer traffic
Active incident — a pick contributed to an outage and the compensating controls weren’t enough
Cheaper / better alternative emerges with 2+ years of production adoption evidence

Every re-evaluation ends in one of: keep, migrate, or defer. The decision gets a timestamp and a link to this document’s updated version.
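
One way to keep those outcomes consistent is to record each re-evaluation as a small typed record. The sketch below is an assumption about shape, not an existing rig schema: the class names, field names, and validation rules are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class Outcome(Enum):
    """The three outcomes a re-evaluation may end in."""
    KEEP = "keep"
    MIGRATE = "migrate"
    DEFER = "defer"


@dataclass(frozen=True)
class ReEvaluation:
    tool: str            # the pick being re-evaluated, e.g. "pgroll"
    trigger: str         # which trigger fired, e.g. "license change"
    outcome: Outcome
    decided_at: datetime  # timestamp of the decision
    document_url: str    # link to the updated version of this document

    def __post_init__(self) -> None:
        # A naive timestamp is ambiguous once the record is read elsewhere.
        if self.decided_at.tzinfo is None:
            raise ValueError("decided_at must be timezone-aware")
        if not self.document_url:
            raise ValueError("document_url is required")
```

Because the outcome is an enum, a record cannot end in anything other than keep, migrate, or defer, and the required timestamp and link enforce the rule stated above.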

What we explicitly reject

Short list of things we have evaluated and ruled out:

Vault (not now; if Vault-class capability is ever needed, adopt OpenBao rather than Vault)
SealedSecrets (never deployed; SOPS is the chosen primary with better governance and no extra controller pod)
Full self-hosted LGTM stack on 8 GB (memory-starves)
Argo Rollouts with Flux (drift fights)
Unleash OSS (EOL)
Keptn (CNCF-archived)
Doppler / LaunchDarkly / Kong AI Gateway / TrueFoundry (SaaS-only or enterprise pricing)
Reshape (bus factor 1)
HSM-backed PGP signing (worse than keyless Sigstore)
CSI Secrets Store at single-VM scale (DaemonSet footprint)
OpenAI Evals (abandoned)
microVMs (e2b, Daytona, Firecracker) (wrong threat model)
Dev-E .NET standalone worker (dashecorp/dev-e, archived 2026-04-17 — CommandCodeExecutor shells out to claude-cli without MCP injection, stream-json parsing, or token refresh; the value lives in the CLI driver, not the outer state machine, and Node rig-agent-runtime already implements both)
