
Tool Choices — An ADR for Every Pick

  • 🟢 k3s-gcp-vm — k3s on single GCP VM (8 GB). invotek-k3s live.
  • 🟢 fluxcd-gitops — FluxCD GitOps. rig-gitops → cluster reconciliation active.
  • 🟢 keda-autoscaling — KEDA event-driven autoscaling. ScaledObject per agent.

!!! abstract "TL;DR"
    Every tool named in the whitepaper gets a defensible answer to: what problem, what alternatives, why this, license and backing, pricing, lock-in risk, migration path. The exercise changed several of the original picks after honest re-evaluation. Notably:

    - Drop Vault (overkill).
    - SOPS + age is the deployed secrets pick (corrected through three rounds, see the Secrets section; the rig was always on SOPS, and earlier retractions assumed otherwise).
    - Add Phoenix as an alternative to Langfuse (8 GB VM reality).
    - Defer feature flags (YAGNI at our scale).
    - Hedge pgroll with an inspectable SQL trail (single-vendor bus factor, correctly framed).

This document supplies the reasoning that the other whitepaper docs merely assert. Every line of the form “We use X” elsewhere has a row here that explains why X and not Y.

Every pick is evaluated against the same rubric:

| Column | What it captures |
| --- | --- |
| License | OSI-approved? Copyleft? Source-available-but-restricted? Specific license string (MIT, Apache-2.0, MPL-2.0, BSL, ELv2, AGPL, GPL, proprietary). |
| Owner / governance | Single company? Foundation? BDFL? Community-elected? |
| Pricing | Free for our use? Tier structure? Where does the pricing curve bite? |
| Bus factor | If the primary maintainer disappears, who keeps this alive? |
| Lock-in risk | If we need to leave, how bad is the migration? |
| Escape hatch | Concrete alternative we’d adopt if we had to move. |
| Re-evaluate when | The signal that tells us this pick is no longer right. |

The goal is not to minimize every axis (impossible) but to be explicit about each, so that future us, or a future maintainer, can argue with our choices from a base of evidence rather than vibes.

!!! warning "Where the honest re-evaluation changed the pick"
    - Secrets: drop Vault. SOPS + age + Flux is what’s actually deployed (verified live in `apps/*/*.sops.yaml`). External Secrets Operator + GCP Secret Manager is deferred until needed. GitHub App installation tokens are minted on-demand. OpenBao is the correct choice if and when we ever need Vault-class dynamic-secret capability, which is not now. Earlier drafts claimed SealedSecrets was our current state; that was wrong (never deployed). The third-order correction is recorded in the retraction log.
    - LLM observability: add Phoenix. Langfuse v3 wants 16 GB RAM min and a separate ClickHouse cluster. On our 8 GB VM, Phoenix (ELv2, OTel-native, SQLite/Postgres, no ClickHouse) is the honest self-host pick.
    - Feature flags: defer. flagd + OpenFeature is defensible eventually, but for 1-2 humans and few services with no A/B testing need, env-vars-via-Kustomize is sufficient. Adopt a flag system when there’s a concrete targeting / experimentation requirement.
    - Unleash: explicitly reject. The OSS edition was deprecated and reached EOL on 2025-12-31. It was previously a reasonable alternative; it no longer is.
    - Instrument with OpenTelemetry GenAI semantic conventions, not vendor SDKs. This is the single highest-leverage lock-in defense in the stack.

This is the section users have called out most often. The original whitepaper promised Vault for short-lived credentials; honest re-evaluation says we don’t need it.

They solve overlapping but distinct problems:

| Dimension | SOPS + age (or SealedSecrets) | Vault / OpenBao |
| --- | --- | --- |
| What it encrypts | Files at rest in git | Secret values fetched at runtime |
| Dynamic secrets | No | Yes: mint short-lived DB users, cloud creds, GitHub App tokens |
| Ops footprint | Zero runtime service | 3+ node HA cluster, unsealing, upgrades |
| Reviewable in PRs | Yes (encrypted blob diffs cleanly) | No (secrets never in git) |
| Revoke on compromise | Git commit + rotate everywhere | One API call, cluster-wide |
| Audit log | Git history | Vault audit log |
| Disaster recovery | Git repo + decryption key | Vault snapshot + unseal keys |

For a high-traffic production system serving paying customers: Vault (or OpenBao) wins clearly — dynamic secrets + centralized revocation + audit log are irreplaceable.

For a 1-2 person rig on one 8 GB VM: SOPS-style encryption + ESO shim + cloud-KMS-backed secret manager is simpler, cheaper, and covers the real threat model.

!!! danger "Retracted (third-order correction), 2026-04-17: we were never on SealedSecrets"
    Two previous retractions in this ADR (log below) framed a migration from SealedSecrets to SOPS. That framing was wrong about the ground-truth deployed state. Verified by grepping the repo: zero `kind: SealedSecret` references, zero `sealed-secrets-controller` HelmRelease, zero bitnami-labs image pulls. Every secret in the rig (Dev-E, Review-E, rig-conductor, Cloudflared) is already SOPS-encrypted (`*.sops.yaml` files in `apps/`). SOPS + age + Flux was always the deployed pick; there was never any SealedSecrets to migrate from.

    The earlier narratives (*"SealedSecrets keep"*, then *"SealedSecrets legacy migrating"*) were built on an earlier research-agent summary that asserted SealedSecrets was our current deployment. I accepted that without running `grep -r SealedSecret apps/`. I ran that grep today and it returned nothing. The Broadcom-paywall concern is real for anyone using SealedSecrets but was theoretical for us: we avoided the risk by already being on SOPS, not by deliberately migrating off. The "GHCR hedge" I proposed is unnecessary because we don't pull the image at all.

    **Meta-lesson added to the fresh-start evaluation log** (below): *verify ground-truth deployed state, not research-agent summaries, before writing retraction narratives.* A 10-second grep would have prevented two rounds of wrong framing.

Current pick (verified live in apps/ as of 2026-04-17):

SOPS + age + Flux kustomize-controller (deployed primary, has been all along)
+ .sops.yaml at repo root with creation_rules covering apps/*/*.sops.yaml
+ Cluster-scoped age key in flux-system/sops-age Secret
+ Per-app encrypted manifests: apps/dev-e/dev-e-secrets.sops.yaml,
apps/review-e/review-e-secrets.sops.yaml,
apps/rig-conductor/rig-conductor-secrets.sops.yaml,
apps/cloudflared/tunnel-token.sops.yaml
+ Each kustomization sets decryption.provider: sops + secretRef.name: sops-age
+ GitHub App installation tokens minted on-demand at pod startup (1h TTL)
+ Static narrow-grant Postgres service accounts
+ External Secrets Operator deferred (not yet needed — git-at-rest scales to our inventory)
+ Vault / OpenBao deferred (no dynamic-secret requirement yet)

See docs/sops.md for the operational reference (how to bootstrap an encrypted secret, rotation procedure, key management).

| Tool | License | Owner | Our pick? | Why / why not |
| --- | --- | --- | --- | --- |
| HashiCorp Vault | BSL 1.1 | IBM (acq. Feb 2025) | No | BSL is tolerable (we’re non-competing) but HCP Vault Secrets EOL July 2026, IBM pricing plays, velocity concerns. Operationally expensive (3+ node HA, unsealing). Dynamic secrets are genuinely excellent; we just don’t need them yet. |
| OpenBao | MPL-2.0 | Linux Foundation | Deferred | The correct answer if we ever need Vault-class capability. API-compatible with Vault; ESO works unchanged. Same ops burden as Vault. Adopt when we have a concrete unmet need for dynamic secrets. |
| SOPS + age | MPL-2.0 | CNCF (getsops org) | Yes (deployed primary) | The actual deployed pattern. Verified live in `apps/*/*.sops.yaml`: all four active app namespaces use it. Flux decrypts inline via kustomize-controller `--decryption-provider=sops` (no additional controller); age keys are simpler than GPG. MPL-2.0 forever, CNCF governance. |
| SealedSecrets | Apache-2.0 | bitnami-labs (Broadcom-owned) | Not in use | Not deployed. Not used. Not a migration target. Broadcom’s Bitnami catalog paywall (verified real: `bitnami/postgresql:17.5.0` returns 404, same namespace as the sealed-secrets-controller image) is a risk for shops that use it; we avoided it by default, not by design. Leaving the row here for the ADR audit trail. |
| External Secrets Operator | Apache-2.0 | CNCF (incubating) | Yes (add) | The reversibility insurance. Backend-agnostic: swap GCP SM → OpenBao → Infisical by changing a CRD, workloads untouched. |
| GCP Secret Manager | Proprietary (GCP) | Google | Yes (add) | We’re already on GCP. Free tier covers our inventory. No dynamic secrets, but it doesn’t need them. Access via ESO = low lock-in. |
| Infisical | MIT (core) + SaaS | Infisical Inc. (YC) | No (for now) | Strong middle-ground between Bitwarden and Vault. Reasonable alternative if we outgrow GCP SM before we need Vault. |
| Doppler | Proprietary SaaS | Doppler Inc. | Reject | SaaS-only; no self-host. Makes Doppler outage = our deploy outage. Strongest lock-in on the list. |
| 1Password Connect | Proprietary | 1Password | Partial | We already use Bitwarden for human vault. 1Password Connect is fine if we switched, but no reason to. |
| CSI Secrets Store | Apache-2.0 | Kubernetes | No | DaemonSet footprint too heavy on single 8 GB VM. Right choice for regulated workloads avoiding etcd. |
| cert-manager + trust-manager | Apache-2.0 | CNCF (graduated) | Yes (add) | Table stakes. Non-controversial. |

Re-evaluate when:

  • We add a second K8s cluster or second Postgres instance (static narrow grants stop scaling)
  • We take a compliance requirement that mandates audit log on secret access
  • A secret actually gets compromised (rotation scope pain becomes real)
  • Our team grows past ~5 operators (human secret-handling becomes the bottleneck)
  • The getsops.io project stalls or is archived (then fork or migrate to an alternative; currently healthy)

Retraction log — secrets picks (three rounds)

Honest disclosure of where the first, second, and third drafts of this ADR got it wrong about secrets.

| Round | What the draft said | What changed | Why it was wrong |
| --- | --- | --- | --- |
| 1 | “SealedSecrets: Yes (keep) + ESO + GCP SM + GitHub App tokens. Governance risk post-Broadcom/Bitnami is real but no migration pressure yet.” | Promoted SealedSecrets to the declared primary; treated SOPS as redundant. | Accepted a research-agent summary that claimed SealedSecrets was our deployed state. Never grep-verified. Built a whole defense around an incorrect premise. |
| 2 | “SOPS + age is now primary. SealedSecrets is Legacy (migrating). Interim hedge: switch image source to `ghcr.io/bitnami-labs/sealed-secrets-controller`.” | SOPS promoted, SealedSecrets relabeled legacy-migrating, elaborate Broadcom-paywall hedge proposed. | Still wrong about ground truth. Corrected the right-and-wrong framing of the tools, but kept asserting SealedSecrets was our deployment. The Broadcom paywall research was real but the “migration path” was literature for a migration that didn’t need to happen. |
| 3 (this entry) | “SOPS + age + Flux was always the deployed pick. There was never any SealedSecrets. Earlier retractions were based on an unverified premise.” | Corrected: zero SealedSecrets in the repo (`grep -r SealedSecret apps/` returns nothing). `.sops.yaml` at repo root covers all apps. Every deployed app uses `*.sops.yaml` with `decryption.provider: sops`. | The meta-lesson: verify ground-truth deployed state, not research-agent summaries, before writing retractions. A 10-second `grep -r` would have prevented two rounds of wrong framing. The second-order lesson (also named in my Fresh-start log’s “three patterns”) was “hedge narratives need re-verification.” This round adds: “assertions about current deployed state need re-verification too,” which is the stronger version of the same principle. |

Round 1’s incumbent-bias problem (documented in the Round 1 retraction above) was a real pattern: I skipped the fresh-start test on secrets while applying it elsewhere. Round 2’s correction — “SOPS wins on governance, operational cost, license permanence” — was the right conclusion reached via the right reasoning. What was wrong in Round 2 was the framing (claimed migration from incumbent) not the verdict (SOPS over SealedSecrets). Round 3 leaves the verdict intact and fixes only the factually-inaccurate framing.

Add to the fresh-start evaluation meta-rules (below): before asserting what’s deployed, run grep against the repo. Research-agent summaries can be wrong or stale. I had the tools to verify in the first round and didn’t use them. Don’t skip verification of ground truth; it’s cheaper than two rounds of retraction.
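The grep check described above can also be scripted so that it runs before any retraction is written. What follows is a minimal sketch, assuming manifests are `*.yaml` files under the scanned root; the function name `find_kind` is hypothetical and not part of the rig.

```python
from pathlib import Path
from typing import List


def find_kind(root: str, kind: str) -> List[str]:
    """Return paths of YAML files under `root` that declare `kind: <kind>`."""
    needle = f"kind: {kind}"
    hits = []
    for path in sorted(Path(root).rglob("*.yaml")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        # Match whole lines only, so "kind: SealedSecretList" or a comment
        # mentioning the kind mid-line does not count as a deployment.
        if any(line.strip() == needle for line in text.splitlines()):
            hits.append(str(path))
    return hits
```

Called as `find_kind("apps", "SealedSecret")`, an empty list corresponds to the "grep returned nothing" result recorded in Round 3. A non-empty list names the files that would contradict it.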

Honest application of the fresh-start test — “if we were picking this from scratch today with no prior context, what would we pick?” — to every tool category in this ADR. This log replaces the earlier “Broader incumbent-bias check” which was a checklist, not an evaluation. Two picks were verified in depth this round (LiteLLM and pgroll); the rest were audited against 2026-current alternatives.

| Category | Current pick | Fresh-start (2026) verdict | Rigor of this round |
| --- | --- | --- | --- |
| Secrets (git-at-rest) | SOPS + age (deployed; always was) | Corrected: SOPS is the deployed pick; earlier retractions #1 and #2 framed a migration that didn’t need to happen because SealedSecrets was never deployed | Deep (three rounds of retraction; see log) |
| Policy engine | Kyverno | Keep: unchanged | Shallow |
| Supply-chain signing | Sigstore (cosign, gitsign, rekor, slsa-github-generator) | Keep: no credible OSS competitor | Shallow |
| Networking / L7 egress | Cilium | Keep: no equivalent at L7 in OSS | Shallow |
| Metrics / logs / traces | Prometheus (local) + Grafana Cloud Free (managed) + OTel Collector | Keep: name VictoriaMetrics as lighter-weight alternative for multi-node future | Medium |
| LLM observability | Phoenix (8 GB) / Langfuse (16 GB+) | Keep: already retracted in earlier PR | Medium |
| LLM gateway | LiteLLM | Keep (verified). Portkey’s “fully OSS” March 2026 announcement kept per-key budget enforcement Enterprise-only; original pick rationale holds | Deep (verified this PR) |
| Progressive delivery | Flagger | Keep: Flux-native, Argo Rollouts fights Flux | Shallow |
| Feature flags | Deferred (flagd when needed) | Keep: no change; PostHog named as bundled alternative | Shallow |
| DB migration safety | pgroll | Keep (verified). Atlas has closed most gaps but does not implement expand/contract multi-version schemas; pgroll’s core differentiator stands | Deep (verified this PR) + hedge-narrative fixed |
| Supply chain (deps) | Dependabot + Socket.dev + Syft+Grype + package-age policy | Keep: name trivy as Grype alternative | Shallow |
| Container + CI | GitHub Actions + GHCR | Keep: incumbent-and-defensible (SCM + CI + registry bundle dominates) | Shallow |
| GitOps | Flux | Keep: incumbent-and-defensible (Flagger picks Flux-native; switching cascades) | Shallow |
| Cluster runtime | k3s | Keep: name Talos Linux as multi-node future consideration | Medium |
| Event-driven autoscale | KEDA | Keep: no credible competitor | Shallow |
| Cloud compute | GCP Compute | Keep: incumbent-and-defensible (Workload Identity + DNS already wired) | Shallow |
| Human vault | Bitwarden | Keep: name Vaultwarden (Rust self-host port) for future | Shallow |
| Docs site | MkDocs Material | Keep: Docusaurus/VitePress/Astro Starlight are reasonable if we want more customization | Shallow |
| Evaluation harness | Inspect AI (candidate) | Already flagged candidate; validate in Era 2 | Already done |

Portkey Gateway went fully open source March 2026 (Apache-2.0, 1T+ tokens/day). Original LiteLLM pick reason was “only OSS option with per-virtual-key budget enforcement.” Re-verified against Portkey’s 2026 documentation:

The 2026 “fully OSS” announcement was a governance + observability + MCP-registry open-sourcing, not a cost-controls open-sourcing. The original LiteLLM differentiator (per-virtual-key budget envelopes with duration windows returning 429 on exceed, free) still holds.
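To make the differentiator concrete, here is a minimal sketch of what a per-key budget envelope with a duration window does. It illustrates the concept only and is not LiteLLM's implementation; the class name, fields, and the reject-before-spend rule are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetEnvelope:
    """Spend cap for one virtual key that resets every `window_seconds`."""
    max_budget_usd: float
    window_seconds: float
    spent_usd: float = 0.0
    window_start: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float, now: Optional[float] = None) -> int:
        """Return an HTTP-style status: 200 if accepted, 429 if the cap would be exceeded."""
        now = time.monotonic() if now is None else now
        if now - self.window_start >= self.window_seconds:
            # The duration window elapsed: start a fresh envelope.
            self.window_start = now
            self.spent_usd = 0.0
        if self.spent_usd + cost_usd > self.max_budget_usd:
            return 429
        self.spent_usd += cost_usd
        return 200
```

A gateway that enforces this per key can stop one runaway agent without touching the others. This is the behaviour the verdict above depends on.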

LiteLLM’s known bugs (#12905, #10750, #12977, #25386) don’t touch our specific config pattern (we have ~5 explicitly-configured keys, no team-scoped nesting, no pass-through routes, no AzureOpenAI direct client, no auto-created end users). Verdict: stay on LiteLLM. Revisit if (a) Portkey moves budget-limits to OSS, or (b) we scale past ~500 RPS where LiteLLM’s documented memory issues at 2k RPS start to bite.

Verified deep this round: pgroll (stays) + hedge narrative corrected

Atlas (Ariga) has shipped rapidly in 2025–2026 — v1.2.0 on 2026-04-10, Kubernetes operator (Apache-2.0 with some EULA image layers), 50+ migration safety analyzers, weekly-to-biweekly release cadence. Feature gap against pgroll narrowed significantly. But Atlas does NOT implement real expand/contract with multi-version schema views + triggered backfill — it lints for unsafe DDL, emits concurrent-index DDL, and rolls out carefully, but it executes a single migration against a single schema.

For our specific workload (one Postgres, ~10–30 tables, expand/contract required for zero-downtime), pgroll is still the only tool that keeps v1 and v2 of a table simultaneously queryable. Verdict: stay on pgroll. Revisit if Xata misses another release quarter (no v0.17 by end of Q3 2026), announces a shutdown/acquisition, or Atlas ships native expand/contract.

Corrected the hedge narrative: earlier drafts implied pgroll migrations are plain SQL. They are not; they are pgroll-specific operation YAML. The correct hedge is to commit generated SQL alongside each operation YAML (via pgroll SQL emission) so schema history stays reconstructible. See the pgroll section below for the corrected wording.

Shallow-audited: what “fresh-start keep” actually means

For the shallow-audited picks (Kyverno, Sigstore, Cilium, Flagger, k3s, KEDA, Dependabot/Socket, GitHub/GHCR/Flux, Bitwarden, MkDocs), “fresh-start keep” means: I considered the current 2026 alternatives to each and none clearly beat the incumbent for our scale on license, governance, operational cost, and feature coverage. They are the picks I would make today if starting from scratch.

A stronger level of rigor would be individual per-category research agents (as I did for LiteLLM and pgroll). That’s worth doing when a specific concern surfaces (as with the Portkey announcement, Xata’s release cadence, and Broadcom-Bitnami). Applying it to every pick every month is over-engineering.

Four patterns emerged from the SOPS (three rounds), LiteLLM, and pgroll deep audits:

  1. Announcements lie about feature scope. Portkey’s March 2026 “fully OSS” announcement was marketing; the feature we care about stayed paywalled. Always verify against current docs, not the press release.
  2. Release cadence is a signal. pgroll’s decelerating release cadence (v0.16.1 in February, nothing since) is consistent with Xata keeping pgroll in maintenance mode as an internal-product-first tool. Not alarming on its own, but worth tracking.
  3. Hedge narratives need re-verification. The “keep migrations as plain SQL” hedge in the earlier pgroll writeup turned out to be wrong — pgroll operation files are YAML. When we write a hedge, we should confirm it’s actually realisable, not just aspirational.
  4. Ground-truth deployed state needs re-verification too. The SealedSecrets retraction had to happen three times because the first two rounds accepted a research-agent summary about “what’s currently deployed” instead of grep-verifying the repo. A 10-second grep (grep -r SealedSecret apps/) would have prevented it. Stronger version of pattern (3): “before asserting what’s deployed, verify.”

Categories that warrant re-examination eventually

Not actionable today, flagged for future attention:

  • Bitwarden — picked because humans already use it. 1Password has better team-grant ergonomics; Vaultwarden is an unofficial Rust self-host port if we want more control; Infisical covers human+automation in one product (at the cost of a YC-company dependency). Re-evaluate if team grows past 3 operators or if we start needing per-project secret segregation.
  • MkDocs Material — Python-docs gold standard today, but Docusaurus (Meta-backed, React), VitePress (Vue/Vite), and Astro Starlight (Astro) are reasonable alternatives with better customization. Low priority — the docs site works.
  • k3s — ideal for single-VM. If we go multi-node for any reason, Talos Linux (immutable, API-only, no SSH, no shell) is a stronger security baseline. Not a k3s replacement — runs K8s, including k3s — but changes the host OS story.

When an ADR row reads “already deployed — keep” without a license/governance/operational comparison against the best fresh-start alternative, that’s a flag for re-examination. Path dependence is a cost, not a reason. Every pick in this ADR has now had the fresh-start test applied at least shallowly; two picks got deep verification this round; the retraction log above grows whenever a pick turns out to have been defended on sunk-cost reasoning.

Next scheduled re-audit: monthly for deep-picks (LiteLLM, pgroll, SOPS health at getsops.io, Langfuse/Phoenix VM sizing). Quarterly for shallow-picks. Immediate whenever a tool’s governance / license / owner changes (Broadcom-Bitnami style events). Always verify ground-truth deployed state with a grep before framing a retraction.

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Kyverno | Apache-2.0 | CNCF (incubating) | Yes | YAML CRDs (no Rego language). Native Sigstore verification first-class. Reports in OpenReports format. Operational cost is lower for a 2-person team. |
| OPA Gatekeeper | Apache-2.0 | CNCF (graduated) | No | Rego language requires learning + maintenance. Cosign verification exists but not as polished. Better for large orgs that already run OPA for non-K8s policy. |
| jsPolicy | Apache-2.0 | Loft Labs | No | Single-vendor. JavaScript-based policies. Niche. |
| OpenPolicyAgent (OPA) core | Apache-2.0 | CNCF | No | Lower-level; Gatekeeper is the K8s-admission wrapper. |

Re-evaluate when:

  • We need to write non-K8s policies (API gateway, CI/CD gates), where OPA’s broader reach becomes attractive
  • We have a Rego-fluent engineer — learning cost drops
  • Kyverno governance shifts unfavorably (currently healthy as CNCF incubating)

All picks are Sigstore ecosystem; there’s no real competitor in 2026 open-source territory.

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Cosign | Apache-2.0 | Sigstore (Linux Foundation) | Yes | Industry default. Keyless via Fulcio. |
| Gitsign | Apache-2.0 | Sigstore | Yes | Agent commit signing with ephemeral Fulcio certs. Known gotcha: GitHub UI doesn’t display “Verified”; the workaround is a CI-side `gitsign verify` check. |
| Rekor | Apache-2.0 | Sigstore (public instance) | Yes (public) | We use the public Rekor. Private Rekor is possible but over-engineered. |
| slsa-github-generator | Apache-2.0 | SLSA framework | Yes | Isolated-builder reusable workflow produces SLSA v1.0 L3 provenance. ~5 lines in any GitHub Actions file. |
| Notary v2 / ORAS | Apache-2.0 | CNCF | No | OCI-artifact-focused; Sigstore covers our image case more simply. |
| Docker Content Trust | Proprietary-ish | Docker | No | Deprecated in favor of Sigstore. |
| HSM-backed PGP | varies | | Reject | Long-lived keys to rotate. Worse threat model than Sigstore for our case. |

Re-evaluate when:

  • Sigstore public goods service (Fulcio / Rekor free tier) changes its commitment
  • Compliance requires private transparency log (run private Rekor)
  • Multi-tenant signing needs emerge (HSM delegation tooling)

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Cilium (L7 via CNPs) | Apache-2.0 | CNCF (graduated) | Yes | eBPF CNI with L7 HTTP/DNS filtering. The single biggest ROI defense against prompt-injection exfiltration. Gets 80% of service-mesh value at 10% of the cost. |
| Istio | Apache-2.0 | CNCF (graduated) | No | Full service mesh (mTLS, traffic mgmt, etc.). Overkill for our traffic volume + single-cluster setup. Revisit if we go multi-cluster or need mTLS to external services. |
| Linkerd | Apache-2.0 | CNCF (graduated) | No | Simpler than Istio; still more than we need. Good alternative if Istio’s complexity is the only objection. |
| Calico (OSS) | Apache-2.0 | Tigera | No | Solid CNI but L7 filtering requires Calico Enterprise (commercial). Cilium’s OSS L7 wins. |
| Native NetworkPolicy only | | Kubernetes | No | L3/L4 only. Cannot filter by FQDN or HTTP method. Insufficient for our egress-allowlist goal. |

Re-evaluate when:

  • Multi-cluster deployment emerges and a service mesh becomes more compelling
  • Cilium L7 Envoy proxy memory overhead becomes the bottleneck on our VM
  • External mTLS requirement (e.g., to a customer-facing API) — then Istio or Linkerd

Split picks: local for SLO-decisive data, managed for everything else.

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Prometheus (local) | Apache-2.0 | CNCF (graduated) | Yes | Source of truth for Flagger canary analysis. Must be local so SLO gates work when external egress blips. ~1 GB RAM. |
| Grafana Cloud Free | Proprietary SaaS | Grafana Labs | Yes | 10k series metrics, 50 GB logs, 50 GB traces, 14-day retention. Fits a 1-2 person rig. Predictable paid scale. |
| Mimir / Thanos / VictoriaMetrics | Apache-2.0 | Grafana Labs / Other | No | Large-scale Prometheus backends. Overkill; Grafana Cloud Free covers us. |
| Datadog / New Relic | Proprietary | | Reject | Vendor lock + pricing curves bite at scale. |
| Self-hosted LGTM stack | Apache-2.0 | Grafana Labs | No | Would memory-starve our 8 GB VM. Hybrid with Grafana Cloud is the right answer. |
| OpenTelemetry Collector | Apache-2.0 | CNCF (incubating) | Yes | One exporter that forwards to both Prometheus (local) and Grafana Cloud (managed). Standard plumbing. |

Re-evaluate when:

  • Grafana Cloud Free limits bite (10k series isn’t enough)
  • Cost-visible scaling past ~$50/mo on Grafana Cloud makes self-hosted LGTM attractive on a bigger VM
  • Regulatory requirement forces log residency — self-host becomes mandatory

| Tool | License | Owner | Pricing | Our pick? | Why |
| --- | --- | --- | --- | --- | --- |
| Langfuse (self-host) | MIT core + EE license-key for a few features | Langfuse GmbH (YC) | Free | Conditional | Official min 4 CPU / 16 GB RAM for app alone, plus ClickHouse cluster. Too heavy for 8 GB VM. Pick only if we size up. |
| Arize Phoenix (self-host) | ELv2 (source-available, non-OSI) | Arize AI | Free | Yes (for our scale) | OTel-native. SQLite or Postgres, no ClickHouse. Runs fine on our VM. ELv2 is a non-concern for internal self-host (restricts SaaS resale, which we don’t do). |
| Langfuse Cloud Hobby | N/A (SaaS) | Langfuse GmbH | Free 50k units/mo | Backup | 50k billable units sounds like a lot but complex agent traces = 15–20 units each. Hits cap fast; hard-stop at cap. |
| Helicone (self-host) | Apache-2.0 | Helicone Inc. | Free self-host, 10k req/mo SaaS free | Alternative | Gateway + observability combined. Reasonable plan B if Phoenix unsuitable. |
| LangSmith | Proprietary SaaS | LangChain | Paid tiers | Reject | SaaS-only, paid-gated features, LangChain-native assumptions. |
| Braintrust | Proprietary SaaS | Braintrust | Free → $249/mo | Reject for cost | Strong on prompt regression but pricing bites for our scale. |
| Arize AI enterprise | Proprietary | Arize | Contact sales | Reject | Phoenix OSS covers us. |
| W&B Traces | Proprietary SaaS | CoreWeave | Paid | Reject | Broader ML observability overkill. |
| Cloudflare AI Gateway | Proprietary SaaS | Cloudflare | Free passive analytics | Yes (secondary) | Free since we’re already on Cloudflare. Passive cost tracking. Not sufficient as primary: no virtual-key budgets. |

Langfuse is the de facto OSS LLM observability tool in 2026 and its feature set (prompt versioning, evaluations, dataset management, team workflows) is the richest. But its resource floor is genuine: v3 officially requires 4 CPU / 16 GB for the app, plus a ClickHouse cluster (3 nodes × 2 cores × 8 GB recommended). On our 8 GB single-VM k3s, it will boot and demo but will not survive sustained load.
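The sizing argument is simple arithmetic, worked through here using only figures quoted in this section. The helper names are illustrative. The second helper applies the same treatment to the Langfuse Cloud Hobby cap from the table.

```python
# Figures quoted in this section; nothing here is measured.
LANGFUSE_APP_RAM_GB = 16          # official minimum for the app
CLICKHOUSE_NODES = 3              # recommended cluster size
CLICKHOUSE_RAM_PER_NODE_GB = 8
VM_RAM_GB = 8                     # our single k3s VM


def langfuse_ram_floor_gb() -> int:
    """App minimum plus the recommended ClickHouse cluster."""
    return LANGFUSE_APP_RAM_GB + CLICKHOUSE_NODES * CLICKHOUSE_RAM_PER_NODE_GB


def traces_per_month(unit_cap: int, units_per_trace: int) -> int:
    """How many traces fit under a monthly billable-unit cap."""
    return unit_cap // units_per_trace


print(langfuse_ram_floor_gb())            # 40 (GB), versus 8 GB available
print(traces_per_month(50_000, 20))       # 2500
print(traces_per_month(50_000, 15))       # 3333
```

The recommended footprint is five times the RAM of the whole VM. On the Hobby tier, complex agent traces run out at roughly 2,500 to 3,300 per month.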

Phoenix is the honest pick at our scale. Lighter infra, OTel-native (so traces are portable), solid eval story. We lose Langfuse’s prompt-management UI and team features — acceptable for a 1-2 person team.

The lock-in defense that trumps both: instrument our code with OpenTelemetry GenAI semantic conventions, not Langfuse/Phoenix SDKs. Both platforms accept OTLP. The choice of observability backend becomes swappable.
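As a rough illustration of what "instrument with the conventions" means in practice, the sketch below builds span attributes under `gen_ai.*` keys instead of a vendor SDK's field names. The helper `genai_span_attributes` is hypothetical. The attribute keys follow the OpenTelemetry GenAI semantic conventions as I understand them, but those conventions are still evolving, so check the names against the spec version in use.

```python
from typing import Dict, Union

AttrValue = Union[str, int]


def genai_span_attributes(model: str, input_tokens: int, output_tokens: int,
                          operation: str = "chat") -> Dict[str, AttrValue]:
    """Vendor-neutral attributes for one LLM call, keyed by gen_ai.* names."""
    return {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }


# With the OpenTelemetry SDK installed, each pair would be applied to the
# active span via span.set_attribute(key, value) and exported over OTLP.
attrs = genai_span_attributes("example-model", input_tokens=1200, output_tokens=300)
```

Because nothing in the attribute set names Langfuse or Phoenix, switching backends changes the OTLP endpoint and leaves the instrumentation alone.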

Re-evaluate when:

  • We scale the VM to 16 GB+ (Langfuse becomes viable)
  • Team grows and prompt-versioning / collaborative eval becomes load-bearing
  • Phoenix ELv2 changes (unlikely but license watch)

| Tool | License | Owner | Pricing | Our pick? | Why |
| --- | --- | --- | --- | --- | --- |
| LiteLLM | MIT (core) + commercial Enterprise | BerriAI (YC W23) | Free | Yes | Only OSS option with proper per-virtual-key budget enforcement + duration resets. OpenAI-format passthrough. Anthropic provider first-class. |
| Portkey Gateway | Apache-2.0 (fully OSS since March 2026) | Portkey | Free self-host; $9 per 100k logs managed | Documented fallback | Fully OSS escape hatch if LiteLLM stumbles. Processing 1T+ tokens/day across users. |
| Cloudflare AI Gateway | Proprietary SaaS | Cloudflare | Free | Secondary | Passive observability already in our stack. No virtual-key budgets, so not sufficient as primary. |
| OpenRouter | Proprietary SaaS | OpenRouter | 5.5% markup | No | Adds hop, no self-host, no per-key budgets like LiteLLM. |
| Kong AI Gateway | Proprietary (Enterprise plugin) | Kong | Enterprise contract | Reject | Enterprise pricing not justified. |
| TrueFoundry | Proprietary SaaS | TrueFoundry | Paid | Reject | Platform-level opinion. |

LiteLLM is a single point of failure: if it’s down, all agents block.

Mitigations:

  1. Run ≥2 replicas behind a k3s Service with Postgres + Redis shared state.
  2. Client-side fallback to direct api.anthropic.com after N seconds of 5xx from proxy — but this bypasses budget enforcement by design; document as acceptable degraded mode.
  3. Monitor proxy health as a first-class SLO in Prometheus.
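Mitigation 2 can be sketched as a small client-side wrapper. This is a minimal sketch under stated assumptions: `ProxyError` stands in for a 5xx from the proxy, and `via_proxy` and `direct` are hypothetical callables supplied by the caller (no real LiteLLM or Anthropic client API is shown).

```python
import time
from typing import Callable, Optional


class ProxyError(Exception):
    """Stands in for a 5xx response from the gateway proxy."""


class ProxyFallback:
    """Route through the proxy; call the provider directly only after a sustained outage."""

    def __init__(self, via_proxy: Callable[[str], str], direct: Callable[[str], str],
                 grace_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._via_proxy = via_proxy
        self._direct = direct
        self._grace = grace_seconds
        self._clock = clock
        self._failing_since: Optional[float] = None

    def call(self, prompt: str) -> str:
        try:
            result = self._via_proxy(prompt)
        except ProxyError:
            now = self._clock()
            if self._failing_since is None:
                self._failing_since = now  # first failure starts the outage timer
            if now - self._failing_since >= self._grace:
                # Degraded mode: this path bypasses budget enforcement by design.
                return self._direct(prompt)
            raise
        self._failing_since = None  # proxy recovered; reset the timer
        return result
```

Failures inside the grace period are re-raised so the caller's normal retry logic still applies. Only an outage that outlasts `grace_seconds` triggers the unbudgeted direct path.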
  • License: MIT intact. No BSL drift as of Q2 2026.
  • Owner: BerriAI, YC W23, publicly reported ~$2.1M seed. No Series A disclosed. Venture-stage risk is real.
  • 12–24 month watch: (a) BSL-style moves on enterprise features only would be fine for us (we’re on OSS), (b) aggressive monetization could lock some OSS features behind keys, (c) a quiet under-maintenance period is the likeliest failure mode.

Re-evaluate when:

  • LiteLLM license changes or project health deteriorates
  • Our traffic exceeds what LiteLLM’s Postgres-backed rate limiter can handle (~10k RPS)
  • Portkey Gateway momentum surpasses LiteLLM’s

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Flagger | Apache-2.0 | CNCF (graduated via Flux) | Yes | FluxCD-native. Owns its own Canary CRD that shadows the Deployment, so no field-level fights with Flux. Webhooks at every phase for rig-conductor integration. ~100 MB controller footprint. |
| Argo Rollouts | Apache-2.0 | CNCF (graduated via Argo) | No | Mutates fields Flux also reconciles, causing recurring drift fights. Pair with ArgoCD, not Flux. |
| Keptn | | CNCF archived 2025-09-03 | Reject (dead) | Dynatrace team pulled back. Do not adopt. |
| OpenKruise Rollout | Apache-2.0 | OpenKruise (CNCF sandbox) | No | Mostly Alibaba ecosystem. Right only if we need StatefulSet canary. |

Re-evaluate when:

  • We migrate from Flux to ArgoCD (Argo Rollouts becomes natural)
  • Flagger project health deteriorates (currently active)

!!! danger "Honest YAGNI"
    Feature flags at our scale (1-2 humans, few services, no A/B testing need) are overkill today. Env vars + Kustomize overlays per environment cover the actual use case (compile/deploy-time toggles) at zero operational cost. We should adopt a flag system when there’s a concrete targeting, experimentation, or kill-switch need, not before.

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| env vars + Kustomize overlays | | | Yes (now) | Zero ops cost. Covers 100% of actual current need. |
| OpenFeature + flagd | Apache-2.0 | CNCF (incubating) | Deferred | Right pick when we need runtime toggles. Sidecar ~30-60 MB. JSON flag config. |
| Flipt | GPL-3.0 (server) + MIT (clients) | flipt-io | Alternative | GitOps-native YAML flags. Single Go binary. GPL server is sticky but fine for internal use. |
| GrowthBook | MIT | GrowthBook | Alternative | If we need A/B experimentation with stats out of the box. OpenFeature SDK. |
| PostHog feature flags | MIT (self-host) + SaaS | PostHog | Consider if we adopt PostHog | Bundled with analytics. Zero marginal cost if already using PostHog. |
| Unleash | Apache-2.0 core (EOL 2025-12-31) | Unleash | Reject (dying OSS) | Enterprise-only going forward. Avoid for new adoption. |
| LaunchDarkly | Proprietary SaaS | LaunchDarkly | Reject | $12/seat/mo + MAU overages. Overkill by an order of magnitude. |
| Statsig | Proprietary SaaS | Statsig | Reject for lock-in | Generous free tier (1M MTUs) but SaaS-only. |
| ConfigCat | Proprietary SaaS | ConfigCat | Alternative | Free tier forever, simple, Hungarian SaaS. If we want SaaS and not LaunchDarkly. |

Re-evaluate when:

  • We need per-user or per-tenant targeting that env vars can’t express
  • A/B experimentation with real statistical significance becomes a product need
  • We adopt PostHog for analytics (flags come bundled)
  • Any T1 incident where a kill switch faster than kubectl rollout undo would have saved us

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| pgroll | Apache-2.0 | Xata | Yes | Automates expand/contract safely for Postgres. It is the only tool in this category that keeps v1 and v2 of a schema simultaneously queryable via views, with triggered backfill. Atlas does not implement this; it lints for unsafe DDL and rolls out carefully but executes a single migration against a single schema. Moderate single-vendor bus factor (Xata is ~27 employees, still operating; pivoted mid-2025 to serverless Postgres with Simplyblock). Release cadence decelerated: v0.16.1 last released 2026-02-17. Verified April 2026. |
| Atlas (Community Edition) | Apache-2.0 + EULA on official binaries | Ariga | Alternative / hedge | Declarative schema-as-code + linting. Source Apache-2.0; official binaries under Atlas EULA. Build from source if EULA matters. |
| Flyway Community | Apache-2.0 core (Redgate-owned) | Redgate | Alternative | Classic versioned SQL migrations. Not zero-downtime-automated. License creep concern (Redgate moving features out of OSS). |
| gh-ost | MIT | GitHub | Irrelevant | MySQL only. We’re on Postgres. |
| Reshape | Apache-2.0 | fabianlindfors | Reject (bus factor 1) | Single-author, author’s focus shifted. Don’t adopt for production. |
| Bytebase | Apache-2.0 (5-source limit) | Bytebase | No | UI-heavy workflow tool. Overkill for 1-2 person rig. |

!!! warning "Corrected: pgroll files are YAML, not SQL"
    An earlier draft claimed we could “keep migrations inspectable SQL… runnable by plain psql.” That’s wrong: pgroll migration files are pgroll-specific operation YAML (e.g., `add_column`, `drop_column`, `set_not_null`), not raw SQL.

    The actual hedge: keep a parallel SQL trail. For every pgroll operation that runs, commit the generated SQL (`pgroll migrate --dry-run --json | pgroll generate-sql`) alongside the operation YAML. If Xata folds, the SQL trail lets us reconstruct schema state; we then pick up with plain Flyway or Atlas going forward. This does not make individual operations portable; it keeps the history reconstructible.

Re-evaluate when:

  • Xata pivots or folds
  • A migration we need is outside pgroll’s expand/contract model (e.g., type changes with data loss implications)
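
The parallel SQL trail only works as a hedge if it stays complete, so a CI check along the following lines could guard it. This is a sketch under assumptions: the layout (operation files and `.sql` files sharing a file stem), the example paths, and the `missing_sql_trail` name are hypothetical and not something pgroll prescribes.

```python
from pathlib import PurePosixPath
from typing import Iterable, List

# Assumption: each pgroll operation file (e.g. migrations/0007_add_email.yaml)
# has its generated SQL committed under the same stem
# (e.g. sql/0007_add_email.sql). This layout is hypothetical.
OPERATION_SUFFIXES = {".yaml", ".yml", ".json"}


def missing_sql_trail(operation_files: Iterable[str],
                      sql_files: Iterable[str]) -> List[str]:
    """Return operation files that have no committed SQL file with the same stem."""
    sql_stems = {
        PurePosixPath(path).stem
        for path in sql_files
        if PurePosixPath(path).suffix == ".sql"
    }
    return sorted(
        path
        for path in operation_files
        if PurePosixPath(path).suffix in OPERATION_SUFFIXES
        and PurePosixPath(path).stem not in sql_stems
    )
```

A CI job could feed this the two file listings from the repository and fail when the returned list is non-empty, so a migration cannot merge without its SQL counterpart.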

| Tool | Our pick? | Why |
| --- | --- | --- |
| GitHub Dependabot (malware mode) | Yes | Free with GitHub. Detects npm malware against GitHub Advisory Database malware feed. |
| Socket.dev | Yes | Per-dependency security score. PR check fails below threshold. |
| Package-age policy (14d minimum) | Yes (via CI gate) | Datadog’s pattern. Catches typosquat account-takeovers. |
| Syft (SBOM) + Grype (CVE scan) | Yes | Apache-2.0, Anchore, widely adopted. |
| Snyk | Reject for cost | Dependabot + Socket covers it cheaper. |
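
The package-age policy reduces to a date comparison once the publish timestamp of each pinned version is known. The sketch below assumes the CI job has already looked those timestamps up; the lookup itself, the `MIN_AGE` constant, and the `too_young` helper are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Assumption: 14 days, matching the "14d minimum" policy.
MIN_AGE = timedelta(days=14)


def too_young(published_at: Dict[str, datetime],
              now: Optional[datetime] = None) -> List[str]:
    """Return the packages whose pinned version is younger than MIN_AGE.

    `published_at` maps "name@version" to a timezone-aware publish timestamp.
    A version published exactly MIN_AGE ago is accepted.
    """
    now = now or datetime.now(timezone.utc)
    return sorted(pkg for pkg, ts in published_at.items() if now - ts < MIN_AGE)
```

The gate would fail the build whenever the returned list is non-empty, giving a freshly published (and possibly compromised) version time to be flagged before it reaches the rig.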

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| GitHub Actions | Proprietary | GitHub | Yes | Already in use. OIDC to cosign and Sigstore. Moderate vendor lock-in, acceptable given GitHub is also our SCM. |
| GHCR | Proprietary | GitHub | Yes | Already in use. Paired with Actions. Lock-in acceptable. |
| Flux CD | Apache-2.0 | CNCF (graduated) | Yes | Already our GitOps. Stable. |
| Argo CD | Apache-2.0 | CNCF (graduated) | No | Alternative to Flux; switching costs exceed benefit for us. |

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| k3s | Apache-2.0 | CNCF (sandbox, maintained by SUSE) | Yes | Lightweight K8s, single-binary install, fits 8 GB VM. |
| KEDA | Apache-2.0 | CNCF (graduated) | Yes | Event-driven autoscaling + scale-to-zero. Already deployed. |
| GCP Compute (one VM) | Proprietary | Google | Yes | Small bill, predictable, good enough. |

Re-evaluate when:

  • We outgrow a single VM (multi-node Kubernetes warranted)
  • GCP pricing shifts unfavorably
  • k3s project health deteriorates (currently healthy under SUSE)

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Bitwarden | GPL-3.0 (self-hostable) + SaaS | Bitwarden Inc. | Yes | Already in use. Self-host option if SaaS changes unfavorably. |
| MkDocs Material | MIT (community) + commercial Insiders | Martin Donath | Yes | Our docs-site. Community edition is sufficient. |

| Tool | License | Owner | Our pick? | Why |
| --- | --- | --- | --- | --- |
| Inspect AI | MIT | UK AISI | Candidate (validate in Era 2) | Released March 2026. Adopted by METR, Apollo, major labs. OSS, agent-aware, production-shaped. Too new to call chosen; revisit once we have a nightly run with 60 days of data comparing it against raw pytest-style harnesses. |
| SWE-bench Pro | MIT | Scale AI | Yes (benchmark) | Replacement for Verified (contaminated). 1,865 multi-language tasks. |
| lm-eval-harness | MIT | EleutherAI | No (benchmark-only) | Raw model quality, not agent-scaffolding quality. |
| OpenAI Evals | MIT | OpenAI | Reject (abandoned) | Historical. |
| Hypothesis | MPL-2.0 | Community | Yes | Property-based testing for Python code agents write. |

The rig’s total lock-in exposure, honestly:

| Vendor | Lock-in level | Criticality | Why |
| --- | --- | --- | --- |
| Anthropic (as default LLM provider) | High | Critical | LLM is the engine. LiteLLM + OTel GenAI conventions make runtime and backend swappable (see provider-portability.md); prompts are the sticky layer. Migrating to OpenAI or Gemini needs per-prompt re-authoring and a re-run of the eval suite. The cost is concrete, not unbounded. |
| GitHub | High | Critical | Source, CI, OIDC, artifact registry, Issues: all deeply wired. |
| GCP | Medium | High | One VM, replaceable with any VPS vendor, but DNS/network moves cost ~1 week. |
| Cloudflare | Medium | Medium | DNS, tunnels, Pages: replaceable with 1-2 days of work. |
| Sigstore public infra | Low | Medium | Public good service. Private Rekor is the escape if the service model changes. |
| CNCF-hosted tools (Flux, k3s, KEDA, Kyverno, Cilium, Flagger, cert-manager) | Very low | High | Portable, active foundations. |
| LiteLLM | Low | High | MIT + Portkey as fallback. |
| Langfuse/Phoenix | Low | Medium | OTel GenAI conventions make swap trivial. |
| SOPS (getsops) | Very low | Medium | MPL-2.0, CNCF governance, active maintainers; SOPS files are portable ciphertext that any decrypter reads. |
| pgroll | Medium | Medium | Single-vendor bus factor. Inspectable SQL trail preserves schema history. |

The ones that would hurt to lose: Anthropic (prompt portability hard), GitHub (everything wired there). Every other pick has a concrete escape hatch.
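
The "would hurt to lose" shortlist follows mechanically from the table: it is the set of rows that are both high lock-in and critical. A small sketch of that filter, using an abbreviated copy of the table and hypothetical names (`Exposure`, `hurts_to_lose`):

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Exposure:
    vendor: str
    lock_in: str      # "Very low" | "Low" | "Medium" | "High"
    criticality: str  # "Medium" | "High" | "Critical"


# Values copied from the lock-in table; some rows omitted for brevity.
EXPOSURES = [
    Exposure("Anthropic", "High", "Critical"),
    Exposure("GitHub", "High", "Critical"),
    Exposure("GCP", "Medium", "High"),
    Exposure("Cloudflare", "Medium", "Medium"),
    Exposure("Sigstore public infra", "Low", "Medium"),
    Exposure("LiteLLM", "Low", "High"),
    Exposure("SOPS", "Very low", "Medium"),
    Exposure("pgroll", "Medium", "Medium"),
]


def hurts_to_lose(rows: Iterable[Exposure]) -> List[str]:
    """Vendors that are both hard to leave and critical to the rig."""
    return [r.vendor for r in rows
            if r.lock_in == "High" and r.criticality == "Critical"]
```

Keeping the table as data like this would make it cheap to re-run the filter whenever a lock-in level or criticality rating is revised.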

The whitepaper’s picks are living decisions. Trigger a re-evaluation when:

  • License change on a critical tool (BSL drift is the modern pattern)
  • Ownership change — acquisitions, foundations handing off, single maintainers disappearing
  • Material scale change — we grow past 5 operators, add a second cluster, serve customer traffic
  • Active incident — a pick contributed to an outage and the compensating controls weren’t enough
  • Cheaper / better alternative emerges with 2+ years of production adoption evidence

Every re-evaluation ends in one of: keep, migrate, or defer. The decision gets a timestamp and a link to this document’s updated version.
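
One way to keep those outcomes honest is to record each re-evaluation as a small structured entry. The sketch below is an assumption about shape, not an existing rig format: the `Reevaluation` fields, the `record` helper, and the example link are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class Outcome(Enum):
    KEEP = "keep"
    MIGRATE = "migrate"
    DEFER = "defer"


@dataclass(frozen=True)
class Reevaluation:
    tool: str
    trigger: str          # which trigger from the list fired
    outcome: Outcome
    doc_link: str         # link to the updated version of this document
    decided_at: datetime  # timezone-aware decision timestamp


def record(tool: str, trigger: str, outcome: str, doc_link: str,
           decided_at: Optional[datetime] = None) -> Reevaluation:
    """Build a decision record; rejects outcomes other than keep/migrate/defer."""
    if not doc_link:
        raise ValueError("a re-evaluation must link to the updated document")
    # Outcome(...) raises ValueError for any value outside the three allowed.
    return Reevaluation(tool, trigger, Outcome(outcome), doc_link,
                        decided_at or datetime.now(timezone.utc))
```

For example, `record("pgroll", "Ownership change", "migrate", "docs/tool-choices.md")` yields a timestamped entry, while an outcome such as `"rewrite"` is rejected because it is not one of the three allowed endings.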

Short list of things we have evaluated and ruled out:

  • Vault (if we ever need Vault-class capability, we adopt OpenBao instead of Vault)
  • SealedSecrets (never deployed; SOPS is the chosen primary with better governance and no extra controller pod)
  • Full self-hosted LGTM stack on 8 GB (memory-starves)
  • Argo Rollouts with Flux (drift fights)
  • Unleash OSS (EOL)
  • Keptn (CNCF-archived)
  • Doppler / LaunchDarkly / Kong AI Gateway / TrueFoundry (SaaS-only or enterprise pricing)
  • Reshape (bus factor 1)
  • HSM-backed PGP signing (worse than keyless Sigstore)
  • CSI Secrets Store at single-VM scale (DaemonSet footprint)
  • OpenAI Evals (abandoned)
  • microVMs (e2b, Daytona, Firecracker) (wrong threat model)
  • Dev-E .NET standalone worker (dashecorp/dev-e, archived 2026-04-17 — CommandCodeExecutor shells out to claude-cli without MCP injection, stream-json parsing, or token refresh; the value lives in the CLI driver, not the outer state machine, and Node rig-agent-runtime already implements both)