OTel-native LLM observability — free and low-cost options for the rig

User stories spawned from this document

[Done] User story: agent observability — one env var + a trace store (dashecorp/rig-docs#58)

Superseded 2026-04-21. The structural evaluation of candidates below (footprints, LLM UX, lock-in) still stands. The pricing and recommendation sections are superseded by research/2026-04-21-otel-startup-programs-storage-economics, which factors in startup credit programs (Grafana Cloud for Startups = $100k/12mo; Langfuse 50% off first year) and storage-at-scale economics this doc missed. The revised verdict moves the production backend from Phoenix-now/Langfuse-later to Grafana Cloud (credit) + Langfuse (discount) with self-hosted OpenObserve as the explicit fallback. Phoenix remains for the dev inner loop.

TL;DR. The whats-next whitepaper treats Langfuse self-hosted and Phoenix as alternatives chosen on VM size. Real 2026 footprint numbers flip that: Langfuse self-hosted is 1.5–2 GB (six services), Phoenix is 300–500 MB (one container). For the current 8 GB VM, Phoenix is the default; Langfuse is a later migration when multi-tenant/RBAC becomes load-bearing. The other nine tools evaluated are either too heavy (SigNoz, HyperDX), wrong shape (Helicone is proxy-first), at momentum risk (Uptrace), or lack LLM-specific UI (OpenObserve, Jaeger/Tempo, Axiom, Honeycomb).

Why this research exists

The rig emits OpenTelemetry GenAI spans when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set. We need a trace store that:

Ingests OTLP natively — no vendor SDK lock-in, portable across Claude Code, Codex CLI, Gemini CLI
Has LLM-specific UI — token counts, cost per model, prompt/response diff, session trees — not just generic trace waterfalls
Fits alongside rig-conductor, Postgres, Valkey, and the agents on a single 8 GB k3s VM today
Has an honest migration path when we outgrow the 8 GB box

Cost-sensitivity is explicit. The question isn’t which premium tool; it’s which free or near-free OSS tool doesn’t paint us into a corner.

The candidates (in one paragraph each)

Langfuse — MIT core; commercial ee/ modules hold SSO, RBAC, multi-tenant. Self-hosted stack: web + worker + Postgres + ClickHouse + Redis + MinIO — realistic 1.5–2 GB RAM, not the 1 GB the whitepaper currently claims. OTel/OTLP GA since v3. Best-in-class LLM UI: sessions, token/cost per model, evaluations, prompts, datasets. Cloud Hobby free: 50k observations/month, 30-day retention. ~16k stars, YC W23, adopted by Khan Academy, Twilio, SumUp. Depth is the bet; footprint is the cost.

Arize Phoenix — Elastic-2.0 (source-available, self-host permitted). One Python container, ~300–500 MB RAM, SQLite or Postgres backend. Native OTLP + OpenInference (their extension of OTel GenAI). Strong LLM UI: traces, evals, prompt playground, datasets. No built-in auth or multi-tenant in OSS — Arize sells that. ~6k stars, very active. “Drop in and go” on an 8 GB VM. Auth gap is resolvable by putting it behind Tailscale or an auth-proxy sidecar.

OpenObserve — AGPL-3. Single Rust binary, ~200 MB RAM idle, local disk or S3. OTel-native (traces + logs + metrics). No LLM-specific UI — generic trace waterfalls and SQL over span attributes. Cloud free: 200 GB ingest/month. Fast release cadence. Excellent infrastructure backbone; weak for prompt/token inspection.

SigNoz — MIT. ClickHouse + Zookeeper + query-service + frontend + OTel collector — ~2–3 GB RAM, tight on an 8 GB VM sharing with everything else. OTel-native end-to-end. Added LLM/GenAI dashboards in 2025. Cloud: free trial only, paid from ~$199/mo. ~25k stars. Right answer when the rig has its own observability node.

Uptrace — BSL-1.1 (converts to Apache-2 after 3 years). Go binary + ClickHouse + Postgres, ~1 GB. OTel-native, decent trace UI, modest AI dashboards. ~2.5k stars, commit cadence slowed in 2025 — momentum risk. Pick only with a migration contingency.

HyperDX — MIT. ClickHouse + Mongo + app, ~1.5–2 GB. OTel-native, strong trace/log UI, session replay. Acquired by ClickHouse Inc. in early 2025, v2 effectively folded into ClickHouse Observability. No LLM-specific UI. Roadmap now ClickHouse-driven — vendor trajectory unpredictable for the standalone OSS version.

Grafana Cloud Free + Tempo — 50 GB traces/mo, 14-day retention, 3 users. Tempo is OTel-native but has no LLM UI — you build views in Grafana with TraceQL/attributes. Self-hosted Tempo is ~500 MB but needs object storage. Good if you already run Grafana; weak for prompt inspection.

Axiom — SaaS only. Free Personal tier: 0.5 TB ingest/month, 30-day retention, 3 users. Native OTLP. No LLM UI — fast log/event explorer (APL). Generous free tier, zero LLM affordance out of the box. Pair with a thin instrumentation layer if you go this route.

Honeycomb — SaaS. Free: 20M events/month, 60-day retention, 5 users. OTLP-native, excellent BubbleUp/trace UI. No LLM-specific UI; GenAI semconv attributes render as normal span fields.

Helicone — Apache-2. Proxy-first (Postgres + ClickHouse + proxy + web), OTel support experimental. Cloud free: 10k requests/month. Good LLM UI (cost, caching, sessions). Not an OTel-ingest backend; it’s a proxy. Pick only if agent traffic is routed through Helicone, which re-frames the stack.

OpenLLMetry / Traceloop SDK — Apache-2 instrumentation library, not a backend. Emits OTel GenAI spans → ship anywhere. Traceloop SaaS free: 50k spans/month. Useful as the emission side for Codex/Gemini CLI if native GenAI semconv isn’t there yet.

Jaeger / Tempo baseline — CNCF, free, OTel-native, no LLM UI. Baseline only; include as comparison anchor.

Picking by need

Need	Pick	Why
Cheapest free-tier SaaS, OTel-native, LLM UI	Langfuse Cloud Hobby	Only free SaaS with proper token/cost/prompt UI over OTLP. 50k observations/month covers a small rig.
Best self-hosted on one 8 GB VM (today)	Arize Phoenix	300–500 MB, single container, OTLP-native, strong LLM UX. Keep OTel Collector routing infra traces separately.
Best if we scale past 8 GB (later)	Langfuse self-hosted (ClickHouse externalised) or SigNoz	Langfuse wins on LLM depth; SigNoz wins on unified APM.
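
The selection logic in the table can be sketched as a filter over footprint and LLM UI depth. This is only an illustration: the RAM figures are the upper ends of the ranges quoted in the candidate paragraphs, the `llm_ui` scores are an interpretation of those paragraphs, and the 1 GB and 4 GB budgets are hypothetical values, not measured headroom on the VM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    ram_gb_high: float  # upper end of the RAM range quoted in this document
    llm_ui: int         # 0 = none, 1 = dashboards only, 2 = full LLM UI (interpretation)


# Self-hosted candidates only; SaaS-only tools have no local footprint to compare.
CANDIDATES = [
    Candidate("Langfuse", 2.0, 2),
    Candidate("Arize Phoenix", 0.5, 2),
    Candidate("OpenObserve", 0.2, 0),
    Candidate("SigNoz", 3.0, 1),
    Candidate("Uptrace", 1.0, 1),
    Candidate("HyperDX", 2.0, 0),
    Candidate("Tempo", 0.5, 0),
]


def shortlist(budget_gb: float, min_llm_ui: int = 2) -> list[str]:
    """Names of candidates that fit the RAM budget and meet the UI bar, lightest first."""
    fits = [
        c for c in CANDIDATES
        if c.ram_gb_high <= budget_gb and c.llm_ui >= min_llm_ui
    ]
    return [c.name for c in sorted(fits, key=lambda c: c.ram_gb_high)]


print(shortlist(budget_gb=1.0))  # ['Arize Phoenix']
print(shortlist(budget_gb=4.0))  # ['Arize Phoenix', 'Langfuse']
```

With a tight budget only Phoenix survives; loosening the budget brings Langfuse back in, which matches the "today" and "later" rows.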

Recommendation

Deploy Phoenix to the 8 GB VM now as the agent trace store. Single container, expose an OTLP endpoint, set CLAUDE_CODE_ENABLE_TELEMETRY=1 on agent pods, done. Put Phoenix behind Tailscale or a tiny auth proxy (oauth2-proxy) since OSS has no auth — acceptable for internal-only use.

Keep the existing OTel Collector as the single OTLP ingress and fan out from it:

Agent LLM spans (gen_ai.* attributes) → Phoenix
Infra traces + metrics → Grafana Cloud Free (50 GB/mo, 14-day retention) via Tempo/Mimir, or stay on the local Prometheus path for Flagger gates
Optional later: OpenObserve as a unified backend if Grafana Cloud’s free tier is exhausted
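
The fan-out rule itself is small. The sketch below expresses it in Python purely to show the decision; it is not Collector configuration, and the backend labels `phoenix` and `grafana_cloud` are placeholder names. It assumes a span counts as an agent LLM span when any attribute key starts with `gen_ai.`.

```python
from typing import Mapping

GENAI_PREFIX = "gen_ai."  # attribute namespace used by the OTel GenAI conventions


def route_span(attributes: Mapping[str, object]) -> str:
    """Return the (placeholder) backend label a span should be sent to."""
    if any(key.startswith(GENAI_PREFIX) for key in attributes):
        return "phoenix"        # agent LLM spans
    return "grafana_cloud"      # everything else: infra traces


llm_span = {"gen_ai.request.model": "example-model", "gen_ai.usage.input_tokens": 812}
infra_span = {"http.request.method": "GET", "http.response.status_code": 200}

print(route_span(llm_span))    # phoenix
print(route_span(infra_span))  # grafana_cloud
```

In a real deployment the same predicate would live in the Collector's pipeline rather than in application code, so that the agents keep a single OTLP target.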

When a second tenant’s agents come online and multi-tenant / RBAC / prompt-management becomes load-bearing, which Phoenix OSS can’t give us, migrate to Langfuse self-hosted with ClickHouse externalised and Postgres moved to a managed service such as Cloud SQL. The OTLP emitters on agent pods don’t change; only the ingest endpoint URL does. That low migration risk is the single highest-leverage reason to pick Phoenix now rather than Langfuse-from-day-one: we prove the traces flow, get the LLM UX, and pay zero switching cost on the emitter side when the scale demands ClickHouse anyway.
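
A minimal sketch of what "only the endpoint URL changes" means on the emitter side. `CLAUDE_CODE_ENABLE_TELEMETRY` comes from this document and `OTEL_EXPORTER_OTLP_ENDPOINT` is the standard OpenTelemetry exporter variable; both hostnames are hypothetical, and a real backend may also require credentials or headers that this sketch leaves out.

```python
def agent_env(otlp_endpoint: str) -> dict[str, str]:
    """Telemetry-related environment for an agent pod (illustrative subset)."""
    return {
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
        "OTEL_EXPORTER_OTLP_ENDPOINT": otlp_endpoint,
    }


# Hypothetical in-cluster addresses, used only to show the diff.
before = agent_env("http://phoenix.example.internal:4317")
after = agent_env("http://langfuse.example.internal:4318")

changed = sorted(key for key in before if before[key] != after[key])
print(changed)  # ['OTEL_EXPORTER_OTLP_ENDPOINT']
```

If the agents export to the Collector instead of directly to the backend, even this variable stays fixed and the change moves into the Collector's exporter settings.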

Caveats

Phoenix OSS has no auth. Do not expose publicly. Tailscale ACL or oauth2-proxy sidecar.
Phoenix on Elastic-2.0 — source-available, not OSI-OSS. Self-host is explicitly permitted for any purpose except offering a competing hosted Phoenix-as-a-service. Fine for internal use; would matter only if we ever try to SaaS-resell it.
Langfuse commercial features live under ee/ in the repo — SSO, SCIM, RBAC, audit logs, multi-tenant quotas. MIT core is enough for single-tenant ops; once we onboard a second tenant we’ll need ee/.
HyperDX post-acquisition trajectory — the OSS repo still ships, but roadmap is ClickHouse Inc.’s. Treat as a ClickHouse front-end, not an independent project.
Uptrace momentum — fewer commits in H2 2025 than H1. If you pick it, plan a migration contingency.
Helicone is a proxy, not an OTel backend. If we route agent HTTP through it we get its UI but break OTel-native emission. Picking it means replacing the OTLP path, not augmenting it, and that is the reason not to pick it.
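Free-tier reality check: all three SaaS free tiers (Langfuse 50k obs, Axiom 0.5 TB, Honeycomb 20M events) handle a small rig today. At ~1k tasks/day with ~50 LLM calls/task, volume reaches ~1.5M spans/mo (1,000 × 50 × 30). That is still comfortable on Axiom and Honeycomb, but it is roughly 30× the Langfuse Hobby limit of 50k observations, so Langfuse Hobby only covers the rig while it stays under roughly 1,600 LLM calls/day. Onboarding a second tenant with busy workloads would push past that limit sooner; the self-host is the hedge.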
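
The volume estimate can be checked with a few lines of arithmetic. This sketch uses the task and call rates and the free-tier limits quoted in this document, and assumes (as a simplification) a 30-day month, exactly one span per LLM call, and that each span counts as one Langfuse observation or one Honeycomb event.

```python
TASKS_PER_DAY = 1_000
LLM_CALLS_PER_TASK = 50
DAYS_PER_MONTH = 30  # simplifying assumption

spans_per_month = TASKS_PER_DAY * LLM_CALLS_PER_TASK * DAYS_PER_MONTH

# Count-based free-tier limits quoted in this document.
COUNT_LIMITS = {
    "Langfuse Cloud Hobby (observations)": 50_000,
    "Honeycomb Free (events)": 20_000_000,
}

# Fraction of each limit consumed; values above 1.0 mean the tier is exceeded.
usage = {name: spans_per_month / limit for name, limit in COUNT_LIMITS.items()}

# Axiom's limit is byte-based (0.5 TB/month), so express it as a per-span byte budget.
AXIOM_BYTES_PER_MONTH = 500_000_000_000
axiom_bytes_per_span = AXIOM_BYTES_PER_MONTH // spans_per_month

print(spans_per_month)  # 1500000
for name, ratio in usage.items():
    print(f"{name}: {ratio:.3f}x of limit")
print(f"Axiom budget: {axiom_bytes_per_span} bytes per span")
```

Under these assumptions the monthly volume is 30 times the Langfuse Hobby allowance, 7.5% of the Honeycomb allowance, and leaves about 333 KB per span on Axiom.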

Supersession

This research supersedes the original Observability recommendation in whitepapers/2026-04-20-whats-next Priority 2, which listed Langfuse and Phoenix as interchangeable alternatives. The whitepaper should be amended to name Phoenix as the immediate default and Langfuse as the documented scale-up migration.