
OTel-native LLM observability — free and low-cost options for the rig


Superseded 2026-04-21. The structural evaluation of candidates below (footprints, LLM UX, lock-in) still stands. The pricing and recommendation sections are superseded by research/2026-04-21-otel-startup-programs-storage-economics, which factors in startup credit programs (Grafana Cloud for Startups = $100k/12mo; Langfuse 50% off first year) and storage-at-scale economics this doc missed. The revised verdict moves the production backend from Phoenix-now/Langfuse-later to Grafana Cloud (credit) + Langfuse (discount) with self-hosted OpenObserve as the explicit fallback. Phoenix remains for the dev inner loop.

TL;DR. The whats-next whitepaper treats Langfuse self-hosted and Phoenix as alternatives chosen on VM size. Real 2026 footprint numbers flip that: Langfuse self-hosted is 1.5–2 GB (six services), Phoenix is 300–500 MB (one container). For the current 8 GB VM, Phoenix is the default; Langfuse is a later migration when multi-tenant/RBAC becomes load-bearing. The other nine tools evaluated are either too heavy (SigNoz, HyperDX), wrong shape (Helicone is proxy-first), at momentum risk (Uptrace), or lack LLM-specific UI (OpenObserve, Jaeger/Tempo, Axiom, Honeycomb).

The rig emits OpenTelemetry GenAI spans when CLAUDE_CODE_ENABLE_TELEMETRY=1 is set. We need a trace store that:

  1. Ingests OTLP natively — no vendor SDK lock-in, portable across Claude Code, Codex CLI, Gemini CLI
  2. Has LLM-specific UI — token counts, cost per model, prompt/response diff, session trees — not just generic trace waterfalls
  3. Fits alongside rig-conductor, Postgres, Valkey, and the agents on a single 8 GB k3s VM today
  4. Has an honest migration path when we outgrow the 8 GB box

Cost-sensitivity is explicit. The question isn’t which premium tool; it’s which free or near-free OSS tool doesn’t paint us into a corner.

Langfuse — MIT core; commercial ee/ modules hold SSO, RBAC, multi-tenant. Self-hosted stack: web + worker + Postgres + ClickHouse + Redis + MinIO — realistic 1.5–2 GB RAM, not the 1 GB the whitepaper currently claims. OTel/OTLP GA since v3. Best-in-class LLM UI: sessions, token/cost per model, evaluations, prompts, datasets. Cloud Hobby free: 50k observations/month, 30-day retention. ~16k stars, YC W23, adopted by Khan Academy, Twilio, SumUp. Depth is the bet; footprint is the cost.

Arize Phoenix — Elastic-2.0 (source-available, self-host permitted). One Python container, ~300–500 MB RAM, SQLite or Postgres backend. Native OTLP + OpenInference (their extension of OTel GenAI). Strong LLM UI: traces, evals, prompt playground, datasets. No built-in auth or multi-tenant in OSS — Arize sells that. ~6k stars, very active. “Drop in and go” on an 8 GB VM. Auth gap is resolvable by putting it behind Tailscale or an auth-proxy sidecar.

OpenObserve — AGPL-3. Single Rust binary, ~200 MB RAM idle, local disk or S3. OTel-native (traces + logs + metrics). No LLM-specific UI — generic trace waterfalls and SQL over span attributes. Cloud free: 200 GB ingest/month. Fast release cadence. Excellent infrastructure backbone; weak for prompt/token inspection.

SigNoz — MIT. ClickHouse + Zookeeper + query-service + frontend + OTel collector — ~2–3 GB RAM, tight on an 8 GB VM sharing with everything else. OTel-native end-to-end. Added LLM/GenAI dashboards in 2025. Cloud: free trial only, paid from ~$199/mo. ~25k stars. Right answer when the rig has its own observability node.

Uptrace — BSL-1.1 (converts to Apache-2 after 3 years). Go binary + ClickHouse + Postgres, ~1 GB. OTel-native, decent trace UI, modest AI dashboards. ~2.5k stars, commit cadence slowed in 2025 — momentum risk. Pick only with a migration contingency.

HyperDX — MIT. ClickHouse + Mongo + app, ~1.5–2 GB. OTel-native, strong trace/log UI, session replay. Acquired by ClickHouse Inc. mid-2024, v2 effectively folded into ClickHouse Observability. No LLM-specific UI. Roadmap now ClickHouse-driven — vendor trajectory unpredictable for the standalone OSS version.

Grafana Cloud Free + Tempo — 50 GB traces/mo, 14-day retention, 3 users. Tempo is OTel-native but has no LLM UI — you build views in Grafana with TraceQL/attributes. Self-hosted Tempo is ~500 MB but needs object storage. Good if you already run Grafana; weak for prompt inspection.

Axiom — SaaS only. Free Personal tier: 0.5 TB ingest/month, 30-day retention, 3 users. Native OTLP. No LLM UI — fast log/event explorer (APL). Generous free tier, zero LLM affordance out of the box. Pair with a thin instrumentation layer if you go this route.

Honeycomb — SaaS. Free: 20M events/month, 60-day retention, 5 users. OTLP-native, excellent BubbleUp/trace UI. No LLM-specific UI; GenAI semconv attributes render as normal span fields.

Helicone — Apache-2. Proxy-first (Postgres + ClickHouse + proxy + web), OTel support experimental. Cloud free: 10k requests/month. Good LLM UI (cost, caching, sessions). Not an OTel-ingest backend; it’s a proxy. Pick only if agent traffic is routed through Helicone, which re-frames the stack.

OpenLLMetry / Traceloop SDK — Apache-2 instrumentation library, not a backend. Emits OTel GenAI spans → ship anywhere. Traceloop SaaS free: 50k spans/month. Useful as the emission side for Codex/Gemini CLI if native GenAI semconv isn’t there yet.

Jaeger / Tempo baseline — CNCF, free, OTel-native, no LLM UI. Baseline only; include as comparison anchor.
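To make the footprint comparison concrete, the sketch below expresses each self-hostable candidate's quoted RAM range as a share of the 8 GB VM. The figures are the approximate ones from the notes above; the function names are illustrative, and the shares ignore whatever rig-conductor, Postgres, Valkey, and the agents already consume, which this document does not quantify.

```python
# Approximate RAM footprints in GB as (low, high), copied from the candidate notes.
# Single-figure estimates ("~200 MB", "~1 GB") are stored with low == high.
FOOTPRINTS_GB = {
    "OpenObserve": (0.2, 0.2),
    "Arize Phoenix": (0.3, 0.5),
    "Tempo (self-hosted)": (0.5, 0.5),
    "Uptrace": (1.0, 1.0),
    "Langfuse": (1.5, 2.0),
    "HyperDX": (1.5, 2.0),
    "SigNoz": (2.0, 3.0),
}
VM_RAM_GB = 8.0


def share_of_vm(footprint_gb, vm_gb=VM_RAM_GB):
    """Return the (low, high) fraction of total VM RAM a backend would occupy."""
    low, high = footprint_gb
    return low / vm_gb, high / vm_gb


def ranked_by_worst_case(footprints):
    """Return backend names sorted by upper-bound footprint, smallest first."""
    return sorted(footprints, key=lambda name: footprints[name][1])


if __name__ == "__main__":
    for name in ranked_by_worst_case(FOOTPRINTS_GB):
        low, high = share_of_vm(FOOTPRINTS_GB[name])
        print(f"{name:22s} {low:6.1%} to {high:6.1%} of {VM_RAM_GB:.0f} GB")
```

On these numbers Phoenix tops out at about 6% of the VM, while Langfuse can reach 25% and SigNoz 37.5% before anything else on the box is counted.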

| Need | Pick | Why |
| --- | --- | --- |
| Cheapest free-tier SaaS, OTel-native, LLM UI | Langfuse Cloud Hobby | Only free SaaS with proper token/cost/prompt UI over OTLP. 50k observations/month covers a small rig. |
| Best self-hosted on one 8 GB VM (today) | Arize Phoenix | 300–500 MB, single container, OTLP-native, strong LLM UX. Keep OTel Collector routing infra traces separately. |
| Best if we scale past 8 GB (later) | Langfuse self-hosted (ClickHouse externalised) or SigNoz | Langfuse wins on LLM depth; SigNoz wins on unified APM. |

Deploy Phoenix to the 8 GB VM now as the agent trace store. Single container, expose an OTLP endpoint, set CLAUDE_CODE_ENABLE_TELEMETRY=1 on agent pods, done. Put Phoenix behind Tailscale or a tiny auth proxy (oauth2-proxy) since OSS has no auth — acceptable for internal-only use.

Keep the existing OTel Collector as the single OTLP ingress and fan out from there:

  • Agent LLM spans (gen_ai.* attributes) → Phoenix
  • Infra traces + metrics → Grafana Cloud Free (50 GB/mo free, 14-day retention) via Tempo/Mimir — or stay on the local Prometheus path for Flagger gates
  • Optional later: OpenObserve as a unified backend if we exhaust Grafana Cloud's free tier
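The fan-out rule above comes down to one predicate: does the span carry any attribute in the `gen_ai.*` namespace? The sketch below models that decision in plain Python to make the rule explicit. It is not Collector configuration, and the backend labels and function names are illustrative assumptions.

```python
from typing import Mapping

# Attribute namespace used by the OpenTelemetry GenAI semantic conventions.
GENAI_PREFIX = "gen_ai."


def is_llm_span(attributes: Mapping[str, object]) -> bool:
    """Return True when at least one attribute key is in the gen_ai.* namespace."""
    return any(key.startswith(GENAI_PREFIX) for key in attributes)


def route(attributes: Mapping[str, object],
          llm_backend: str = "phoenix",
          infra_backend: str = "grafana-cloud") -> str:
    """Pick a destination label for a span (labels are hypothetical)."""
    return llm_backend if is_llm_span(attributes) else infra_backend


if __name__ == "__main__":
    llm_span = {"gen_ai.request.model": "example-model",
                "gen_ai.usage.input_tokens": 1200}
    infra_span = {"http.request.method": "GET"}
    print(route(llm_span))    # destination for an agent LLM span
    print(route(infra_span))  # destination for an infra span
```

Keeping the split in the Collector rather than in the emitters means agent pods never need to know which backend stores their spans.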

When a second tenant's agents come online and multi-tenant / RBAC / prompt-management becomes load-bearing (which Phoenix OSS can't give us), migrate to Langfuse self-hosted with ClickHouse externalised and Postgres moved to a managed service such as Cloud SQL. The OTLP emitters on agent pods don't change; only the ingest endpoint URL does. That low migration cost is the single highest-leverage reason to pick Phoenix now rather than Langfuse-from-day-one: we prove the traces flow, get the LLM UX, and pay almost no switching cost on the emission side when scale demands ClickHouse anyway.
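A minimal sketch of why the migration is cheap on the emission side, assuming the Collector stays the single ingress: the agent pod environment is the same before and after, and only one destination URL in the routing table changes. All hostnames are hypothetical placeholders, not real endpoint paths for either product, and the dictionaries stand in for real pod specs and Collector config.

```python
# Hypothetical in-cluster addresses, for illustration only.
COLLECTOR_URL = "http://otel-collector.example.internal:4318"
PHOENIX_URL = "http://phoenix.example.internal/otlp"
LANGFUSE_URL = "http://langfuse.example.internal/otlp"
INFRA_URL = "https://grafana-cloud.example.invalid/otlp"


def agent_pod_env(collector_endpoint: str) -> dict:
    """Environment for an agent pod; it only ever knows the Collector address."""
    return {
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
        # Standard OTel SDK variable for the OTLP export target.
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_endpoint,
    }


def collector_routes(llm_backend_url: str) -> dict:
    """Destination per signal class, standing in for Collector exporter config."""
    return {"llm_traces": llm_backend_url, "infra_traces": INFRA_URL}


def changed_keys(before: dict, after: dict) -> list:
    """Return the sorted keys whose values differ between two configs."""
    return sorted(k for k in before.keys() | after.keys()
                  if before.get(k) != after.get(k))


if __name__ == "__main__":
    env_before = agent_pod_env(COLLECTOR_URL)
    env_after = agent_pod_env(COLLECTOR_URL)
    print(changed_keys(env_before, env_after))
    print(changed_keys(collector_routes(PHOENIX_URL),
                       collector_routes(LANGFUSE_URL)))
```

The sketch covers emission only. Traces already stored in Phoenix would need a separate export or would simply age out.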

  • Phoenix OSS has no auth. Do not expose it publicly; restrict access with a Tailscale ACL or an oauth2-proxy sidecar.
  • Phoenix on Elastic-2.0 — source-available, not OSI-OSS. Self-host is explicitly permitted for any purpose except offering a competing hosted Phoenix-as-a-service. Fine for internal use; would matter only if we ever try to SaaS-resell it.
  • Langfuse commercial features live under ee/ in the repo — SSO, SCIM, RBAC, audit logs, multi-tenant quotas. MIT core is enough for single-tenant ops; once we onboard a second tenant we’ll need ee/.
  • HyperDX post-acquisition trajectory — the OSS repo still ships, but roadmap is ClickHouse Inc.’s. Treat as a ClickHouse front-end, not an independent project.
  • Uptrace momentum — fewer commits in H2 2025 than H1. If you pick it, plan a migration contingency.
  • Helicone is a proxy, not an OTel backend. If we route agent HTTP through it we get its UI but break OTel-native emission. Picking it means replacing the OTLP path, not augmenting it, which is why we should not pick it.
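  • Free-tier reality check: the three SaaS free tiers (Langfuse 50k observations, Axiom 0.5 TB, Honeycomb 20M events) are not equivalent at rig volume. ~1k tasks/day × ~50 LLM calls/task × 30 days ≈ 1.5M spans/mo. That is comfortable on Honeycomb (under 8% of 20M events) and on Axiom as long as spans average well under ~330 KB, but it is roughly 30× the Langfuse Hobby allowance if each span counts as one observation; 50k observations covers only about 33 such tasks/day. Langfuse Hobby therefore suits only a very small rig, a second tenant with busy workloads widens the gap further, and the self-host is the hedge.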
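The volume estimate in the last bullet can be checked against the quoted free-tier limits with a few lines of arithmetic. The sketch assumes one LLM call produces one span, that one span counts as one Langfuse observation or one Honeycomb event, a 30-day month, and decimal terabytes; none of those mappings are confirmed by the vendors' documentation here.

```python
TASKS_PER_DAY = 1_000
LLM_CALLS_PER_TASK = 50
DAYS_PER_MONTH = 30

# Free-tier limits as quoted in this document.
LANGFUSE_HOBBY_OBSERVATIONS = 50_000
HONEYCOMB_FREE_EVENTS = 20_000_000
AXIOM_FREE_BYTES = 500_000_000_000  # 0.5 TB, decimal


def monthly_spans(tasks_per_day=TASKS_PER_DAY,
                  calls_per_task=LLM_CALLS_PER_TASK,
                  days=DAYS_PER_MONTH) -> int:
    """Spans per month, assuming one span per LLM call."""
    return tasks_per_day * calls_per_task * days


def utilisation(volume: int, limit: int) -> float:
    """Fraction of a limit consumed; values above 1.0 mean the limit is exceeded."""
    return volume / limit


if __name__ == "__main__":
    spans = monthly_spans()
    print(f"spans per month:        {spans:,}")
    print(f"Langfuse Hobby usage:   {utilisation(spans, LANGFUSE_HOBBY_OBSERVATIONS):.1f}x")
    print(f"Honeycomb free usage:   {utilisation(spans, HONEYCOMB_FREE_EVENTS):.1%}")
    # Largest average span size that still fits in the Axiom ingest allowance.
    print(f"Axiom max bytes/span:   {AXIOM_FREE_BYTES // spans:,}")
    # Tasks per day that the Langfuse allowance alone would cover.
    tasks = LANGFUSE_HOBBY_OBSERVATIONS / (LLM_CALLS_PER_TASK * DAYS_PER_MONTH)
    print(f"Langfuse tasks per day: {tasks:.1f}")
```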

This research supersedes the original Observability recommendation in whitepapers/2026-04-20-whats-next Priority 2, which listed Langfuse and Phoenix as interchangeable alternatives. The whitepaper should be amended to name Phoenix as the immediate default and Langfuse as the documented scale-up migration.