User story: agent observability — one env var + a trace store

User story

As the rig operator

I want every agent LLM call emitted as an OpenTelemetry GenAI span and stored in an LLM-aware trace store — without adding significant RAM pressure to the existing 8 GB VM

So that a healthy run and an unhealthy run are distinguishable at a glance (tokens, cost, latency, prompt, response), every other priority on the roadmap has baselines to gate on, and the trace store decision does not lock us out of scaling later.

Context

See whats-next whitepaper §Priority 2, the source observability.md whitepaper, and two research docs in sequence:

2026-04-20 OTel-native LLM observability options — structural comparison of 11 candidates (footprint, LLM UX, lock-in). Superseded on the pricing/recommendation question.
2026-04-21 Startup programs + storage economics — factors in Grafana Cloud for Startups ($100k / 12 mo) and Langfuse 50% first-year discount, both of which Invotek AS qualifies for. Flips the production-backend recommendation.

The OTel Collector is already deployed for rig-conductor infrastructure traces. Two gaps remain: (a) agent pods don’t emit OTel; (b) no LLM-aware trace store to render what they would emit.

Revised vendor path (2026-04-21): Phoenix remains for the dev inner loop (local latency matters during development). Production backend becomes Grafana Cloud + Langfuse: Grafana’s startup credit makes the full LGTM stack effectively free for 12 months ($22–$47/mo list at our 1.5M–15M spans/mo workload); Langfuse’s startup discount halves the LLM-specific UX cost. OpenObserve self-hosted on GKE is the documented fallback if Grafana credit is denied (~$30/mo flat, zero lock-in).
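
As a rough check on the "effectively free" claim, the first-year list-price envelope can be worked out from the figures quoted above. This is a sketch using only those numbers; the variable names are illustrative and real invoices may differ.

```python
# First-year cost envelope, using only the figures quoted in this story.
MONTHS = 12

grafana_list_per_month = (22, 47)  # USD/mo list price at 1.5M-15M spans/mo
grafana_credit = 100_000           # USD, Grafana Cloud for Startups, valid 12 months
openobserve_per_month = 30         # USD/mo flat, self-hosted fallback (approximate)

grafana_list_year = tuple(m * MONTHS for m in grafana_list_per_month)
openobserve_year = openobserve_per_month * MONTHS

# Out-of-pocket cost if the credit is granted: list price minus credit, floored at 0.
grafana_with_credit = tuple(max(0, y - grafana_credit) for y in grafana_list_year)

print(grafana_list_year)    # (264, 564)
print(grafana_with_credit)  # (0, 0)
print(openobserve_year)     # 360
```

At this workload the credit exceeds the list price by a wide margin, and the OpenObserve fallback sits inside the same range as Grafana's list price, so a denied credit changes the operational burden more than the bill.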

Acceptance criteria

⏳ Apply for credits: Grafana Cloud for Startups and Langfuse early-stage discount. Both approvals take 1–2 weeks.
⏳ CLAUDE_CODE_ENABLE_TELEMETRY=1 set in agent pods: one rig-agent-helmrelease.yaml edit per agent (dev-e, review-e, macos-e). No code change. Verifiable via kubectl exec <agent-pod> -- env | grep TELEMETRY.
⏳ OTel Collector dual-export — GenAI spans → Langfuse Cloud (Hobby free until discount lands); infra + full OTel → Grafana Cloud (free 50 GB / 14-day until credit lands).
⏳ LLM spans visible in Langfuse for all three agents — token counts, model, latency, prompt/response, session tree. Spot-check 10 random tasks.
⏳ Infra traces visible in Grafana Cloud — service topology, request spans, latency histograms.
⏳ Phoenix kept for dev inner loop — local docker compose instance developers run while iterating on prompts or agent code; OTLP endpoint configurable via env var.
⏳ Credits-granted vs fallback ADR — short tool-choices ADR. If either credit is denied within 30 days, ADR specifies the fallback path (Grafana credit denied → OpenObserve self-host; Langfuse discount denied → stay on Cloud Hobby until volume forces a choice).
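
The dual-export rule in AC 3 can be sketched as a small routing function. This is an illustration of the intended behaviour, not Collector configuration: it assumes "full OTel → Grafana Cloud" means every span goes to Grafana, and that a GenAI span is recognisable by carrying at least one gen_ai.* attribute. The labels "grafana" and "langfuse" are hypothetical names, not real exporter IDs.

```python
from typing import Mapping

GRAFANA = "grafana"    # hypothetical label for the Grafana Cloud exporter
LANGFUSE = "langfuse"  # hypothetical label for the Langfuse Cloud exporter


def is_genai_span(attributes: Mapping[str, object]) -> bool:
    """Treat a span as a GenAI span if it carries any gen_ai.* attribute."""
    return any(key.startswith("gen_ai.") for key in attributes)


def destinations(attributes: Mapping[str, object]) -> list[str]:
    """Every span goes to Grafana; GenAI spans are also copied to Langfuse."""
    targets = [GRAFANA]
    if is_genai_span(attributes):
        targets.append(LANGFUSE)
    return targets


if __name__ == "__main__":
    print(destinations({"http.request.method": "GET"}))
    print(destinations({"gen_ai.request.model": "example-model"}))
```

In the Collector itself the same split would be expressed as two trace pipelines, one of them filtered on the gen_ai.* attributes; the exact processor and exporter settings depend on the deployed Collector distribution.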

What it unblocks

Priority 3 (hard cost ceiling) — cost dashboards need span data; LiteLLM proxy attribution dovetails into per-task cost tracking via gen_ai.usage.* attributes.
Priority 4 (nightly quality gate) — regression metrics need baselines; baselines need a trace store.
Tier promotion from T1 to T2 — decision requires a quality signal with more precision than “did CI pass.”
Debuggability of sick runs — today a bad run is visible only in Discord threads and event-store projections.
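
To make the Priority 3 dependency concrete, the sketch below shows how per-task cost could be derived from gen_ai.usage.* token counts once spans are stored. It rests on several assumptions: the span shape (a dict with "task_id" and "attributes"), the attribute names (taken from the OpenTelemetry GenAI semantic conventions, which the agents may or may not emit verbatim), and the price table, which is entirely hypothetical.

```python
from collections import defaultdict

# Hypothetical USD prices per 1M tokens; real prices depend on model and contract.
PRICE_PER_MTOK = {"example-model": {"input": 3.0, "output": 15.0}}


def cost_per_task(spans):
    """Sum the estimated USD cost per task from gen_ai.usage.* token counts."""
    totals = defaultdict(float)
    for span in spans:
        attrs = span["attributes"]
        price = PRICE_PER_MTOK[attrs["gen_ai.request.model"]]
        cost = (
            attrs["gen_ai.usage.input_tokens"] * price["input"]
            + attrs["gen_ai.usage.output_tokens"] * price["output"]
        ) / 1_000_000
        totals[span["task_id"]] += cost
    return dict(totals)


if __name__ == "__main__":
    example_spans = [
        {"task_id": "t1", "attributes": {
            "gen_ai.request.model": "example-model",
            "gen_ai.usage.input_tokens": 1000,
            "gen_ai.usage.output_tokens": 200}},
        {"task_id": "t1", "attributes": {
            "gen_ai.request.model": "example-model",
            "gen_ai.usage.input_tokens": 2000,
            "gen_ai.usage.output_tokens": 0}},
    ]
    print(cost_per_task(example_spans))
```

A hard cost ceiling would then compare each task's running total against a budget; that enforcement belongs to the Priority 3 story and is not shown here.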

Out of scope

LiteLLM proxy (Priority 3 — separate user story)
Langfuse self-hosted migration (only relevant if Cloud discount is denied AND we outgrow Hobby; covered in AC 7 ADR)
Honeycomb-style burn-rate alerts (Phase 5 per implementation-status)
Agent quality evaluation (Priority 4)

Priority

High. Second in sequence after Priority 1 safety foundation. Visibility is prerequisite for every measurement-driven decision that follows.

Estimated effort

AC 1 (apply for credits): ~30 min of paperwork, 1–2 week approval wait.
AC 2 (CLAUDE_CODE_ENABLE_TELEMETRY=1 + OTEL_EXPORTER_OTLP_ENDPOINT): ~1 day. HelmRelease env-var edits across three agents + one SealedSecret for the endpoint auth.
AC 3 (Collector dual-export): ~2 days. Config change to the existing Collector.
AC 4 (Langfuse spans): ~1 day. Manual spot-check of 10 tasks.
AC 5 (Grafana infra traces): ~1 day. Free-tier signup + endpoint config.
AC 6 (Phoenix dev-loop docker compose): ~1 day. One-off dev-setup doc; no cluster work.
AC 7 (ADR): ~1 day.

Total: ~7 working days of focused work, roughly a week and a half (plus a 1–2 week credit-approval wait that doesn’t block AC 2–5 on free tiers).
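
The AC 4 spot-check is estimated as manual work, but the sampling and the field check could be scripted along these lines. This is a minimal sketch: the required attribute names are assumed from the OpenTelemetry GenAI semantic conventions, the helper names are hypothetical, and it covers only model and token counts. Latency, prompt/response, and the session tree would still be checked by eye in Langfuse.

```python
import random

# Assumed attribute names, following the OpenTelemetry GenAI semantic conventions;
# the names the agents actually emit may differ.
REQUIRED_ATTRIBUTES = (
    "gen_ai.request.model",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
)


def sample_tasks(task_ids, k=10, seed=None):
    """Pick up to k distinct task IDs at random (fewer if not enough exist)."""
    rng = random.Random(seed)
    ids = sorted(set(task_ids))
    return rng.sample(ids, min(k, len(ids)))


def missing_attributes(span_attributes):
    """Return the required attributes absent from one span's attribute dict."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in span_attributes]


if __name__ == "__main__":
    picked = sample_tasks([f"task-{i}" for i in range(50)], seed=1)
    print(len(picked))
    print(missing_attributes({"gen_ai.request.model": "example-model"}))
```

Passing a fixed seed makes the sample reproducible, which helps if the same ten tasks need to be re-inspected after a config change.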

Caveats

Phoenix OSS has no auth. If an instance is ever exposed beyond the engineer’s machine, a Tailscale ACL is mandatory. Dev-loop use is docker compose on the engineer’s machine, so there is no public surface.
Phoenix is licensed under the Elastic License 2.0 (ELv2): source-available, self-hosting permitted for internal use, but offering Phoenix to third parties as a hosted service is not. Non-issue for us.
Grafana Cloud credit is not guaranteed — Invotek AS qualifies on the plain criteria (<$10M funded, <25 FTE) but approval takes 1–2 weeks. AC 7 ADR covers the fallback path (OpenObserve self-host).
Langfuse Hobby free tier is 50k observations/mo — enough for today’s volume but not for multi-tenant scale. The 50% startup discount makes Pro $99/mo first year, so the upgrade path is bounded.
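
The fallback rules that the AC 7 ADR is meant to record can be summarised as a two-input decision. The function below is a sketch of that table only; the function name, dictionary keys, and plan labels are illustrative, and the 30-day decision window is not modelled.

```python
def production_backends(grafana_credit_granted: bool,
                        langfuse_discount_granted: bool) -> dict:
    """Map the two approval outcomes to the backends named in AC 7."""
    infra = "Grafana Cloud" if grafana_credit_granted else "OpenObserve (self-hosted)"
    llm = (
        "Langfuse Cloud (discounted paid plan)"
        if langfuse_discount_granted
        else "Langfuse Cloud Hobby"
    )
    return {"infra_traces": infra, "llm_traces": llm}


if __name__ == "__main__":
    for grafana_ok in (True, False):
        for langfuse_ok in (True, False):
            print(grafana_ok, langfuse_ok,
                  production_backends(grafana_ok, langfuse_ok))
```

The two decisions are independent: a denied Grafana credit changes only the infra trace backend, and a denied Langfuse discount changes only the LLM trace plan.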