
What to implement next — raising the floor before raising the ceiling

For engineering leadership deciding what to fund next. The rig runs autonomous issue → merged PR loops today (~20 min, ~$0.62/task, zero human interventions). But the implementation matrix is honest: as of the 2026-04-21 snapshot, 24 of 78 tracked capabilities are deployed or partial (31%) and 32 are planned (41%). The question isn’t whether the rig works; it’s which gap to close first.

TL;DR: raise the floor before the ceiling. Four investments, in this order: (1) safety foundation — dangerous-command guard, worktrees, egress policy; (2) agent observability — one env var + a trace store; (3) hard cost ceiling — LiteLLM proxy, per-agent virtual keys; (4) nightly quality gate — golden suite as merge blocker. None of these add headline features. All of them make it safe to add headline features later.

flowchart LR
    NOW["📍 Today<br/>31% deployed or partial<br/>no hard guards"] --> F1
    F1["🛡 Safety floor<br/>blocks the unrecoverable"] --> F2
    F2["👁 Visibility<br/>we can see what agents do"] --> F3
    F3["💰 Cost ceiling<br/>proxy-enforced, not trust-based"] --> F4
    F4["✅ Quality gate<br/>regression blocks merge"] --> NEXT["🚀 Ready for<br/>ambition investments"]

    style NOW fill:#ffebee,stroke:#c62828
    style F1 fill:#fff3e0,stroke:#ef6c00
    style F2 fill:#fff8e1,stroke:#f9a825
    style F3 fill:#e3f2fd,stroke:#1976d2
    style F4 fill:#e8f5e9,stroke:#388e3c
    style NEXT fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px

Each layer unblocks the next. Without the floor, the ceiling is aspirational.


pie title 78 tracked capabilities · updated 2026-04-21
    "Deployed" : 17
    "Partial" : 7
    "Planned" : 32
    "Deferred" : 9
    "Rejected" : 13

Source: rig-gitops/docs/whitepaper/implementation-status.md, updated every merge. (2026-04-21: +3 Deployed and +1 Partial from Priority 1 shipping — dangerous-command-guard, worktrees per task, GuardBlocked events, and partial GitHub-App-1h-tokens.)
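
The shares implied by the chart can be checked with a few lines of arithmetic. This is only a sketch over the five counts shown above; the `share` helper is a name introduced here for illustration.

```python
# Counts copied from the status chart above (2026-04-21 snapshot).
counts = {"Deployed": 17, "Partial": 7, "Planned": 32, "Deferred": 9, "Rejected": 13}

total = sum(counts.values())


def share(*statuses: str) -> float:
    """Percentage of all tracked capabilities that fall in the given statuses."""
    return 100 * sum(counts[s] for s in statuses) / total


print(total)                                   # 78
print(round(share("Deployed", "Partial"), 1))  # 30.8
print(round(share("Planned"), 1))              # 41.0
```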

What works today. rig-conductor event store (Marten/Postgres, 28 event types, projections live). Valkey per-agent streams + KEDA autoscaling. rig-dev / rig-reviewer / rig-macos runtimes deployed. Memory MCP with pgvector + HNSW. Brain compiled from facts/*.yaml on every merge with CI drift checks. Cost attribution via TokenUsageProjection. SOPS + age + Flux inline decryption (the security foundation). Three autonomous merges on 2026-04-19 in a 70-minute window. Priority 1 safety floor 3.5 of 5 complete as of 2026-04-21: PreToolUse dangerous-command guard active on all agent pods, per-task git worktrees, GuardBlocked events flowing to rig-conductor, GitHub App 1h-token hardening (no PAT fallback on mint failure).

What’s still missing. Default-deny egress policy (AC 5, the heaviest). All of Priority 2 observability (parked on destination pick + startup-credit decision). Hard cost cap. Regression gate. These are all planned with concrete tickets. Priority 1 of “raising the floor” is one AC from done.


flowchart TB
    subgraph P0["🛡 Phase 0 — deterministic guards"]
        G1["✅ Dangerous-command guard<br/>PreToolUse hook · shipped"]
        G2["✅ Git worktrees per task<br/>bare + worktree · shipped"]
        G3["🟡 Default-deny egress<br/>phased plan scoped · Phase 1 YAML pending"]
        G4["✅ GitHub App tokens · 1h TTL<br/>no PAT fallback · shipped"]
    end
    A["🤖 Agent tool call"] --> G1
    G1 -->|allow| G2
    G1 -->|block + event| GB["GuardBlocked<br/>→ metrics dashboard"]
    G2 --> G3
    G3 -->|allow host| NET((internet))
    G3 -->|deny| DROP["🚫"]

    style P0 fill:#fff3e0,stroke:#ef6c00
    style GB fill:#ffebee,stroke:#c62828
    style DROP fill:#ffebee,stroke:#c62828

Four independent guards that sit between agent reasoning and tool execution — deterministic, no LLM in the loop.

| Guard | What it stops | Cost to build |
| --- | --- | --- |
| Dangerous-command guard | sudo, rm -rf /, git push --force, drop table, chmod 777, curl \| sh, unreviewed package installs | ~1 week. Pattern already specified in safety.md; Gastown’s tap_guard_dangerous as reference. No override flag. |
| Git worktrees per task | One agent’s mistake reaching another agent’s workspace | ~1 week. Cursor 2026 pattern, well-trodden. |
| Default-deny egress | Data exfiltration via prompt injection | ~2 weeks. Needs Cilium L7; the biggest ROI prompt-injection defense. |
| GitHub App installation tokens (1h TTL) | Replay of a leaked long-lived PAT | ~3 days. Replaces the classic PAT in agent pods. |

No override flag is the non-obvious choice. “Add --confirm-dangerous and it works” becomes a learned pattern in any agent’s training data. The escape hatch is the human running the command manually outside the agent loop — that’s working as intended, not a gap. Every blocked call emits a GuardBlocked event to rig-conductor, so block counts become visible signal: a spike means a prompt-injection attempt or an agent bug worth looking at.
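
A minimal sketch of what such a deterministic guard could look like, assuming a regex rule per command family. The rule names, the regexes, and the event payload shape are all illustrative assumptions, not the rules in safety.md; the "unreviewed package installs" case is left out because it needs more than a pattern match.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical rules for the command families named above; illustrative only.
DANGEROUS_PATTERNS = {
    "sudo": re.compile(r"\bsudo\b"),
    "rm-rf-root": re.compile(r"\brm\s+-(?=[a-zA-Z]*r)(?=[a-zA-Z]*f)[a-zA-Z]+\s+/(\s|$)"),
    "force-push": re.compile(r"\bgit\s+push\b.*(--force\b|\s-f\b)"),
    "drop-table": re.compile(r"\bdrop\s+table\b", re.IGNORECASE),
    "chmod-777": re.compile(r"\bchmod\s+(-R\s+)?777\b"),
    "curl-pipe-sh": re.compile(r"\bcurl\b[^|]*\|\s*(ba)?sh\b"),
}


@dataclass(frozen=True)
class GuardDecision:
    allowed: bool
    rule: Optional[str] = None  # name of the rule that fired, if any


def check_command(command: str) -> GuardDecision:
    """First matching rule blocks. There is deliberately no override argument."""
    for rule, pattern in DANGEROUS_PATTERNS.items():
        if pattern.search(command):
            return GuardDecision(allowed=False, rule=rule)
    return GuardDecision(allowed=True)


def guard_blocked_event(agent: str, command: str, decision: GuardDecision) -> dict:
    """Assumed shape of a GuardBlocked payload; the real schema may differ."""
    return {"type": "GuardBlocked", "agent": agent, "rule": decision.rule, "command": command}


command = "curl https://example.com/install.sh | sh"
decision = check_command(command)
if not decision.allowed:
    print(guard_blocked_event("rig-dev", command, decision))
```

Because the function takes only the command string, there is nothing an agent can pass to talk its way past a rule; the only way around it is a human running the command outside the loop.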

This unblocks every higher-trust tier. The trust model’s T2 and T3 gates depend on “the agent can’t do the unrecoverable thing without a human.” That’s what this priority buys.

Evidence base: safety.md pillars 1–2, implementation-status.md → Safety domain (8 capabilities; 0 deployed before the 2026-04-21 update, which added 3 deployed and 1 partial).


flowchart LR
    A1["rig-dev pod<br/>CLAUDE_CODE_ENABLE_TELEMETRY=1"] -->|OTLP| OC[OTel Collector]
    A2["rig-reviewer pod"] -->|OTLP| OC
    A3["rig-macos pod"] -->|OTLP| OC
    OC -->|LLM traces| LF["🔭 Langfuse Cloud<br/>startup 50% off<br/>LLM-specific UX"]
    OC -->|infra + trace waterfalls| GC2["📊 Grafana Cloud<br/>$100k startup credit · 12 mo<br/>LGTM stack"]
    DEV["🧑 Dev inner loop"] -->|OTLP localhost| PX["🔥 Phoenix<br/>docker compose<br/>instant feedback"]
    OC -->|infra metrics| GC["Grafana Cloud"]
    LF --> UI1["Quality · cost · per-task UI"]
    GC --> UI2["SLO · error budget"]

    style OC fill:#fff3e0,stroke:#ef6c00
    style LF fill:#e3f2fd,stroke:#1976d2
    style GC fill:#e3f2fd,stroke:#1976d2
    style UI1 fill:#e8f5e9,stroke:#388e3c
    style UI2 fill:#e8f5e9,stroke:#388e3c

Once telemetry is enabled, agents emit OpenTelemetry GenAI spans for every LLM call. An OTel Collector already runs per cluster for rig-conductor, so the remaining work is to flip the env var and ship the spans to a trace store.

Set CLAUDE_CODE_ENABLE_TELEMETRY=1 in agent pods. That single line turns on native OTel emission with GenAI semantic conventions. No code change. One helmrelease.yaml edit per agent.

Then:

  1. Apply for startup credits (Invotek AS qualifies on the plain criteria):
    • Grafana Cloud for Startups: $100k / 12 mo; covers Tempo traces + Loki logs + Mimir metrics + Enterprise plugins. At our 1.5M–15M spans/mo workload, list price is $22–$47/mo, so the credit gives effectively unlimited runway.
    • Langfuse early-stage discount: 50% off first year on Core/Pro; keeps LLM-specific UX (prompt diff, eval scoring, datasets) affordable.
    • Both approvals take 1–2 weeks; free tiers cover the gap.
  2. Wire OTel Collector dual-export: gen_ai.* spans → Langfuse Cloud (Hobby free until discount lands); infra + full OTel → Grafana Cloud (free 50 GB / 14-day until credit lands). One Collector config change.
  3. Phoenix stays for the dev inner loop — engineers run a local docker compose Phoenix while iterating on prompts or agent code. Local latency matters; no network hop during the tight inner loop.
  4. Fallback path, documented: if the Grafana credit is denied, drop back to OpenObserve self-hosted on the rig k3s cluster (~$30/mo flat, zero lock-in, S3/GCS-backed). If the Langfuse discount is denied, stay on Hobby free tier until volume forces a decision.
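
The dual-export rule in step 2 comes down to one routing decision per span. The sketch below models that decision in Python to make the rule explicit; the real change would live in the Collector configuration, and the destination labels and the `destinations` function are hypothetical names.

```python
from typing import Mapping

# Hypothetical destination labels; real exporter names depend on the Collector config.
LLM_STORE = "langfuse"
INFRA_STORE = "grafana"


def destinations(span_attributes: Mapping[str, object]) -> list[str]:
    """Return the stores a span should be exported to.

    Every span goes to the infra store ("infra + full OTel"). Spans carrying
    any gen_ai.* attribute are additionally sent to the LLM trace store.
    """
    targets = [INFRA_STORE]
    if any(key.startswith("gen_ai.") for key in span_attributes):
        targets.append(LLM_STORE)
    return targets


print(destinations({"gen_ai.request.model": "some-model", "gen_ai.usage.input_tokens": 1200}))
print(destinations({"http.request.method": "GET"}))
```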

The second-pass research — startup programs + storage economics (2026-04-21) — supersedes the original options doc on pricing; the structural comparison of 11 candidates in the earlier research still stands.

Right now we know the rig works because three merges landed cleanly on 2026-04-19. We don’t know why a bad run is bad. Every other priority on this list depends on being able to distinguish a healthy agent from an unhealthy one at a glance — cost attribution, quality regression, drift detection, tier promotion, self-healing loops, all of it.

This unblocks Priority 3 (cost dashboards need trace data), Priority 4 (regression metrics need baselines), and every principle that contains the word “measure”.

Evidence base: observability.md TL;DR + implementation-status → Observability domain (7 capabilities; today 1 deployed, 2 partial).


flowchart LR
    classDef off fill:#ffebee,stroke:#c62828,color:#000
    classDef on fill:#e8f5e9,stroke:#388e3c,color:#000
    classDef work fill:#fff3e0,stroke:#ef6c00,color:#000

    L1["1 · Pre-flight prediction<br/>cheap model<br/>abort if over budget"]:::off
    L2["2 · Dispatch token-bucket<br/>rig-conductor<br/>circuit breaker"]:::work
    L3["3 · LiteLLM proxy<br/>per-agent virtual keys<br/>HARD 429 CEILING"]:::off
    L4["4 · Langfuse attribution<br/>post-hoc per-task cost"]:::off

    L1 --> L2 --> L3 --> LLM["LLM provider"]
    LLM --> L4
    L4 -.->|weekly review| L2

Legend: 🔴 not built · 🟡 partial · 🟢 deployed.

Four layers of cost control from pre-flight estimation to post-hoc attribution. Today: only the cheapest, lowest-guarantee layer (TokenUsageProjection) exists. The one that matters — Layer 3, proxy-level hard ceiling — is unbuilt.

LiteLLM proxy with per-agent virtual keys. No override, no trust-based limiter. Returns 429 before the request reaches the LLM provider. A compromised or looping agent cannot exceed its budget because the call fails at the proxy, not at the agent.

Honest caveat from the cost-framework whitepaper: LiteLLM issue #12905 shows user-level budgets are not enforced inside team configs. Treat the proxy as the primary defense, not an absolute one. Every LiteLLM upgrade needs a synthetic budget-overrun test to verify 429 still fires as configured.
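
To make the "429 before the provider" property and the synthetic overrun test concrete, here is a small self-contained model. It does not use LiteLLM or its API; `VirtualKey`, `BudgetProxy`, and `synthetic_overrun_test` are hypothetical names that only illustrate the behaviour the real test job would need to verify.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualKey:
    """Hypothetical per-agent key with a hard daily dollar cap."""
    agent: str
    daily_cap_usd: float
    spent_usd: float = 0.0


@dataclass
class BudgetProxy:
    keys: dict[str, VirtualKey] = field(default_factory=dict)
    forwarded: int = 0  # calls that actually reached the (simulated) provider

    def handle(self, agent: str, estimated_cost_usd: float) -> int:
        """Return 200 if the call is forwarded, 429 if it would exceed the cap."""
        key = self.keys[agent]
        if key.spent_usd + estimated_cost_usd > key.daily_cap_usd:
            return 429  # rejected at the proxy; the provider is never called
        key.spent_usd += estimated_cost_usd
        self.forwarded += 1
        return 200


def synthetic_overrun_test(proxy: BudgetProxy, agent: str, cost_per_call: float) -> bool:
    """Deliberately exceed the cap; pass only if 429 fires with spend still within it."""
    cap = proxy.keys[agent].daily_cap_usd
    for _ in range(1000):  # bounded, so a broken proxy cannot make the test spin forever
        if proxy.handle(agent, cost_per_call) == 429:
            return proxy.keys[agent].spent_usd <= cap
    return False  # never saw a 429: the ceiling is not enforced


proxy = BudgetProxy()
proxy.keys["rig-dev"] = VirtualKey(agent="rig-dev", daily_cap_usd=5.0)
print(synthetic_overrun_test(proxy, "rig-dev", cost_per_call=1.0))  # True
print(proxy.forwarded)  # 5
```

Against a real proxy the same test would issue actual requests with a throwaway key and a tiny cap, then assert on the HTTP status and the provider-side usage.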

One looping agent on a shared provider burns the hourly rate-limit budget for every other agent. Today that’s a trust-based hope (“don’t let agents loop”), not an enforced guarantee. Phase 2 work — prompt caching, pre-flight prediction, circuit breakers — all depend on having the proxy to route through.

This unblocks multi-tenant operation with confidence. A newly onboarded project can be given a virtual key with a hard dollar cap without any per-tenant engineering. “Stage 1 tenant, $50/day hard cap” becomes a LiteLLM config row, not a feature request.

Evidence base: cost-framework.md four layers, implementation-status.md → Cost framework domain (7 capabilities; today 1 deployed, 1 partial).


flowchart LR
    N["🌙 Nightly<br/>~$3–8/run"] --> G[Golden suite<br/>10 internal tasks]
    N --> S[SWE-bench Pro<br/>weekly subset<br/>$20–40/week]
    N --> R[Regression cases<br/>one per past incident]
    G & S & R --> EVAL{Regression<br/>over 10%?}
    EVAL -->|yes| BLOCK["🚫 Fails pipeline<br/>alert + merge block"]
    EVAL -->|no| OK["✅ Green · continue"]
    EVAL -.->|all results| LF["Langfuse trends"]

    style N fill:#fff9c4
    style BLOCK fill:#ffebee,stroke:#c62828
    style OK fill:#e8f5e9,stroke:#388e3c
    style LF fill:#e3f2fd,stroke:#1976d2

A nightly harness runs the rig against a fixed set of tasks and fails the pipeline if any metric regresses more than 10%. Cost estimate from quality-and-evaluation.md: ~$3–8/night, ~$1.1–2.9k/year. Runs in 30–60 minutes. Catches actual regressions in tasks we care about, not synthetic leaderboard tasks.
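
The 10% rule can be sketched as a comparison of tonight's metrics against a stored baseline. The metric names and the direction table below are assumptions for illustration; the source only states the threshold. Baseline values are assumed to be non-zero.

```python
THRESHOLD = 0.10  # "regresses more than 10%"

# Hypothetical metric names. True means higher is better, False means lower is better.
HIGHER_IS_BETTER = {"pass_rate": True, "cost_usd_per_task": False, "minutes_per_task": False}


def regressions(baseline: dict, current: dict, threshold: float = THRESHOLD) -> dict:
    """Return {metric: relative regression} for each metric that worsened by more than threshold."""
    failed = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        worse_by = -change if HIGHER_IS_BETTER[name] else change
        if worse_by > threshold:
            failed[name] = round(worse_by, 3)
    return failed


baseline = {"pass_rate": 0.90, "cost_usd_per_task": 0.62, "minutes_per_task": 20.0}
tonight = {"pass_rate": 0.88, "cost_usd_per_task": 0.75, "minutes_per_task": 21.0}

failed = regressions(baseline, tonight)
print(failed)  # {'cost_usd_per_task': 0.21}
# A CI job would exit non-zero here when `failed` is non-empty, which is what blocks the merge.
```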

Every prompt change, dependency bump, and model upgrade today is a hope that nothing regressed. The brain-and-memory whitepaper claim — “measured today: 20 min issue→merge, $0.62/task” — is a snapshot, not an invariant. Without a nightly gate the numbers drift quietly.

This unblocks autonomy tier promotion. The rig advances from T1 (suggest) to T2 (merge-with-approval) only when 20 successful runs land with zero rollbacks, and that’s measurable only if “successful run” has a fixed definition. The nightly suite is that definition.
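
One possible reading of the promotion rule, as a sketch: count successes in the evaluated window and require that no run in it was rolled back. Whether the 20 runs must be consecutive is not stated, so that interpretation, along with the `RunResult` type and function name, is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    succeeded: bool
    rolled_back: bool


REQUIRED_SUCCESSES = 20  # from the promotion rule above


def eligible_for_t2(history: list[RunResult]) -> bool:
    """True when the window holds at least 20 successful runs and no rollbacks."""
    successes = sum(1 for run in history if run.succeeded)
    rollbacks = sum(1 for run in history if run.rolled_back)
    return successes >= REQUIRED_SUCCESSES and rollbacks == 0


clean = [RunResult(succeeded=True, rolled_back=False)] * 20
print(eligible_for_t2(clean))                                                   # True
print(eligible_for_t2(clean + [RunResult(succeeded=True, rolled_back=True)]))  # False
```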

Also: property-based tests (Hypothesis) on labeled changes, LLM-as-judge sampling (10% T0 / 100% T2), DORA metrics adapted to agents — all depend on the nightly pipeline existing as a scaffold to hang per-PR gates on.
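
The tiered judge sampling could be made deterministic by hashing the task ID, so the same task always gets the same decision on re-runs. This is a sketch under that assumption; only the T0 and T2 rates come from the text, and the function name and hashing scheme are illustrative.

```python
import hashlib

# Rates from the text. Only T0 and T2 are stated, so other tiers are omitted.
JUDGE_SAMPLE_RATE = {"T0": 0.10, "T2": 1.00}


def should_judge(task_id: str, tier: str) -> bool:
    """Deterministically select a stable fraction of tasks for LLM-as-judge review."""
    rate = JUDGE_SAMPLE_RATE[tier]
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < rate


sampled = sum(should_judge(f"task-{i}", "T0") for i in range(10_000))
print(sampled)  # close to 1,000 of 10,000 tasks
print(all(should_judge(f"task-{i}", "T2") for i in range(100)))
```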

Evidence base: quality-and-evaluation.md + implementation-status → Quality and evaluation domain (7 capabilities; today 0 deployed).


gantt
    title Floor-raising roadmap · next ~60 working days
    dateFormat YYYY-MM-DD

    section 1 · Safety floor
    Dangerous-command guard           :active, s1, 2026-04-21, 5d
    Git worktrees per task            :s2, after s1, 5d
    GitHub App tokens (1h TTL)        :s3, after s1, 3d
    Default-deny egress + Cilium L7   :s4, after s2, 10d

    section 2 · Observability
    CLAUDE_CODE_ENABLE_TELEMETRY=1    :o1, after s3, 1d
    Apply Grafana + Langfuse startup credits :o2, after o1, 1d
    OTel Collector dual-export config :o3, after o2, 2d
    Phoenix local docker compose      :o4, after o2, 1d
    Credits-granted vs fallback ADR   :o5, after o3, 1d

    section 3 · Cost ceiling
    LiteLLM proxy deploy              :c1, after o2, 5d
    Per-agent virtual keys + caps     :c2, after c1, 3d
    Synthetic budget-overrun test     :c3, after c2, 2d
    Pre-flight prediction (Haiku)     :c4, after c3, 5d

    section 4 · Quality gate
    Golden suite · 10 tasks           :q1, after c2, 5d
    Nightly harness + alert           :q2, after q1, 5d
    Regression blocker in CI          :q3, after q2, 3d

Not hard-committed dates — shape and sequencing. Actual ticket landing will slip; that’s fine. What matters is the order: floor before ceiling, each layer unblocking the next.


Honest deferrals, with the reason to defer:

flowchart TB
    subgraph NOT["🔭 Deliberately not now"]
        N1["Spec-E / Architect-E<br/>(new agents)"]
        N2["Reproduction harness<br/>(self-healing Stage 2)"]
        N3["Sigstore + SLSA L3 + Kyverno<br/>(supply-chain hardening)"]
        N4["flagd + OpenFeature<br/>(feature-flag platform)"]
        N5["Cross-provider fallback<br/>(LiteLLM fallback_models)"]
    end
    N1 -.->|"reason"| R1[Headline feature · needs<br/>floor first]
    N2 -.-> R2[Frontier work ·<br/>needs quality gate first]
    N3 -.-> R3[Phase 4 · right thing<br/>eventually, not this quarter]
    N4 -.-> R4[YAGNI · env vars +<br/>Kustomize cover today]
    N5 -.-> R5[Deferred · only meaningful<br/>with multi-provider config]

    style NOT fill:#f3e5f5,stroke:#7b1fa2

These are good ideas. They are not the next idea. Any of them built before the four floor layers above compounds technical debt in a system that doesn’t yet have the observability to tell when the debt has become a problem.


How we know this is done, ~60 working days out:

| Criterion | How measured | Status |
| --- | --- | --- |
| Zero sudo, rm -rf /, or git push --force calls land from agent pods | GuardBlocked event count with allow rate 100% on legitimate commands | ✅ Guard shipped + activated; event projection live |
| Tasks run in isolated workspaces with no cross-task leakage | /workspace/tasks/<task-id>/ per task, bare clone reused | ✅ Shipped |
| Agents use only short-lived (≤1h) GitHub credentials | getGitHubToken() returns App-minted token; no PAT fallback when App creds are configured | ✅ Shipped; PAT env var removed from dev-e + review-e pods (2026-04-21) |
| Every agent LLM call is traceable end-to-end in Langfuse Cloud + Grafana Cloud | Spot-check 10 random tasks; all visible with timing, tokens, cost | ⏳ Parked on credit-application + destination pick |
| Proxy enforcement verified by synthetic overrun | Dedicated test job deliberately exceeds a key’s daily cap; 429 fires before provider billed | ⏳ Priority 3 |
| Nightly run completes green 7 days in a row | Grafana dashboard green streak | ⏳ Priority 4 |
| Tier promotion unblocked: T1 → T2 policy engine has data to act on | 20+ successful nightly runs, zero rollbacks, quality metrics within tolerance | ⏳ Depends on Priorities 2–4 |

Then — and only then — the rig is ready for the next class of investment: headline agents (Spec-E, Architect-E), complex refactor capability, multi-runtime portability, reproduction-harness self-healing.


Each priority has a dedicated user story with acceptance criteria, estimated effort, and a GitHub issue.


Same as the brain-and-memory whitepaper: this document uses the target agent names (rig-conductor, rig-dev, rig-reviewer, rig-macos). The running deployment today still uses the original -E suffixes (Dev-E, Review-E, iBuild-E); both forms appear in infrastructure code, Discord channels, and event payloads during the transition.