
Self-Healing Production — Canary, SLO Gating, Kill Switches, Repair-E

  • flagger-canary — Flagger canary deploys. Phase 5. Flux-native progressive delivery.
  • pgroll-migrations — pgroll expand/contract migrations. Phase 5. With inspectable SQL trail hedge.

!!! abstract "TL;DR"
    Production bugs get detected, diagnosed, fixed, canaried, and promoted within minutes of first SLO impact, with humans involved only at semantic boundaries. There are five stages (know → rollback → diagnose → fix → learn). The trusted rig targets stages 0–3 for our services; stage 4 is aspirational. Most very well-engineered teams (Stripe, GitHub, Cloudflare) do not fully achieve stages 2–3 for logic bugs.

!!! note "Terminology: Repair-E = Dev-E in repair-dispatch mode"
    This document uses the name “Repair-E” as shorthand for Dev-E dispatched by an SLO-burn alert with a repair-specific system prompt. It is not a separate agent class: same pod class, same model, different trigger and prompt. Earlier drafts framed it as a fifth agent role; honest re-evaluation (see glossary.md) found that the event-shaped-boundary test isn’t cleanly met. The name is kept as a convenient label for a dispatch mode, not a separate agent.

The realistic self-healing ladder, restated from the conversation:

| Stage | Capability | Target |
|-------|------------|--------|
| 0 | Know prod is broken | OTel + Prometheus + SLOs + error-budget math |
| 1 | Auto-rollback on SLO breach | Flagger + flagd, signed images, trustworthy rollback target |
| 2 | Auto-diagnose | Repair-E reads trace + deploy + git blame, proposes fix with confidence score |
| 3 | Auto-fix + canary + progressive rollout | Reproduction harness, DB migration safety, feedback loop |
| 4 | Learn from incidents | Post-incident projection, prior updates, preemptive detection |

Stages 0-1 are engineering that can ship today. Stages 2-3 are frontier work: Cursor, Devin, and Anthropic’s internal tooling all have pieces, but none publicly demonstrates full coverage. Stage 4 is research.

The trusted rig targets stages 0-3 for our services. Stage 4 is aspirational.

sequenceDiagram
    participant P as Prometheus
    participant A as Alertmanager
    participant CE as rig-conductor
    participant R as Router
    participant RE as Repair-E
    participant FD as flagd
    participant F as Flagger
    participant KV as Kyverno
    participant D as Discord

    P->>A: SLO burn-rate exceeds threshold
    A->>CE: EscalationRequired severity P1
    CE->>R: Route by severity
    R->>D: Post to admin channel
    R->>FD: Flip kill switch for affected feature (~30s)
    R->>RE: Dispatch with trace context
    RE->>RE: Pull top-N slow/error traces
    RE->>RE: Extract code.function + code.filepath
    RE->>RE: git log with -S for changed function, last 24h
    RE->>RE: Cross-reference recent deploys
    alt clear diagnosis
        RE->>CE: Propose fix PR (attestation chain)
        CE->>F: Submit Canary
        F->>P: Run AnalysisTemplate (success rate, p99 latency)
        alt canary passes
            F->>KV: Promote (attested)
            KV->>KV: Verify signatures
            KV-->>F: Admitted
            F->>F: Progressive rollout 5% 25% 50% 100%
            F->>CE: Promoted
            CE->>FD: Clear kill switch
        else canary fails
            F->>CE: Aborted
            CE->>R: Escalate to P0
        end
    else ambiguous
        RE->>CE: Low confidence — escalate to human
        R->>D: P0 DM with mention
    end

Every arrow is an event. Every decision is attested. Every metric is in the dashboards.

  • Burn rate — current error rate projected forward; a Honeycomb-style 4h forward look triggers P1
  • Latency p99 regression — 2× week-over-week baseline for 5 minutes
  • Error rate spike — 3σ above rolling hourly baseline
  • Synthetic probe failure — constant-QPS synthetic traffic catches what user traffic misses at low QPS
  • Dependency failure — upstream service unreachable or 5xx spike
  • Deployment correlation — within 15 min of a deploy, any of the above is elevated severity
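
A few of the signals above reduce to simple threshold checks. The sketch below is illustrative only: the function names, the `"elevated"`/`"normal"` labels, and the input shapes are assumptions, and a real rig would more likely express these as Prometheus alerting rules.

```python
from statistics import mean, stdev


def is_error_spike(hourly_error_rates, current_rate, sigmas=3.0):
    """True when current_rate sits more than `sigmas` standard deviations
    above the rolling hourly baseline (needs at least two baseline points)."""
    baseline = mean(hourly_error_rates)
    spread = stdev(hourly_error_rates)
    return current_rate > baseline + sigmas * spread


def is_p99_regression(p99_samples_ms, baseline_p99_ms, factor=2.0):
    """True when every p99 sample in the window (e.g. 5 minutes) is at
    least `factor` times the week-over-week baseline."""
    return bool(p99_samples_ms) and all(
        sample >= factor * baseline_p99_ms for sample in p99_samples_ms
    )


def severity(signal_fired, minutes_since_deploy, correlation_window_min=15):
    """Raise severity when a signal fires shortly after a deploy.
    The labels are hypothetical placeholders."""
    if not signal_fired:
        return None
    recent = (
        minutes_since_deploy is not None
        and minutes_since_deploy <= correlation_window_min
    )
    return "elevated" if recent else "normal"
```

For example, `severity(True, 7)` returns `"elevated"`, while the same signal 40 minutes after a deploy returns `"normal"`.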

Why synthetic probes matter at small scale


At < 10 QPS, organic traffic is statistical noise. A single 500 burns 10% of an hourly budget. Constant-rate synthetic probes (every 15s, say) provide a signal baseline that doesn’t depend on user traffic. The setup is Prometheus Blackbox Exporter plus scheduled probes that hit the service’s health endpoints and key user journeys.

Per service, compute:

budget_remaining = (1 - SLO_target) * total_window_events - failed_events
burn_rate = failed_events_current_rate / failed_events_budgeted_rate

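Honeycomb’s pattern: alert when failed_events_current_rate * 4h > budget_remaining (at the current failure rate, we’d exhaust the budget within 4h). Note that the multiplier is the current failure rate in events per hour, not the dimensionless burn_rate ratio. rig-conductor projects this per service and exposes it as GET /api/services/{name}/budget.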
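
The budget arithmetic can be sketched in a few lines. This is a minimal illustration, not the rig-conductor implementation: the function names are assumptions, and the look-ahead check is read as "current failures per hour, projected over 4 hours, exceed what is left of the budget".

```python
def budget_remaining(slo_target, total_window_events, failed_events):
    """Failures still allowed in the SLO window."""
    return (1 - slo_target) * total_window_events - failed_events


def burn_rate(current_failure_rate, budgeted_failure_rate):
    """How many times faster than budgeted we are failing (1.0 = on budget)."""
    return current_failure_rate / budgeted_failure_rate


def exhausts_within(current_failures_per_hour, remaining, lookahead_hours=4.0):
    """True when the current failure rate would use up the remaining
    budget inside the look-ahead window."""
    return current_failures_per_hour * lookahead_hours > remaining
```

With a 99.9% SLO over 1,000,000 events and 400 failures so far, about 600 failures remain; at 200 failures per hour the 4h projection (800) exceeds that, so the alert would fire.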

Every service in the rig gets a Flagger Canary resource. No service deploys via raw Deployment apply.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-service
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5 # failed metric checks → abort
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: success-rate
        thresholdRange: { min: 99 }
        interval: 1m
      - name: latency-p99
        thresholdRange: { max: 500 }
        interval: 1m
    webhooks:
      - name: rig-conductor-notify
        url: http://rig-conductor.rig-conductor.svc:8080/api/events
        timeout: 5s
        metadata:
          type: CanaryPhase

Rollout: with stepWeight: 10, the canary takes 10% of traffic for 1 minute → analysis passes → 20% → … → 50% → promotion. Any run of failed analyses that reaches the threshold aborts; maxWeight: 50 means we never canary past half of traffic before full promotion.
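
As a rough check on the numbers in the spec above, the traffic schedule and timing follow from `stepWeight`, `maxWeight`, `interval`, and `threshold`. The sketch assumes Flagger advances the canary weight in `stepWeight` increments up to `maxWeight` and runs one analysis per interval; the helper names are hypothetical.

```python
def canary_weights(step_weight, max_weight):
    """Traffic percentages the canary steps through before promotion."""
    return list(range(step_weight, max_weight + 1, step_weight))


def minimum_promotion_minutes(interval_minutes, step_weight, max_weight):
    """Lower bound on analysis time: one passing interval per step."""
    return interval_minutes * len(canary_weights(step_weight, max_weight))


def rollback_minutes(interval_minutes, threshold):
    """Time for `threshold` failed checks to accumulate and abort."""
    return interval_minutes * threshold
```

For `stepWeight: 10`, `maxWeight: 50`, and `interval: 1m`, this gives the steps 10, 20, 30, 40, 50 and at least 5 minutes of analysis before promotion.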

Flagger is Flux-native: it wraps existing Deployment resources rather than requiring a swap to a new Rollout CRD. Its webhooks at every phase (pre-rollout, confirm-promotion, post-rollout) are the natural place to plug in rig-conductor decisions. Argo Rollouts is the better choice when ArgoCD is the GitOps tool; it isn’t for us, and the recurring Flux-vs-Rollouts field-drift fights confirm this. See tool-choices.md for the full evaluation.

!!! warning "YAGNI caveat"
    Feature flags at our current scale (1-2 humans, few services, no A/B testing need) are arguably overkill: env vars + Kustomize overlays cover deploy-time toggles at zero operational cost. Adopt flagd when we have a concrete runtime-toggle or targeting need. See tool-choices.md for the honest YAGNI discussion and alternatives (Flipt, GrowthBook, PostHog flags). Note that Unleash reached OSS EOL on 2025-12-31, so we explicitly reject it.

Rollback takes 5 minutes (canary re-promotion of the previous version). A feature flag flip takes 30 seconds. For incident response, flag-kill > rollback.

OpenFeature + flagd pattern:

# feature-flags.yaml (in Flux-managed repo)
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: payments-flags
spec:
  flagSpec:
    flags:
      new-payment-path:
        state: ENABLED
        # Quoted so YAML does not parse on/off as booleans
        variants: { "on": true, "off": false }
        defaultVariant: "on"

To kill: a PR changes defaultVariant to "off", Flux reconciles in ~30s, every pod sees the new value via its flagd sidecar, and the feature is disabled globally. No deploy, no rollback.
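
From the application’s point of view, the kill switch is just a default variant that changes underneath it. The following is a deliberately simplified model of that resolution step, not flagd’s actual evaluation logic: it ignores targeting rules, and `resolve_flag` and the `fallback` parameter are hypothetical names.

```python
def resolve_flag(flag_spec, key, fallback):
    """Return the value of the flag's default variant, or `fallback`
    when the flag is missing or not ENABLED."""
    flag = flag_spec.get("flags", {}).get(key)
    if flag is None or flag.get("state") != "ENABLED":
        return fallback
    return flag["variants"][flag["defaultVariant"]]


live = {
    "flags": {
        "new-payment-path": {
            "state": "ENABLED",
            "variants": {"on": True, "off": False},
            "defaultVariant": "on",
        }
    }
}

# The kill-switch PR only changes defaultVariant.
killed = {
    "flags": {
        "new-payment-path": {**live["flags"]["new-payment-path"],
                             "defaultVariant": "off"}
    }
}
```

Callers keep asking for `new-payment-path`; only the reconciled flag definition differs between `live` and `killed`.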

The rule: every migration splits into expand (backward-compatible, additive) → deploy dual-write code → deploy read-new code → contract (destructive). The contract step comes last, once nothing reads or writes the old shape. Each step is a separate deploy. No NOT NULL on first deploy. No column rename as a single step. No destructive DDL in the same release as code that depends on the new shape.
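
These rules lend themselves to a cheap pre-merge lint. The sketch below is an assumption-heavy illustration (regex matching on raw SQL strings, hypothetical function name and messages); it is not how pgroll works and would miss many real-world DDL forms.

```python
import re

ADD_COLUMN = re.compile(r"\bADD\s+COLUMN\b", re.IGNORECASE)
NOT_NULL = re.compile(r"\bNOT\s+NULL\b", re.IGNORECASE)
RENAME_COLUMN = re.compile(r"\bRENAME\s+COLUMN\b", re.IGNORECASE)
DROP = re.compile(r"\bDROP\s+(?:TABLE|COLUMN)\b", re.IGNORECASE)


def lint_release(statements):
    """Return rule violations for the DDL statements of one release."""
    problems = []
    for stmt in statements:
        if ADD_COLUMN.search(stmt) and NOT_NULL.search(stmt):
            problems.append(f"NOT NULL on first deploy: {stmt}")
        if RENAME_COLUMN.search(stmt):
            problems.append(f"single-step column rename: {stmt}")
    expands = any(ADD_COLUMN.search(s) for s in statements)
    contracts = any(DROP.search(s) for s in statements)
    if expands and contracts:
        problems.append("expand and contract in the same release")
    return problems
```

An additive-only release such as `["ALTER TABLE t ADD COLUMN email_v2 text"]` passes with no findings.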

pgroll automates this for Postgres: creates shadow columns, backfills, installs triggers for dual-write, keeps both schema versions queryable via views. A migration YAML declares the intended final shape; pgroll generates and executes the safe intermediate steps.

!!! warning "Single-vendor bus factor hedge"
    pgroll is Apache-2.0 but Xata-driven (~27 employees, still operating). If Xata folds, there’s no big-company co-maintainer. Hedge (corrected): pgroll migration files are pgroll-specific YAML, not portable SQL. The correct hedge is to commit a parallel SQL trail (pgroll can emit generated SQL) alongside each operation YAML, so schema history stays reconstructible if we ever have to migrate to Flyway or Atlas. See tool-choices.md#db-migration-safety.

!!! danger "Cloudflare Dec 5 2025: the emergency-fast-path lesson"
    Cloudflare’s December 5, 2025 post-mortem: code went through gradual rollouts, but the global config system bypassed gradual rollout by design, for speed. A config change detonated globally in seconds and caused a 25-minute global outage.

    **Our rule**: every mutable surface (code, config, feature flags, Kyverno policies, AGENTS.md, SLA definitions) flows through the same staged rollout pipeline. No fast path. Enforceable by Kyverno admission policies that deny emergency paths.

sequenceDiagram
    participant AL as Alert
    participant RE as Repair-E
    participant OT as OTel / Grafana
    participant G as GitHub / git
    participant CE as rig-conductor

    AL->>RE: Invoke with service, alert_type, SLO_target
    RE->>OT: Query top-N slow/error traces last 5min
    OT-->>RE: Spans with code.function, code.filepath, service.version
    RE->>G: git log -S for changed function on service.version
    G-->>RE: Commit history touching that function
    RE->>CE: Query recent deploy events for service
    CE-->>RE: Deploy timestamps + commit SHAs
    RE->>RE: Correlate alert time to deploy time to commit
    RE->>RE: Propose fix (revert or forward-fix)
    RE->>CE: DiagnosisComplete with commit, confidence, action
    alt confidence high
        RE->>G: Open PR with fix
    else
        RE->>CE: Escalate to human (ambiguous)
    end

Inputs:

  • Alert metadata (service, SLO, burn rate, timestamp)
  • Top-N traces from OTel (by error or latency)
  • Code location from span attributes (code.function, code.filepath, code.namespace)
  • Recent commits touching that location (git log -S)
  • Recent deploy events from rig-conductor
  • Related OpenTelemetry logs via trace_id correlation
  • Recent error messages from Sentry/Loki

Outputs:

  • Structured diagnosis: {root_cause, affected_commit, confidence}
  • Proposed fix: PR or feature-flag-kill decision
  • Attestation chain (Repair-E identity, trace IDs consulted, reasoning hash)
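
One possible shape for the diagnosis output and its attestation payload is sketched below. The field names beyond `root_cause`, `affected_commit`, and `confidence` are assumptions, as is the choice of SHA-256 over the reasoning text for the "reasoning hash"; signing the payload is out of scope here.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Diagnosis:
    root_cause: str
    affected_commit: str
    confidence: float
    trace_ids: tuple      # trace IDs consulted (assumed field)
    agent_identity: str   # e.g. the Repair-E workload identity (assumed field)


def reasoning_hash(reasoning_text):
    """Hex SHA-256 of the reasoning transcript."""
    return hashlib.sha256(reasoning_text.encode("utf-8")).hexdigest()


def attestation_payload(diagnosis, reasoning_text):
    """Deterministic JSON document that a signer could attest to."""
    body = asdict(diagnosis)
    body["reasoning_sha256"] = reasoning_hash(reasoning_text)
    return json.dumps(body, sort_keys=True)
```

Sorting keys keeps the payload byte-stable, so the same diagnosis always produces the same bytes to sign.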

Confidence thresholds — derived, not self-reported


!!! warning "LLM self-reported confidence is uncalibrated"
    Earlier drafts quoted numeric thresholds ("> 0.8 auto-fix, 0.5–0.8 propose, < 0.5 human") as if the LLM could emit a meaningful self-confidence score. It cannot. LLM self-reported confidence is famously uncalibrated: the agent says “95% confident” with the same tone whether it’s right or wrong. Confidence is a derived metric, not a self-report.

Confidence is computed from four measurable signals available at diagnosis time, each scored 0–1:

| Signal | How it’s measured | Why it’s a proxy for correctness |
|--------|-------------------|----------------------------------|
| Deploy-to-alert correlation strength | Minutes between the most recent deploy and first error signal (from rig-conductor deploy events + Prometheus burn alert) | Shorter gap → deploy is more likely the root cause |
| Trace-to-commit precision | Does the offending span’s code.function + code.filepath appear in the recent commit’s diff? (git blame intersection) | Direct topology-match = high precision |
| Test coverage of the affected path | Coverage report for the file/function, pulled from CI artifacts | High coverage means the change is less likely to be a logic bug in covered territory |
| Historical same-signature fix success | Lookup in rig-conductor’s incident-history projection: have we seen this trace-fingerprint before, and did prior fixes survive 24h? | Known pattern with known resolution |

These four signals combine (configurable weights, default equal) into a single score. Derived, not guessed.
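
A weighted average is one straightforward way to combine the four signals. The sketch assumes each signal has already been normalised to 0–1; the signal keys and the function name are hypothetical.

```python
SIGNALS = (
    "deploy_alert_correlation",
    "trace_commit_precision",
    "test_coverage",
    "historical_fix_success",
)


def derived_confidence(scores, weights=None):
    """Weighted mean of the four signal scores (equal weights by default)."""
    weights = weights or {name: 1.0 for name in SIGNALS}
    for name in SIGNALS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {scores[name]}")
    total_weight = sum(weights[name] for name in SIGNALS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(weights[name] * scores[name] for name in SIGNALS) / total_weight
```

Because the score is a plain function of measured inputs, each incident’s score can be recomputed and audited later.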

Calibration — the score itself must be measured


Thresholds “auto-fix / propose / human” are not fixed numbers — they are tuned by measuring predicted confidence against actual fix-survives-24h outcomes over rolling N incidents. Process:

  1. Start conservative: high auto-fix threshold (e.g., 0.85), most incidents go to human
  2. After each incident, record (predicted score, outcome)
  3. After 20+ incidents with known outcomes, fit the threshold so the auto-fix bucket shows ≥95% fix-survives-24h
  4. Propose/human buckets tune similarly (propose bucket: 70-95% survival; human bucket: <70%)
  5. Re-tune quarterly — never freeze the thresholds, since the model, the codebase, and the failure mode distribution all drift

Until 20+ calibration incidents have landed, everything is human-driven regardless of the predicted score. The auto-fix bucket literally does not exist yet. This is a measurement-gated capability, not a day-one feature.
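
The calibration gate can be expressed as a small fitting routine. This is a sketch under assumptions: `history` is a list of `(predicted_score, survived_24h)` pairs, candidate thresholds are taken from the observed scores, and the function name is hypothetical.

```python
def fit_autofix_threshold(history, min_incidents=20, target_survival=0.95):
    """Return the lowest observed score such that incidents scoring at or
    above it survived 24h at >= target_survival, or None when there is
    too little data or no threshold qualifies."""
    if len(history) < min_incidents:
        return None  # measurement gate: no auto-fix bucket yet
    best = None
    for threshold in sorted({score for score, _ in history}, reverse=True):
        bucket = [survived for score, survived in history if score >= threshold]
        if sum(bucket) / len(bucket) >= target_survival:
            best = threshold
    return best
```

Returning `None` keeps every incident human-driven, which matches the rule that the auto-fix bucket does not exist until enough outcomes have been recorded.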

Current thresholds (provisional until calibration)

  • All three buckets route to human until 20+ incidents have calibrated the scoring
  • During calibration, Repair-E still proposes (and logs the predicted score), but never auto-fixes — the human either applies, modifies, or rejects
  • After calibration, thresholds become real — initially conservative (e.g., auto-fix only above 0.85 if the 95% survival criterion holds)

This is measurement-gated progress, not aspirational numbers treated as real.

Before a proposed fix is promoted past canary, it must reproduce the failure in a sandbox:

  • Ephemeral namespace — kubectl create namespace repair-{incident_id}
  • Service deploy — the buggy version
  • Traffic replay — recorded requests from the failing trace window, replayed via Envoy tap or service-specific replay tooling
  • Assert failure — verify the bug manifests
  • Apply fix — deploy Repair-E’s proposed patch
  • Re-run — verify the fix resolves

Only fixes that reproduce-then-resolve in the harness are dispatched to the real canary. The reproduction harness is the single most important artifact separating “AI-generated looks-like-a-fix” from “verified-to-work fix.”
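
The reproduce-then-resolve gate is essentially a small state machine. In the sketch below the sandbox operations are injected as callables, so nothing here talks to a cluster; the function name, the callables, and the result strings are all hypothetical.

```python
def reproduce_then_resolve(deploy, replay, buggy_version, fixed_version, teardown):
    """Gate a proposed fix on the sandbox.

    deploy(version) deploys into the ephemeral namespace.
    replay() replays recorded traffic and returns True if the failure manifests.
    teardown() destroys the namespace and always runs.
    """
    try:
        deploy(buggy_version)
        if not replay():
            return "not-reproduced"   # cannot verify anything; no auto-fix
        deploy(fixed_version)
        if replay():
            return "fix-ineffective"  # bug still present with the patch
        return "verified"             # safe to send to the real canary
    finally:
        teardown()
```

Only the `"verified"` outcome would be dispatched to the canary; the other two outcomes fall back to a human.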

As of early 2026, no production system publicly demonstrates full auto-diagnose + auto-reproduce + auto-fix + auto-canary for logic bugs. Components exist:

  • Datadog Bits AI SRE, Rootly AI, Resolve.ai, incident.io — AI-assisted diagnosis, human-approved fix
  • Cursor Cloud Agents, Cognition Devin — AI-authored fix + PR, human review
  • Harness Self-Healing — partial pipeline automation

The trusted rig’s claim: we wire these components into a closed loop. The novelty is the integration, not the individual pieces.

Stage 3: Auto-fix + canary + progressive rollout


Repair-E’s fix follows the same canary pipeline as any other change:

  1. Attestation — Repair-E commits with gitsign, the image builds with SLSA provenance, cosign-signed
  2. Kyverno admission — verifies the attestation chain, admits to the canary namespace
  3. Flagger canary — 5% → analysis → 15% → … → 100%
  4. Post-promotion monitoring — 3× the canary interval after promotion, alert still armed
  5. Observation — after 24h, rig-conductor queries the post-incident health and updates Repair-E’s track record

The fix succeeds only if it survives 24h in production. “Deployed” is not “done.”

Even for urgent fixes, T3 changes never bypass the two-attestor policy. A destructive DB migration to fix a production bug requires human co-sign. The principle: production urgency is not a reason to weaken safety guarantees. Kill-switch first (no destructive migration needed), then careful human-driven repair.

  • Bugs in logic that manifest only at scale or under specific data conditions the sandbox doesn’t reproduce
  • Bugs in shared infrastructure (rig-conductor itself, Flux, cluster networking) — meta-bugs requiring human intervention
  • Bugs whose fix requires new business-logic decisions — falls to human semantic judgment
  • Novel failure modes with no prior-incident pattern to match — Repair-E’s confidence drops below threshold, human-driven

After every auto-resolved incident:

  • Structured incident record: SLI that fired, trace IDs, diff, decision log, time-to-resolve
  • Open a GitHub Issue with a templated post-mortem (Rootly/incident.io pattern)
  • Tag the fix PR with the incident ID; cross-link
  • When a similar signature fires, Repair-E retrieves prior fixes first

Post-incident learning at small scale is a 200-line rig-conductor handler plus a Langfuse eval template that scores future Repair-E proposals against the historical resolution log.

Stage 4 is where the rig starts actively improving itself. It is the goal, not a near-term deliverable.

Self-healing expands the rig’s autonomy. That expansion must be bounded by tier policy:

| Action | Blast radius | Who decides |
|--------|--------------|-------------|
| Flip kill switch | Contained (one feature flag) | Repair-E auto, with attested reason |
| Roll back to previous version | Contained (one service) | Repair-E auto |
| Forward-fix PR (code-only) | T1 | Repair-E auto, through canary |
| Forward-fix PR (config) | T1-T2 | Repair-E with Review-E gate; T2 if config spans services |
| Forward-fix PR (schema change) | T2 | Repair-E proposes, human approves interface |
| Forward-fix PR (auth/payments/destructive) | T3 | Human drives, Repair-E assists |

The tier classification at intake (Spec-E) applies at fix-time (Repair-E). The policy is unified.
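
The tier table can be encoded as a lookup with a default-deny fallback. The action keys and decider labels below are hypothetical encodings of the rows above, not identifiers from the rig.

```python
# action -> (blast radius, who decides)
TIER_POLICY = {
    "flip_kill_switch": ("contained", "auto"),
    "rollback": ("contained", "auto"),
    "forward_fix_code": ("T1", "auto_via_canary"),
    "forward_fix_config": ("T1-T2", "review_gate"),
    "forward_fix_schema": ("T2", "human_approves"),
    "forward_fix_sensitive": ("T3", "human_drives"),
}

AUTO_DECIDERS = {"auto", "auto_via_canary"}


def may_auto_apply(action):
    """True only for actions Repair-E may take without a human or
    review gate. Unknown actions are denied."""
    entry = TIER_POLICY.get(action)
    return entry is not None and entry[1] in AUTO_DECIDERS
```

Denying unknown actions by default means a newly added action type requires an explicit policy row before it can run unattended.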

The weekly self-healing dashboard:

  • Mean-time-to-detect (MTTD) — from production incident to alert firing
  • Mean-time-to-escalate (MTTE) — from alert to Discord notification
  • Mean-time-to-diagnose (MTTDiag) — from dispatch to Repair-E diagnosis committed
  • Mean-time-to-fix (MTTF) — from diagnosis to canary-promoted fix
  • Mean-time-to-resolve (MTTR) — total time from the start of the incident (where the MTTD clock starts) to budget restoration
  • False-positive rollback rate — canary aborts where no actual bug
  • Fix-survives-24h rate — of auto-fixed incidents, % that don’t revert within 24h
  • Auto-resolve rate — % of incidents resolved without human intervention
  • Human-override rate — % of Repair-E proposals humans rejected or modified
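
Most of these dashboard numbers are means over timestamp pairs or simple ratios. The sketch assumes each incident is a dict of `datetime` values and booleans; the key names (`started`, `alerted`, `auto_fixed`, `reverted_within_24h`) are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Mean duration in minutes between two timestamps, over the
    incidents that recorded both. None when no incident qualifies."""
    durations = [
        (i[end_key] - i[start_key]).total_seconds() / 60
        for i in incidents
        if start_key in i and end_key in i
    ]
    return mean(durations) if durations else None


def fix_survival_rate(incidents):
    """Share of auto-fixed incidents whose fix was not reverted within 24h."""
    auto_fixed = [i for i in incidents if i.get("auto_fixed")]
    if not auto_fixed:
        return None
    survived = sum(1 for i in auto_fixed if not i.get("reverted_within_24h"))
    return survived / len(auto_fixed)
```

MTTD would then be `mean_minutes(incidents, "started", "alerted")`, and the other mean-time metrics follow by swapping the timestamp keys.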

For the trusted rig (end state, not today):

  • MTTD: < 1 minute (synthetic probe or burn-rate alert)
  • MTTE: < 30 seconds (flagd kill switch flipped via git-commit-to-reconcile)
  • MTTDiag: < 5 minutes for T1 bugs with clear trace-to-commit correlation
  • MTTR: < 15 minutes for T1 bugs; < 1 hour for T2 requiring human approval
  • Fix-survives-24h: > 80%
  • Auto-resolve rate: > 60% of T1 incidents

These are aggressive but consistent with published pilot data from Stripe, Cursor, and Datadog Bits AI SRE.

  • Zero downtime is aspirational, not absolute. Under catastrophic failure (full cluster outage, Postgres corruption), human intervention is mandatory.
  • T3 incidents do not self-heal. Auth bugs, payment bugs, and destructive data issues require human decision.
  • Novel bugs are slower. Without prior-incident patterns, Repair-E’s confidence is low, and humans drive.
  • Reproduction harness coverage is finite. If the bug only manifests under specific load or data, the sandbox may not reproduce it, and auto-fix cannot proceed.
  • The feedback loop takes time. An auto-fix that “works” in canary but fails 2 days later is caught by the 24h survival metric, but during those 2 days it is not visible as a failure.
  • No emergency fast path. Even when SLO is burning, every change flows through the same gated pipeline. Cloudflare Dec 2025 is the lesson.
  • No skipping canary for “obvious” fixes. The obvious-fix-that-breaks-everything is a documented failure class.
  • No LLM-judged automatic promotion. Promotion is SLO-gated by Prometheus analysis, not LLM-reviewed. Deterministic gate.
  • No auto-fix on T3. Never. Humans drive.
  • No silent rollback. Every rollback emits events, updates dashboards, opens a post-mortem issue.
  • No persistent staging environment that diverges from prod. Reproduction harness is ephemeral, created from recent prod state, destroyed after incident. Long-lived staging drifts.

Tied to the roadmap in index.md:

Phase 5 (self-healing) exit criteria:

  • Flagger canary operates on every production service
  • flagd feature flag sidecars injected via OpenFeature Operator
  • pgroll gates every DB migration; non-pgroll migrations rejected by CI
  • Error-budget projection live in rig-conductor with per-service breakdown
  • SLO burn-rate alerts route through rig-conductor to Discord with severity routing
  • Repair-E dispatches on P1 alerts, logs diagnosis with attestation
  • Reproduction harness ephemeral-namespace pattern works for at least one service end-to-end
  • Kill-switch latency measured < 60s from commit to pod-observed-change
  • 24h fix-survival rate measured on dashboard
  • Documented runbook for when self-healing fails (on-call procedure)

Only when every checkbox is checked does the phase close.