Self-Healing Production — Canary, SLO Gating, Kill Switches, Repair-E

Capabilities

⚪ flagger-canary — Flagger canary deploys. Phase 5. Flux-native progressive delivery.
⚪ pgroll-migrations — pgroll expand/contract migrations. Phase 5. With inspectable SQL trail hedge.

!!! abstract “TL;DR” Production bugs get detected, diagnosed, fixed, canaried, and promoted within minutes of first SLO impact, with humans only at semantic boundaries. Five stages (know → rollback → diagnose → fix → learn). The trusted rig targets stages 0–3 for our services; stage 4 is aspirational. Even very well-engineered teams (Stripe, GitHub, Cloudflare) mostly do not fully achieve stages 2–3 for logic bugs.

!!! note “Terminology: Repair-E = Dev-E in repair-dispatch mode” This document uses the name “Repair-E” as shorthand for Dev-E dispatched by an SLO-burn alert with a repair-specific system prompt. It is not a separate agent class — same pod class, same model, different trigger + prompt. Earlier drafts framed it as a fifth agent role; honest re-evaluation (see glossary.md) found the event-shaped-boundary test isn’t cleanly met. The name is kept as a convenient label for a dispatch mode, not a separate agent.

The ladder

The realistic self-healing ladder, restated from the conversation:

| Stage | Capability | Target |
|---|---|---|
| 0 | Know prod is broken | OTel + Prometheus + SLOs + error-budget math |
| 1 | Auto-rollback on SLO breach | Flagger + flagd, signed images, trustworthy rollback target |
| 2 | Auto-diagnose | Repair-E reads trace + deploy + git blame, proposes fix with confidence score |
| 3 | Auto-fix + canary + progressive rollout | Reproduction harness, DB migration safety, feedback loop |
| 4 | Learn from incidents | Post-incident projection, prior updates, preemptive detection |

Stages 0-1 are engineering that can ship. Stages 2-3 are frontier work: Cursor, Devin, and Anthropic’s internal tooling all have pieces, but none publicly demonstrates full coverage. Stage 4 is research.

The trusted rig targets stages 0-3 for our services. Stage 4 is aspirational.

The canonical pipeline

sequenceDiagram
    participant P as Prometheus
    participant A as Alertmanager
    participant CE as rig-conductor
    participant R as Router
    participant RE as Repair-E
    participant FD as flagd
    participant F as Flagger
    participant KV as Kyverno
    participant D as Discord

    P->>A: SLO burn-rate exceeds threshold
    A->>CE: EscalationRequired severity P1
    CE->>R: Route by severity
    R->>D: Post to admin channel
    R->>FD: Flip kill switch for affected feature (~30s)
    R->>RE: Dispatch with trace context
    RE->>RE: Pull top-N slow/error traces
    RE->>RE: Extract code.function + code.filepath
    RE->>RE: git log with -S for changed function, last 24h
    RE->>RE: Cross-reference recent deploys
    alt clear diagnosis
        RE->>CE: Propose fix PR (attestation chain)
        CE->>F: Submit Canary
        F->>P: Run canary metric analysis (success rate, p99 latency)
        alt canary passes
            F->>KV: Promote (attested)
            KV->>KV: Verify signatures
            KV-->>F: Admitted
            F->>F: Progressive rollout in 10% steps to 50%, then promote to 100%
            F->>CE: Promoted
            CE->>FD: Clear kill switch
        else canary fails
            F->>CE: Aborted
            CE->>R: Escalate to P0
        end
    else ambiguous
        RE->>CE: Low confidence — escalate to human
        R->>D: P0 DM with mention
    end

Every arrow is an event. Every decision is attested. Every metric is in the dashboards.

Stage 0: Know production is broken

Signals

Burn rate — current error rate projected forward; a Honeycomb-style 4h forward look triggers P1
Latency p99 regression — 2× week-over-week baseline for 5 minutes
Error rate spike — 3σ above rolling hourly baseline
Synthetic probe failure — constant-QPS synthetic traffic catches what user traffic misses at low QPS
Dependency failure — upstream service unreachable or 5xx spike
Deployment correlation — within 15 min of a deploy, any of the above is elevated severity
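
The statistical signals above can be sketched in a few lines. This is a minimal illustration, not rig-conductor's detector: the function names and sample values are assumptions made for the example, and the "sustained for 5 minutes" condition on the latency signal is left out for brevity.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev


def error_rate_spike(hourly_baseline: list[float], current: float, sigmas: float = 3.0) -> bool:
    """True when `current` is more than `sigmas` standard deviations above the baseline mean."""
    return current > mean(hourly_baseline) + sigmas * pstdev(hourly_baseline)


def latency_regression(p99_now_ms: float, p99_week_ago_ms: float, factor: float = 2.0) -> bool:
    """True when p99 is at least `factor` times the week-over-week baseline."""
    return p99_now_ms >= factor * p99_week_ago_ms


def within_deploy_window(signal_at: datetime, deployed_at: datetime,
                         window: timedelta = timedelta(minutes=15)) -> bool:
    """True when the signal fired no more than `window` after the deploy."""
    return timedelta(0) <= signal_at - deployed_at <= window


# Hypothetical sample data: hourly error rates and a deploy at 12:00.
baseline = [0.010, 0.012, 0.011, 0.009, 0.010]
deploy = datetime(2026, 1, 10, 12, 0)

spike = error_rate_spike(baseline, current=0.05)
# A spike that lands inside the deploy window is the case that gets elevated severity.
elevated = spike and within_deploy_window(datetime(2026, 1, 10, 12, 9), deploy)
```

In a real deployment these checks would more likely live in Prometheus alerting rules; the Python form is only meant to make the thresholds concrete.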

Why synthetic probes matter at small scale

At < 10 QPS, organic traffic is statistical noise: a single 500 can burn on the order of 10% of an hourly budget. Constant-rate synthetic probes (every 15s, say) provide a signal baseline that doesn’t depend on user traffic. In practice this means Prometheus Blackbox Exporter plus scheduled probes that hit the service’s health endpoints and key user journeys.

Error budget projection

Per service, compute:

budget_remaining = (1 - SLO_target) * total_window_events - failed_events
burn_rate = failed_events_current_rate / failed_events_budgeted_rate

Honeycomb’s pattern: alert when failed_events_current_rate * 4h > budget_remaining (at the current failure rate, we’d exhaust the remaining budget within 4h). rig-conductor projects this per service and exposes it as GET /api/services/{name}/budget.
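
As a worked example of the projection, here is a minimal sketch. The function name, the `BudgetStatus` type, and the sample numbers are assumptions for illustration; it is not the code behind the rig-conductor endpoint.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    budget_remaining: float  # failures the service can still absorb in the SLO window
    burn_rate: float         # 1.0 means consuming budget exactly as fast as budgeted
    alert: bool              # True when the lookahead projection exhausts the budget


def project_budget(slo_target: float, total_window_events: float, failed_events: float,
                   current_failures_per_hour: float, window_hours: float,
                   lookahead_hours: float = 4.0) -> BudgetStatus:
    budget_total = (1 - slo_target) * total_window_events
    budget_remaining = budget_total - failed_events
    # Budgeted rate: the whole budget spread evenly over the SLO window.
    budgeted_failures_per_hour = budget_total / window_hours
    burn_rate = current_failures_per_hour / budgeted_failures_per_hour
    # Forward look: failures expected over the next `lookahead_hours` at the current rate.
    projected_failures = current_failures_per_hour * lookahead_hours
    return BudgetStatus(budget_remaining, burn_rate, projected_failures > budget_remaining)


# Hypothetical service: 99.9% SLO, 1M events over a 30-day (720h) window,
# 400 failures so far, currently failing 200 requests per hour.
status = project_budget(0.999, 1_000_000, 400, current_failures_per_hour=200, window_hours=720)
calm = project_budget(0.999, 1_000_000, 400, current_failures_per_hour=100, window_hours=720)
```

With these numbers roughly 600 failures of budget remain; 200 failures per hour projects to 800 over 4h and alerts, while 100 per hour projects to 400 and does not.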

Stage 1: Auto-rollback on SLO breach

Flagger as the default deploy path

Every service in the rig gets a Flagger Canary resource. No service deploys via raw Deployment apply.

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payments-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-service
  service:
    port: 8080
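  analysis:
    interval: 1m
    threshold: 5           # failed metric checks before rollback
    maxWeight: 50
    stepWeight: 10
    metrics:
    # Custom metric names: each is assumed to point at a MetricTemplate via templateRef (omitted here)
    - name: success-rate
      thresholdRange: { min: 99 }    # percent
      interval: 1m
    - name: latency-p99
      thresholdRange: { max: 500 }   # milliseconds, assuming the query returns ms
      interval: 1m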
    webhooks:
    - name: rig-conductor-notify
      url: http://rig-conductor.rig-conductor.svc:8080/api/events
      timeout: 5s
      metadata:
        type: CanaryPhase

Rollout: 10% canary for 1 minute → analysis passes → 20% → … → 50% → promotion. Once failed metric checks reach the threshold (5), the rollout aborts and traffic returns to the primary; maxWeight: 50 means we never canary past half traffic before full promotion.
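
The weight schedule implied by stepWeight and maxWeight can be computed directly. This is a small sketch with hypothetical helper names; it assumes maxWeight is a multiple of stepWeight and uses the usual estimate of interval × (maxWeight / stepWeight) for the minimum time to promotion.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Traffic percentages the canary passes through before promotion."""
    if step_weight <= 0 or max_weight < step_weight or max_weight % step_weight:
        raise ValueError("sketch assumes maxWeight is a positive multiple of stepWeight")
    return list(range(step_weight, max_weight + 1, step_weight))


def min_minutes_to_promote(interval_minutes: float, step_weight: int, max_weight: int) -> float:
    """Lower bound on rollout time: one analysis interval per weight step."""
    return interval_minutes * len(canary_weights(step_weight, max_weight))


weights = canary_weights(step_weight=10, max_weight=50)
minutes = min_minutes_to_promote(interval_minutes=1, step_weight=10, max_weight=50)
```

For the Canary above this gives five steps (10 through 50) and at least five minutes of analysis before promotion.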

Why Flagger over Argo Rollouts

Flux-native. Wraps existing Deployment resources rather than requiring a swap to a new Rollout CRD. Webhooks at every phase (pre-rollout, confirm-promotion, post-rollout) are the natural place to plug in rig-conductor decisions. Argo Rollouts is the better choice when ArgoCD is the GitOps tool; it isn’t for us, and the recurring Flux-vs-Rollouts field-drift fights confirm this. See tool-choices.md for the full evaluation.

flagd as the faster kill switch

!!! warning “YAGNI caveat” Feature flags at our current scale (1-2 humans, few services, no A/B testing need) are arguably overkill — env vars + Kustomize overlays cover deploy-time toggles for zero operational cost. Adopt flagd when we have a concrete runtime-toggle or targeting need. See tool-choices.md for the honest YAGNI discussion and alternatives (Flipt, GrowthBook, PostHog flags). Note that Unleash reached OSS EOL 2025-12-31 — explicitly reject.

Rollback takes 5 minutes (canary re-promotion of the previous version). A feature flag flip takes 30 seconds. For incident response, flag-kill > rollback.

OpenFeature + flagd pattern:

# feature-flags.yaml (in Flux-managed repo)
apiVersion: core.openfeature.dev/v1beta1
kind: FeatureFlag
metadata:
  name: payments-flags
spec:
  flagSpec:
    flags:
      new-payment-path:
        state: ENABLED
        # Quote "on"/"off": unquoted, YAML 1.1 parsers read them as booleans
        variants: { "on": true, "off": false }
        defaultVariant: "on"

To kill: a PR sets defaultVariant: "off", Flux reconciles in ~30s, every pod picks up the change through its flagd sidecar, and the feature is disabled globally. No deploy, no rollback.
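
To make the kill mechanism concrete, here is a toy resolver over a flag definition shaped like the one above. It is a sketch only: `resolve_flag` is a hypothetical helper, not flagd's evaluation engine or the OpenFeature SDK, and it ignores targeting rules entirely.

```python
def resolve_flag(flag: dict, fallback: bool) -> bool:
    """Resolve a boolean flag from a flagd-style definition (simplified, no targeting)."""
    if flag.get("state") != "ENABLED":
        # A disabled flag falls back to the caller's in-code default.
        return fallback
    variants = flag.get("variants", {})
    return variants.get(flag.get("defaultVariant"), fallback)


live = {"state": "ENABLED", "variants": {"on": True, "off": False}, "defaultVariant": "on"}
# The kill-switch PR changes exactly one field.
killed = {**live, "defaultVariant": "off"}

before = resolve_flag(live, fallback=False)
after = resolve_flag(killed, fallback=False)
```

The point of the design is that the code path stays deployed; only the resolved value changes, so the safe in-code default matters as much as the flag itself.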

DB migration safety: pgroll (with hedge)

The rule: every migration splits into expand (backward-compatible additive) → deploy dual-write code → contract (destructive) → deploy read-new code. Each as a separate deploy. No NOT NULL on first deploy. No column rename as a single step. No destructive DDL in the same release as code that depends on the new shape.

pgroll automates this for Postgres: creates shadow columns, backfills, installs triggers for dual-write, keeps both schema versions queryable via views. A migration YAML declares the intended final shape; pgroll generates and executes the safe intermediate steps.
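
The expand/contract rule can also be checked mechanically before a release. The sketch below is a deliberately crude illustration with hypothetical names and regex patterns; it is not part of pgroll and will misclassify some statements (for example, an `IS NOT NULL` predicate).

```python
import re

# Statement shapes treated as contract-phase (destructive or not backward-compatible).
CONTRACT_PATTERNS = [
    r"\bDROP\s+(COLUMN|TABLE)\b",
    r"\bRENAME\s+(COLUMN|TO)\b",
    r"\bNOT\s+NULL\b",
]


def is_contract(statement: str) -> bool:
    """True when the statement matches one of the contract-phase patterns."""
    return any(re.search(p, statement, re.IGNORECASE) for p in CONTRACT_PATTERNS)


def release_is_safe(statements: list[str]) -> bool:
    """A single release may be all-expand or all-contract, never a mix of both."""
    return len({is_contract(s) for s in statements}) <= 1


expand_only = ["ALTER TABLE users ADD COLUMN email_new text"]
mixed = expand_only + ["ALTER TABLE users DROP COLUMN email"]
```

A check like this only enforces the "separate deploys" part of the rule; whether the deployed code still depends on the old shape has to be verified elsewhere.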

!!! warning “Single-vendor bus factor hedge” pgroll is Apache-2.0 but Xata-driven (~27 employees, still operating). If Xata folds, there’s no big-company co-maintainer. Hedge: pgroll migration files are pgroll-specific YAML, not portable SQL, so the files alone are not a portable history. The correct hedge is to commit a parallel SQL trail (pgroll can emit generated SQL) alongside each operation YAML, so schema history stays reconstructible if we ever have to migrate to Flyway or Atlas. See tool-choices.md#db-migration-safety.

!!! danger “Cloudflare Dec 5 2025: the emergency-fast-path lesson” Cloudflare’s December 5, 2025 post-mortem: gradual rollouts for code, but their global config system bypassed gradual rollout by design for speed. A config change detonated globally in seconds — 25-minute global outage.

**Our rule**: every mutable surface (code, config, feature flags, Kyverno policies, AGENTS.md, SLA definitions) flows through the same staged rollout pipeline. No fast path. Enforceable by Kyverno admission policies that deny emergency paths.

Stage 2: Auto-diagnose (Repair-E)

The pipeline

sequenceDiagram
    participant AL as Alert
    participant RE as Repair-E
    participant OT as OTel / Grafana
    participant G as GitHub / git
    participant CE as rig-conductor

    AL->>RE: Invoke with service, alert_type, SLO_target
    RE->>OT: Query top-N slow/error traces last 5min
    OT-->>RE: Spans with code.function, code.filepath, service.version
    RE->>G: git log -S for changed function on service.version
    G-->>RE: Commit history touching that function
    RE->>CE: Query recent deploy events for service
    CE-->>RE: Deploy timestamps + commit SHAs
    RE->>RE: Correlate alert time to deploy time to commit
    RE->>RE: Propose fix (revert or forward-fix)
    RE->>CE: DiagnosisComplete with commit, confidence, action
    alt confidence high
        RE->>G: Open PR with fix
    else
        RE->>CE: Escalate to human (ambiguous)
    end

What Repair-E actually sees

Inputs:

Alert metadata (service, SLO, burn rate, timestamp)
Top-N traces from OTel (by error or latency)
Code location from span attributes (code.function, code.filepath, code.namespace)
Recent commits touching that location (git log -S)
Recent deploy events from rig-conductor
Related OpenTelemetry logs via trace_id correlation
Recent error messages from Sentry/Loki

Outputs:

Structured diagnosis: {root_cause, affected_commit, confidence}
Proposed fix: PR or feature-flag-kill decision
Attestation chain (Repair-E identity, trace IDs consulted, reasoning hash)

Confidence thresholds — derived, not self-reported

!!! warning “LLM self-reported confidence is uncalibrated” Earlier drafts quoted numeric thresholds (”> 0.8 auto-fix, 0.5–0.8 propose, < 0.5 human”) as if the LLM could emit a meaningful self-confidence score. It cannot. LLM self-reported confidence is famously uncalibrated: the agent says “95% confident” with the same tone whether it’s right or wrong. Confidence is a derived metric, not a self-report.

Confidence is computed from four measurable signals available at diagnosis time, each scored 0–1:

| Signal | How it’s measured | Why it’s a proxy for correctness |
|---|---|---|
| Deploy-to-alert correlation strength | Minutes between the most recent deploy and first error signal (from rig-conductor deploy events + Prometheus burn alert) | Shorter gap → deploy is more likely the root cause |
| Trace-to-commit precision | Does the offending span’s `code.function` + `code.filepath` appear in the recent commit’s diff? (git blame intersection) | Direct topology-match = high precision |
| Test coverage of the affected path | Coverage report for the file/function, pulled from CI artifacts | High coverage means the change is less likely to hide a logic bug in covered territory |
| Historical same-signature fix success | Lookup in rig-conductor’s incident-history projection: have we seen this trace-fingerprint before, and did prior fixes survive 24h? | Known pattern with known resolution |

These four signals combine (configurable weights, default equal) into a single score. Derived, not guessed.
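
One way to combine the signals is a weighted mean, sketched below. The text only says the signals "combine" with configurable weights, so the weighted mean itself, the field names, and the sample values are assumptions for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class DiagnosisSignals:
    deploy_alert_correlation: float
    trace_commit_precision: float
    path_test_coverage: float
    historical_fix_success: float


def derived_confidence(signals: DiagnosisSignals,
                       weights: Optional[dict[str, float]] = None) -> float:
    """Weighted mean of the four signals; equal weights unless overridden."""
    values = asdict(signals)
    weights = weights or {name: 1.0 for name in values}
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total_weight = sum(weights[name] for name in values)
    return sum(values[name] * weights[name] for name in values) / total_weight


signals = DiagnosisSignals(0.9, 1.0, 0.6, 0.5)
score = derived_confidence(signals)
# Hypothetical override: trust deploy correlation twice as much as the rest.
weighted = derived_confidence(signals, {
    "deploy_alert_correlation": 2.0, "trace_commit_precision": 1.0,
    "path_test_coverage": 1.0, "historical_fix_success": 1.0,
})
```

Nothing in the score comes from the model's own opinion of itself; every input is something rig-conductor can measure at diagnosis time.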

Calibration — the score itself must be measured

Thresholds “auto-fix / propose / human” are not fixed numbers — they are tuned by measuring predicted confidence against actual fix-survives-24h outcomes over rolling N incidents. Process:

Start conservative: high auto-fix threshold (e.g., 0.85), most incidents go to human
After each incident, record (predicted score, outcome)
After 20+ incidents with known outcomes, fit the threshold so the auto-fix bucket shows ≥95% fix-survives-24h
Propose/human buckets tune similarly (propose bucket: 70-95% survival; human bucket: <70%)
Re-tune quarterly — never freeze the thresholds, since the model, the codebase, and the failure mode distribution all drift

Until 20+ calibration incidents have landed, everything is human-driven regardless of the predicted score. The auto-fix bucket literally does not exist yet. This is a measurement-gated capability, not a day-one feature.
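
The gating and threshold fit described above could look roughly like this. It is a sketch under assumptions: the function name is hypothetical, the candidate thresholds are simply the observed scores, and the fit stops at the first bucket that misses the target, which is one conservative reading of "start conservative".

```python
from typing import Optional

MIN_INCIDENTS = 20       # below this, the auto-fix bucket does not exist
TARGET_SURVIVAL = 0.95   # required fix-survives-24h rate in the auto-fix bucket


def fit_auto_fix_threshold(history: list[tuple[float, bool]]) -> Optional[float]:
    """history holds (predicted score, fix survived 24h) pairs.

    Returns the lowest score threshold for which the at-or-above bucket, and every
    stricter bucket, meets TARGET_SURVIVAL; None while calibration is still gated.
    """
    if len(history) < MIN_INCIDENTS:
        return None
    best = None
    for threshold in sorted({score for score, _ in history}, reverse=True):
        bucket = [survived for score, survived in history if score >= threshold]
        if sum(bucket) / len(bucket) < TARGET_SURVIVAL:
            break
        best = threshold
    return best


# Made-up calibration data: 20 incidents with known outcomes.
history = ([(0.90, True)] * 10 + [(0.80, True)] * 5
           + [(0.60, True)] * 3 + [(0.60, False)] * 2)
threshold = fit_auto_fix_threshold(history)
gated = fit_auto_fix_threshold(history[:19])
```

With this data the bucket at 0.80 and above survives 15 out of 15, while lowering the threshold to 0.60 drops survival to 18 out of 20, so the fit settles on 0.80. With only 19 incidents the function returns None and everything stays human-driven.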

Current thresholds (provisional until calibration)

All three buckets route to human until 20+ incidents have calibrated the scoring
During calibration, Repair-E still proposes (and logs the predicted score), but never auto-fixes — the human either applies, modifies, or rejects
After calibration, thresholds become real — initially conservative (e.g., auto-fix only above 0.85 if the 95% survival criterion holds)

This is honestly measurement-gated progress, not aspirational numbers treated as real.

Reproduction harness

Before a proposed fix is promoted past canary, it must reproduce the failure in a sandbox:

Ephemeral namespace — kubectl create namespace repair-{incident_id}
Service deploy — the buggy version
Traffic replay — recorded requests from the failing trace window, replayed via Envoy tap or service-specific replay tooling
Assert failure — verify the bug manifests
Apply fix — deploy Repair-E’s proposed patch
Re-run — verify the fix resolves

Only fixes that reproduce-then-resolve in the harness are dispatched to the real canary. The reproduction harness is the single most important artifact separating “AI-generated looks-like-a-fix” from “verified-to-work fix.”
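
The reproduce-then-resolve sequence can be expressed as a small orchestration function. This is a sketch, not the harness itself: the `deploy`, `replay_fails`, and `teardown` callables are hypothetical stand-ins for whatever actually talks to the cluster and the replay tooling.

```python
from typing import Callable


def reproduce_then_resolve(incident_id: str, buggy_version: str, fixed_version: str,
                           deploy: Callable[[str, str], None],
                           replay_fails: Callable[[str], bool],
                           teardown: Callable[[str], None]) -> str:
    """Return 'verified' only if the bug reproduces and the fix then resolves it."""
    namespace = f"repair-{incident_id}"
    try:
        deploy(namespace, buggy_version)
        if not replay_fails(namespace):
            return "not-reproduced"      # cannot prove anything about the fix
        deploy(namespace, fixed_version)
        if replay_fails(namespace):
            return "fix-ineffective"
        return "verified"
    finally:
        teardown(namespace)              # the sandbox is always ephemeral


# In-memory fakes standing in for the real cluster and replay tooling.
state = {"version": None, "torn_down": []}

def fake_deploy(namespace: str, version: str) -> None:
    state["version"] = version

def fake_replay_fails(namespace: str) -> bool:
    return state["version"] == "v1-buggy"

def fake_teardown(namespace: str) -> None:
    state["torn_down"].append(namespace)

result = reproduce_then_resolve("inc-42", "v1-buggy", "v1-fixed",
                                fake_deploy, fake_replay_fails, fake_teardown)
```

Only a "verified" result would hand the fix to the real canary; "not-reproduced" means the harness learned nothing and the incident goes to a human.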

The state of the art — honesty

As of early 2026, no production system publicly demonstrates full auto-diagnose + auto-reproduce + auto-fix + auto-canary for logic bugs. Components exist:

Datadog Bits AI SRE, Rootly AI, Resolve.ai, incident.io — AI-assisted diagnosis, human-approved fix
Cursor Cloud Agents, Cognition Devin — AI-authored fix + PR, human review
Harness Self-Healing — partial pipeline automation

The trusted rig’s claim: we wire these components into a closed loop. The novelty is the integration, not the individual pieces.

Stage 3: Auto-fix + canary + progressive rollout

The feedback loop

Repair-E’s fix follows the same canary pipeline as any other change:

Attestation — Repair-E commits with gitsign, the image builds with SLSA provenance, cosign-signed
Kyverno admission — verifies the attestation chain, admits to the canary namespace
Flagger canary — 10% → analysis → 20% → … → 50% → promotion to 100%
Post-promotion monitoring — 3× the canary interval after promotion, alert still armed
Observation — after 24h, rig-conductor queries the post-incident health and updates Repair-E’s track record

The fix succeeds only if it survives 24h in production. “Deployed” is not “done.”

T3 bypass: never

Even for urgent fixes, T3 changes never bypass the two-attestor policy. A destructive DB migration to fix a production bug requires human co-sign. The principle: production urgency is not a reason to weaken safety guarantees. Kill-switch first (no destructive migration needed), then careful human-driven repair.

What this does not fix

Bugs in logic that manifest only at scale or under specific data conditions the sandbox doesn’t reproduce
Bugs in shared infrastructure (rig-conductor itself, Flux, cluster networking) — meta-bugs requiring human intervention
Bugs whose fix requires new business-logic decisions — falls to human semantic judgment
Novel failure modes with no prior-incident pattern to match — Repair-E’s confidence drops below threshold, human-driven

Stage 4: Learn (aspirational)

After every auto-resolved incident:

Structured incident record: SLI that fired, trace IDs, diff, decision log, time-to-resolve
Open a GitHub Issue with a templated post-mortem (Rootly/incident.io pattern)
Tag the fix PR with the incident ID; cross-link
When a similar signature fires, Repair-E retrieves prior fixes first

Post-incident learning at small scale is a 200-line rig-conductor handler plus a Langfuse eval template that scores future Repair-E proposals against the historical resolution log.

Stage 4 is where the rig starts actively improving itself. It is the goal, not a near-term deliverable.

Blast radius of self-healing

Self-healing expands the rig’s autonomy. That expansion must be bounded by tier policy:

| Action | Blast radius | Who decides |
|---|---|---|
| Flip kill switch | Contained (one feature flag) | Repair-E auto, with attested reason |
| Roll back to previous version | Contained (one service) | Repair-E auto |
| Forward-fix PR (code-only) | T1 | Repair-E auto, through canary |
| Forward-fix PR (config) | T1-T2 | Repair-E with Review-E gate; T2 if config spans services |
| Forward-fix PR (schema change) | T2 | Repair-E proposes, human approves interface |
| Forward-fix PR (auth/payments/destructive) | T3 | Human drives, Repair-E assists |

The tier classification at intake (Spec-E) applies at fix-time (Repair-E). The policy is unified.
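
The table translates naturally into a lookup that fails closed. The action keys and the `Decider` enum below are hypothetical names for illustration, and routing unknown actions to a human is an assumption in the spirit of the policy rather than something the table states.

```python
from enum import Enum


class Decider(Enum):
    REPAIR_E_AUTO = "Repair-E auto"
    REPAIR_E_VIA_CANARY = "Repair-E auto, through canary"
    REVIEW_E_GATE = "Repair-E with Review-E gate"
    HUMAN_APPROVES = "Repair-E proposes, human approves interface"
    HUMAN_DRIVES = "Human drives, Repair-E assists"


# One entry per row of the blast-radius table.
POLICY: dict[str, Decider] = {
    "kill-switch": Decider.REPAIR_E_AUTO,
    "rollback": Decider.REPAIR_E_AUTO,
    "fix-code": Decider.REPAIR_E_VIA_CANARY,
    "fix-config": Decider.REVIEW_E_GATE,
    "fix-schema": Decider.HUMAN_APPROVES,
    "fix-sensitive": Decider.HUMAN_DRIVES,   # auth, payments, destructive (T3)
}


def who_decides(action: str) -> Decider:
    """Unknown actions fail closed: a human drives."""
    return POLICY.get(action, Decider.HUMAN_DRIVES)
```

For example, `who_decides("fix-schema")` returns the human-approval gate, and an unrecognised action such as `"rotate-secrets"` is routed to a human by default.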

Metrics that mark success

The weekly self-healing dashboard:

Mean-time-to-detect (MTTD) — from production incident to alert firing
Mean-time-to-escalate (MTTE) — from alert to Discord notification
Mean-time-to-diagnose (MTTDiag) — from dispatch to Repair-E diagnosis committed
Mean-time-to-fix (MTTF) — from diagnosis to canary-promoted fix
Mean-time-to-resolve (MTTR) — total time from incident start to budget restoration
False-positive rollback rate — canary aborts where no actual bug
Fix-survives-24h rate — of auto-fixed incidents, % that don’t revert within 24h
Auto-resolve rate — % of incidents resolved without human intervention
Human-override rate — % of Repair-E proposals humans rejected or modified
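
The time-based metrics fall out of a handful of timestamps per incident. The sketch below assumes a hypothetical `IncidentTimeline` record; the field names are illustrative, and MTTR is read here as incident start to budget restoration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    started: datetime          # when production actually broke
    alerted: datetime          # alert fired
    notified: datetime         # Discord notification sent
    dispatched: datetime       # Repair-E dispatched
    diagnosed: datetime        # diagnosis committed
    promoted: datetime         # fix promoted through canary
    budget_restored: datetime  # error budget back to healthy


def stage_durations(t: IncidentTimeline) -> dict[str, timedelta]:
    """Per-incident durations; averaging them across incidents gives the dashboard means."""
    return {
        "MTTD": t.alerted - t.started,
        "MTTE": t.notified - t.alerted,
        "MTTDiag": t.diagnosed - t.dispatched,
        "MTTF": t.promoted - t.diagnosed,
        "MTTR": t.budget_restored - t.started,
    }


def mean_duration(durations: list[timedelta]) -> timedelta:
    return sum(durations, timedelta(0)) / len(durations)


# Made-up incident for illustration.
t0 = datetime(2026, 1, 10, 12, 0, 0)
timeline = IncidentTimeline(
    started=t0,
    alerted=t0 + timedelta(seconds=40),
    notified=t0 + timedelta(seconds=55),
    dispatched=t0 + timedelta(seconds=60),
    diagnosed=t0 + timedelta(minutes=4),
    promoted=t0 + timedelta(minutes=12),
    budget_restored=t0 + timedelta(minutes=14),
)
durations = stage_durations(timeline)
```

The rate metrics (false-positive rollback, fix-survives-24h, auto-resolve, human-override) are plain ratios over the same incident records and are omitted here.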

Target values

For the trusted rig (end state, not today):

MTTD: < 1 minute (synthetic probe or burn-rate alert)
MTTE: < 30 seconds (flagd kill switch flipped via git-commit-to-reconcile)
MTTDiag: < 5 minutes for T1 bugs with clear trace-to-commit correlation
MTTR: < 15 minutes for T1 bugs; < 1 hour for T2 requiring human approval
Fix-survives-24h: > 80%
Auto-resolve rate: > 60% of T1 incidents

These are aggressive but consistent with published pilot data from Stripe, Cursor, and Datadog Bits AI SRE.

The honest limits

Zero downtime is aspirational, not absolute. Under catastrophic failure (full cluster outage, Postgres corruption), human intervention is mandatory.
T3 incidents do not self-heal. Auth bugs, payment bugs, and destructive data issues require human decision.
Novel bugs are slower. Without prior-incident patterns, Repair-E’s confidence is low, and humans drive.
Reproduction harness coverage is finite. If the bug only manifests under specific load or data, the sandbox may not reproduce it, and auto-fix cannot proceed.
The feedback loop takes time. An auto-fix that “works” in canary but fails 2 days later is caught by the 24h survival metric, but during those 2 days it’s not visible as a failure.

What NOT to do

No emergency fast path. Even when SLO is burning, every change flows through the same gated pipeline. Cloudflare Dec 2025 is the lesson.
No skipping canary for “obvious” fixes. The obvious-fix-that-breaks-everything is a documented failure class.
No LLM-judged automatic promotion. Promotion is SLO-gated by Prometheus analysis, not LLM-reviewed. Deterministic gate.
No auto-fix on T3. Never. Humans drive.
No silent rollback. Every rollback emits events, updates dashboards, opens a post-mortem issue.
No persistent staging environment that diverges from prod. Reproduction harness is ephemeral, created from recent prod state, destroyed after incident. Long-lived staging drifts.

Phase-by-phase exit criteria

Tied to the roadmap in index.md:

Phase 5 (self-healing) exit criteria:

Only when every checkbox is checked does the phase close.
