
Self-auditing rig — stuck-issue detection, memory, and recognition loop

On 2026-04-24 the rig ran on manual triage all day. The orchestrator noticed “I don’t see Review-E reviews” → investigated → filed rc#216/#217 → fix landed → verified. Every step was human-initiated. The rig should do most of this itself.

Today the rig has orphan detection (rc#127 reconciliation loop, 5-min TTL) but it only drives stuck issues to terminal states. It does not:

  • Investigate WHY they got stuck (what event sequence preceded the stuck state)
  • File issues when a new stuck pattern appears
  • Update agent memory or BRAIN.md so the next occurrence is handled faster
  • Summarise stuck-state metrics over time

Result: the same gotchas get re-discovered in every operator session.

flowchart LR
    A[DETECT] --> B[INVESTIGATE]
    B --> C[FILE]
    C --> D[FIX]
    D --> E[VERIFY]
    E --> F[MEMORISE]
    F --> G[RECOGNISE]
    G -.->|known fingerprint| A

| Step | Actor | Mechanism |
| --- | --- | --- |
| DETECT | stuck-watcher cron (*/10 * * * *) | Queries /api/issues; checks age thresholds |
| INVESTIGATE | stuck-watcher | Reads last 10 events + log snippet; computes fingerprint |
| FILE | stuck-watcher | Opens stuck-pattern GH issue if fingerprint is novel; comments if known |
| FIX | Dev-E or operator | Normal issue assignment via rig-conductor |
| VERIFY | rig-conductor | PR merged + issue closed |
| MEMORISE | PR merge webhook | Calls write_memory(kind=gotcha, fingerprint=...) |
| RECOGNISE | Agent session start | read_memories surfaces known pattern before work begins |

The stuck-watcher is a */10 * * * * cron persona (reusing Pi-E or a new watchdog-e) that flags any issue matching one of these checks:

| Check | Threshold |
| --- | --- |
| in_progress with no cli_progress event | > 30 min |
| in_review with no review_assigned event | > 20 min |
| merge_conflict terminal state | any age |
| ready_to_merge state | > 15 min |
| envelope_timed_out count | > 0 |
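
The checks in the table above could be expressed roughly as follows. This is a minimal sketch, not the watcher's actual code: the `IssueSnapshot` fields and the `stuck_reasons` helper are hypothetical names, and the real /api/issues response may be shaped differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Thresholds taken from the table above.
NO_PROGRESS_LIMIT = timedelta(minutes=30)
NO_REVIEW_LIMIT = timedelta(minutes=20)
READY_TO_MERGE_LIMIT = timedelta(minutes=15)


@dataclass
class IssueSnapshot:
    """Hypothetical view of one issue; the real API schema may differ."""
    state: str
    state_entered_at: datetime
    last_cli_progress_at: Optional[datetime] = None
    review_assigned: bool = False
    envelope_timed_out_count: int = 0


def stuck_reasons(issue: IssueSnapshot, now: datetime) -> List[str]:
    """Return every threshold the issue currently violates (empty if healthy)."""
    reasons: List[str] = []
    time_in_state = now - issue.state_entered_at

    if issue.state == "in_progress":
        # Measure silence from the last cli_progress event, or from state entry.
        last_signal = issue.last_cli_progress_at or issue.state_entered_at
        if now - last_signal > NO_PROGRESS_LIMIT:
            reasons.append("in_progress: no cli_progress for > 30 min")
    elif issue.state == "in_review":
        if not issue.review_assigned and time_in_state > NO_REVIEW_LIMIT:
            reasons.append("in_review: no review_assigned for > 20 min")
    elif issue.state == "merge_conflict":
        # Flagged at any age.
        reasons.append("merge_conflict terminal state")
    elif issue.state == "ready_to_merge":
        if time_in_state > READY_TO_MERGE_LIMIT:
            reasons.append("ready_to_merge for > 15 min")

    # Checked independently of state: any timed-out envelope is a flag.
    if issue.envelope_timed_out_count > 0:
        reasons.append("envelope_timed_out count > 0")
    return reasons


now = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
silent = IssueSnapshot("in_progress", now - timedelta(minutes=45))
print(stuck_reasons(silent, now))
```

Returning a list of reasons rather than a boolean lets one issue carry several flags at once, which may be useful when the investigation template is filled in.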

For each flagged issue:

  1. Compute fingerprint (see Fingerprinting below)
  2. If novel → open new stuck-pattern GH issue with investigation template body
  3. If known → comment on existing issue with fresh occurrence timestamp + count
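
The novel-versus-known branch in the steps above might look like this. It is a sketch under stated assumptions: `plan_action`, `occurrence_comment`, and the label-to-issue-number mapping are hypothetical, and how the watcher actually looks up existing stuck-fp: labels is not specified here.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


def plan_action(label: str, issues_by_label: Dict[str, int]) -> Tuple[str, Optional[int]]:
    """Decide whether to file a new issue or comment on an existing one.

    `issues_by_label` maps a stuck-fp: label to the number of the issue
    already carrying it (hypothetical shape).
    """
    issue_number = issues_by_label.get(label)
    if issue_number is None:
        return ("open_issue", None)      # novel fingerprint
    return ("comment", issue_number)     # known fingerprint


def occurrence_comment(count: int, seen_at: datetime) -> str:
    """Comment text carrying the fresh occurrence timestamp and running count."""
    return f"Occurrence {count} seen at {seen_at.isoformat()}"


known = {"stuck-fp:a3f2b1c9": 240}  # illustrative issue number
print(plan_action("stuck-fp:a3f2b1c9", known))
print(occurrence_comment(3, datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)))
```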

When stuck-watcher opens an issue, the body includes:

  • Last 10 events from the event stream
  • Last log snippet if available
  • Fingerprint (SHA1 of event type sequence + repo slug)
  • Links to similar past issues sharing the same stuck-fp: label
  • Suggested questions for the assigned Dev-E or operator
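
One way to assemble the body listed above is a plain string renderer. The sketch below is illustrative only: `render_issue_body`, its parameters, and the section wording are assumptions, not the template the watcher ships with.

```python
from typing import List, Optional, Sequence


def render_issue_body(
    events: Sequence[str],
    log_snippet: Optional[str],
    fingerprint: str,
    similar_issues: Sequence[str],
    questions: Sequence[str],
) -> str:
    """Render the investigation body for a new stuck-pattern issue."""
    lines: List[str] = ["Last events:"]
    # Only the 10 most recent events are included.
    lines += [f"- {event}" for event in list(events)[-10:]]
    lines += ["", "Log snippet:", log_snippet or "(not available)"]
    lines += ["", f"Fingerprint: {fingerprint}"]
    lines += ["", "Similar past issues:"]
    lines += [f"- {ref}" for ref in similar_issues] or ["- none"]
    lines += ["", "Suggested questions:"]
    lines += [f"- {question}" for question in questions]
    return "\n".join(lines)


body = render_issue_body(
    events=["cli_started", "agent_stuck", "envelope_timed_out"],
    log_snippet=None,
    fingerprint="<sha1 of event sequence>",
    similar_issues=[],
    questions=["What changed before the first occurrence?"],
)
print(body)
```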

When a PR closes a stuck-pattern issue, the PR merge workflow calls rig-memory-mcp/write_memory:

scope: rig
kind: gotcha
title: <stuck-pattern title>
content: |
  Pattern: <what happened>
  Root cause: <why it happened>
  Fix: <how to resolve>
fingerprint: <sha1 of event sequence>
tags: [conductor, lifecycle, stuck-pattern]
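
The payload above could be built by the merge workflow along these lines. This is a sketch only: `build_gotcha_memory` is a hypothetical helper, and how the resulting dictionary is actually handed to rig-memory-mcp/write_memory is not covered here.

```python
from typing import Any, Dict


def build_gotcha_memory(
    title: str, pattern: str, root_cause: str, fix: str, fingerprint: str
) -> Dict[str, Any]:
    """Assemble the write_memory payload fields shown above."""
    content = f"Pattern: {pattern}\nRoot cause: {root_cause}\nFix: {fix}\n"
    return {
        "scope": "rig",
        "kind": "gotcha",
        "title": title,
        "content": content,
        "fingerprint": fingerprint,
        "tags": ["conductor", "lifecycle", "stuck-pattern"],
    }


memory = build_gotcha_memory(
    title="<stuck-pattern title>",
    pattern="<what happened>",
    root_cause="<why it happened>",
    fix="<how to resolve>",
    fingerprint="<sha1 of event sequence>",
)
print(memory["content"])
```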

A generated ## Known stuck patterns block in BRAIN.md sourced from memory tag stuck-pattern:

| Column | Source |
| --- | --- |
| Pattern | memory title |
| Fingerprint | stuck-fp: label prefix |
| Fix | PR that resolved the originating issue |

Refreshed on write_memory call or nightly cron.
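
Generating the block could be as simple as rendering one table row per memory. The following is a sketch under assumptions: the memory records are plain dictionaries, the `fix_pr` key is a hypothetical field holding the resolving PR reference, and rows are sorted by title only to keep regenerated output stable.

```python
from typing import Dict, List


def render_known_patterns(memories: List[Dict[str, str]]) -> str:
    """Render the generated BRAIN.md section from stuck-pattern memories."""
    lines = [
        "## Known stuck patterns",
        "",
        "| Pattern | Fingerprint | Fix |",
        "| --- | --- | --- |",
    ]
    # Sort by title so that regenerating the block yields a stable diff.
    for memory in sorted(memories, key=lambda m: m["title"]):
        label = "stuck-fp:" + memory["fingerprint"][:8]
        lines.append(f"| {memory['title']} | {label} | {memory['fix_pr']} |")
    return "\n".join(lines) + "\n"


print(render_known_patterns([
    {"title": "Example pattern", "fingerprint": "0123456789abcdef", "fix_pr": "<PR ref>"},
]))
```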

Add a “Stuck watch” card to the rig overview: count of active stuck-pattern issues, link to the label filter on rig-conductor.

Why sha1(last_3_event_types + "|" + repo_slug)?

Stuck patterns in the rig are almost always caused by a specific event sequence — not random noise. The last 3 event types before the stuck state form a minimal but discriminating signature:

  • Long enough to distinguish cli_started → cli_completed → review_assigned (normal) from cli_started → agent_stuck → envelope_timed_out (stuck pattern A)
  • Short enough to remain stable across minor timing differences that share the same root cause
  • Repo-scoped to prevent false-positive matches across codebases with different tooling

The fingerprint is stored as a GitHub label stuck-fp:<sha1_8char_prefix> on the filed issue, enabling label-search deduplication without a separate database:

sha1("cli_started|agent_stuck|envelope_timed_out|rig-gitops")[:8]
→ stuck-fp:a3f2b1c9
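
The formula above maps directly onto Python's standard hashlib module. The sketch below assumes event types arrive oldest-first as plain strings; the function names `stuck_fingerprint` and `stuck_label` are hypothetical.

```python
import hashlib
from typing import Sequence


def stuck_fingerprint(event_types: Sequence[str], repo_slug: str) -> str:
    """Full SHA1 hex digest of the last 3 event types joined with the repo slug."""
    last_three = list(event_types)[-3:]
    payload = "|".join(last_three) + "|" + repo_slug
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def stuck_label(event_types: Sequence[str], repo_slug: str) -> str:
    """GitHub label carrying the 8-character fingerprint prefix."""
    return "stuck-fp:" + stuck_fingerprint(event_types, repo_slug)[:8]


events = ["issue_assigned", "cli_started", "agent_stuck", "envelope_timed_out"]
print(stuck_label(events, "rig-gitops"))
```

Because only the last three event types are hashed, the earlier `issue_assigned` event in this example does not affect the label, which is what keeps the fingerprint stable across runs that share a root cause.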

Collision handling: an 8-character SHA1 prefix carries 32 bits, so at the rig’s current issue volume (~200 issues/month) the collision probability is negligible (< 0.01%). If two semantically different patterns do collide, the watcher detects that the fingerprint recorded in the existing issue body does not match and falls back to opening a new issue with an incremented suffix.
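
The collision estimate can be sanity-checked with the standard birthday approximation, p ≈ 1 − exp(−n(n−1) / 2N), where N = 16^8 is the number of possible 8-character hex prefixes. The sketch below makes the pessimistic assumption that every one of a month's ~200 issues carries a distinct fingerprint; the function name is hypothetical.

```python
import math


def prefix_collision_probability(n_fingerprints: int, hex_chars: int = 8) -> float:
    """Birthday-bound estimate that any two of n fingerprints share a prefix."""
    space = 16 ** hex_chars  # 16^8 = 2^32 possible prefixes by default
    exponent = n_fingerprints * (n_fingerprints - 1) / (2.0 * space)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)


p = prefix_collision_probability(200)
print(f"{p:.2e}")
```

Under that assumption the estimate comes out well below the 0.01% quoted above. The bound grows roughly with the square of the number of distinct fingerprints, so it may be worth re-running as labels accumulate over longer periods.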

| Ref | Title | Repo |
| --- | --- | --- |
| rc#232 | Stuck-watcher cron (detection + fingerprint + GH issue filing) | rig-conductor |
| rc#233 | Memory writeback hook (PR-closes-stuck-issue → write_memory) | rig-conductor |
| rd#206 | BRAIN.md ## Known stuck patterns auto-section | rig-docs |
| rc#234 | Dashboard Stuck-watch card | rig-conductor |
| rg#187 | Discord ping on new stuck-pattern (edge-triggered) | rig-gitops |

  • Manually introduce a stuck condition (e.g. re-create the merge_conflict terminal bug)
  • Within 10 min, a stuck-pattern issue appears on rig-conductor with fingerprint + last events
  • Fix lands, issue closes, write_memory is called with kind: gotcha
  • BRAIN.md regenerates with the new pattern under ## Known stuck patterns
  • Recreating the same condition later → stuck-watcher comments on existing issue instead of filing a new one

| Question | Options | Status |
| --- | --- | --- |
| Who runs the watcher? | Reuse Pi-E (extend existing cron persona) vs. new dedicated watchdog-e | Open |
| Fingerprint collisions | 8-char SHA1 prefix sufficient? Fallback to body-hash comparison on collision? | Tentatively resolved (see above) |
| Human override of false positives | Label stuck-fp-false-positive to suppress recurrence filing? Or close-and-lock the issue? | Open |

  • rc#127 — reconciliation loop (already drives orphans to terminal states)
  • rc#226 — main_ci_failed events + dashboard
  • rc#227 — auto-rebase on merge_conflict
  • rc#228 — don’t auto-cancel parent issue on stale PR close
  • rig-memory-mcp — already exists, not yet wired for lifecycle events
  • rig-docs BRAIN.md — already exists, needs new auto-section

When all sub-issues are merged and the epic closes:

  1. Flip status: draft → status: accepted in this file
  2. Copy as-is to src/content/docs/decisions/2026-04-24-self-auditing-rig.md
  3. Set superseded_by: decisions/2026-04-24-self-auditing-rig on this proposal
  4. Set supersedes: proposals/2026-04-24-self-auditing-rig on the new decision doc