Self-auditing rig — stuck-issue detection, memory, and recognition loop
Manual triage ran the rig all day on 2026-04-24. The orchestrator noticed “I don’t see Review-E reviews” → investigated → filed rc#216/#217 → fix landed → verified. Every step was human-initiated. The rig should do most of this itself.
Today the rig has orphan detection (rc#127 reconciliation loop, 5-min TTL) but it only drives stuck issues to terminal states. It does not:
- Investigate WHY they got stuck (what event sequence preceded the stuck state)
- File issues when a new stuck pattern appears
- Update agent memory or BRAIN.md so the next occurrence is handled faster
- Summarise stuck-state metrics over time
Result: the same gotchas get re-discovered in every operator session.
flowchart LR
A[DETECT] --> B[INVESTIGATE]
B --> C[FILE]
C --> D[FIX]
D --> E[VERIFY]
E --> F[MEMORISE]
F --> G[RECOGNISE]
G -.->|known fingerprint| A

| Step | Actor | Mechanism |
|---|---|---|
| DETECT | stuck-watcher cron (*/10 * * * *) | Queries /api/issues; checks age thresholds |
| INVESTIGATE | stuck-watcher | Reads last 10 events + log snippet; computes fingerprint |
| FILE | stuck-watcher | Opens stuck-pattern GH issue if fingerprint is novel; comments if known |
| FIX | Dev-E or operator | Normal issue assignment via rig-conductor |
| VERIFY | rig-conductor | PR merged + issue closed |
| MEMORISE | PR merge webhook | Calls write_memory(kind=gotcha, fingerprint=...) |
| RECOGNISE | Agent session start | read_memories surfaces known pattern before work begins |
Scope (minimum viable)
1. Stuck-watcher (detection + file)
A */10 * * * * cron persona (Pi-E reuse or new watchdog-e) that runs the following checks:
| Check | Threshold |
|---|---|
| in_progress with no cli_progress event | > 30 min |
| in_review with no review_assigned event | > 20 min |
| merge_conflict terminal state | any age |
| ready_to_merge state | > 15 min |
| envelope_timed_out count | > 0 |
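The checks in the table could be evaluated roughly as follows. This is a minimal sketch, not the watcher's actual code: the `IssueSnapshot` type, its field names, and the returned reason strings are assumptions made for illustration, and the sketch assumes these fields can be derived from the `/api/issues` response.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class IssueSnapshot:
    """Hypothetical view of one issue; field names are assumptions."""
    state: str
    entered_state_at: datetime
    last_cli_progress_at: Optional[datetime] = None
    review_assigned: bool = False
    envelope_timed_out_count: int = 0


def stuck_reason(issue: IssueSnapshot, now: datetime) -> Optional[str]:
    """Return the name of the first check that fires, or None if healthy."""
    in_state = now - issue.entered_state_at
    if issue.envelope_timed_out_count > 0:
        return "envelope_timed_out"
    if issue.state == "merge_conflict":  # flagged at any age
        return "merge_conflict"
    if issue.state == "in_progress":
        # Fall back to state entry time when no cli_progress was ever seen.
        last = issue.last_cli_progress_at or issue.entered_state_at
        if now - last > timedelta(minutes=30):
            return "in_progress_no_cli_progress"
    if issue.state == "in_review" and not issue.review_assigned:
        if in_state > timedelta(minutes=20):
            return "in_review_no_review_assigned"
    if issue.state == "ready_to_merge" and in_state > timedelta(minutes=15):
        return "ready_to_merge_stale"
    return None
```

Returning a reason string rather than a boolean keeps the triggering check available for the issue body and for metrics.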
For each flagged issue:
- Compute fingerprint (see Fingerprinting below)
- If novel → open a new stuck-pattern GH issue with the investigation template body
- If known → comment on the existing issue with a fresh occurrence timestamp + count
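The novel-versus-known branch above can be sketched as a pure decision function. The `Action` type, the `plan_action` name, and the shape of `issues_by_label` (a mapping from `stuck-fp:` label to existing issue number, presumably built from a label search) are hypothetical; the sketch deliberately makes no GitHub API calls.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Action:
    """Hypothetical description of what the watcher should do next."""
    kind: str                       # "open_issue" or "comment"
    label: str                      # stuck-fp:<8-char prefix>
    issue_number: Optional[int] = None
    body: str = ""


def plan_action(
    fingerprint: str,
    issues_by_label: Dict[str, int],
    occurred_at: str,
    count: int,
) -> Action:
    """Decide between filing a new stuck-pattern issue and commenting."""
    label = f"stuck-fp:{fingerprint[:8]}"
    existing = issues_by_label.get(label)
    if existing is None:
        # Novel fingerprint: file a new issue using the investigation template.
        return Action("open_issue", label, body="<investigation template>")
    # Known fingerprint: record the fresh occurrence on the existing issue.
    note = f"Occurrence #{count} at {occurred_at}"
    return Action("comment", label, existing, note)
```

Keeping the decision separate from the side effects makes the deduplication rule easy to test without touching GitHub.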
2. Investigation prompt template
When stuck-watcher opens an issue, the body includes:
- Last 10 events from the event stream
- Last log snippet if available
- Fingerprint (SHA1 of event type sequence + repo slug)
- Links to similar past issues sharing the same stuck-fp: label
- Suggested questions for the assigned Dev-E or operator
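One way the issue body listed above might be assembled is shown below. The function name, parameter names, and the section wording are assumptions for illustration only; the proposal does not fix a layout.

```python
from typing import Optional, Sequence


def render_issue_body(
    events: Sequence[str],
    fingerprint: str,
    log_snippet: Optional[str] = None,
    similar_issues: Sequence[int] = (),
    questions: Sequence[str] = (),
) -> str:
    """Build the issue body; section titles and layout are assumptions."""
    label = f"stuck-fp:{fingerprint[:8]}"
    lines = ["Last events (most recent last):"]
    lines += [f"- {e}" for e in list(events)[-10:]]  # keep only the last 10
    lines += ["", "Log snippet:", log_snippet or "(not available)"]
    lines += ["", f"Fingerprint: {fingerprint} ({label})"]
    lines += ["", f"Similar past issues ({label}):"]
    lines += [f"- #{n}" for n in similar_issues] or ["- none found"]
    lines += ["", "Suggested questions:"]
    lines += [f"- {q}" for q in questions]
    return "\n".join(lines)
```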
3. Memory writeback after fix
When a PR closes a stuck-pattern issue, the PR merge workflow calls rig-memory-mcp/write_memory:
scope: rig
kind: gotcha
title: <stuck-pattern title>
content: |
  Pattern: <what happened>
  Root cause: <why it happened>
  Fix: <how to resolve>
fingerprint: <sha1 of event sequence>
tags: [conductor, lifecycle, stuck-pattern]
4. BRAIN.md auto-section
A generated ## Known stuck patterns block in BRAIN.md sourced from memory tag stuck-pattern:
| Column | Source |
|---|---|
| Pattern | memory title |
| Fingerprint | stuck-fp: label prefix |
| Fix | PR that resolved the originating issue |
The block is regenerated on every write_memory call and by a nightly cron.
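A generator for this block could look something like the sketch below. It assumes each memory is available as a dict with the `title`, `fingerprint`, and `tags` fields from the writeback payload, plus a hypothetical `fix_pr` field holding the reference of the resolving PR; how memories are actually fetched from rig-memory-mcp is left out.

```python
from typing import Dict, Iterable, List


def render_known_patterns(memories: Iterable[Dict]) -> str:
    """Render the BRAIN.md block from memories tagged stuck-pattern."""
    rows: List[str] = [
        "## Known stuck patterns",
        "",
        "| Pattern | Fingerprint | Fix |",
        "|---|---|---|",
    ]
    tagged = [m for m in memories if "stuck-pattern" in m.get("tags", [])]
    # Sort by title so regenerating the block yields a stable diff.
    for m in sorted(tagged, key=lambda m: m["title"]):
        title = m["title"].replace("|", "\\|")  # keep table cells intact
        label = f"stuck-fp:{m['fingerprint'][:8]}"
        rows.append(f"| {title} | {label} | {m['fix_pr']} |")
    return "\n".join(rows)
```

Sorting the rows is a design choice rather than a requirement: it keeps nightly regeneration from producing noisy diffs when nothing has changed.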
5. Dashboard surface
Add a “Stuck watch” card to the rig overview: count of active stuck-pattern issues, link to the label filter on rig-conductor.
Fingerprinting
Why sha1(last_3_event_types + "|" + repo_slug)?
Stuck patterns in the rig are almost always caused by a specific event sequence — not random noise. The last 3 event types before the stuck state form a minimal but discriminating signature:
- Long enough to distinguish cli_started → cli_completed → review_assigned (normal) from cli_started → agent_stuck → envelope_timed_out (stuck pattern A)
- Short enough to remain stable across minor timing differences that share the same root cause
- Repo-scoped to prevent false-positive matches across codebases with different tooling
The fingerprint is stored as a GitHub label stuck-fp:<sha1_8char_prefix> on the filed issue, enabling label-search deduplication without a separate database:
sha1("cli_started|agent_stuck|envelope_timed_out|rig-gitops")[:8] → stuck-fp:a3f2b1c9
Collision handling: the probability of an 8-char SHA1 prefix collision at the rig’s current issue volume (~200 issues/month) is negligible (< 0.01%). If two semantically different patterns collide, the watcher detects that the fingerprint recorded in the existing issue body does not match and falls back to opening a new issue with an incremented suffix.
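The fingerprint and the collision estimate can be sketched with the standard library. The function names are hypothetical, and the join format follows the example string above (event types and repo slug all separated by `|`). The collision estimate uses the standard birthday approximation and pessimistically treats every issue in a month as a distinct pattern.

```python
import hashlib
import math
from typing import Sequence


def stuck_fingerprint(event_types: Sequence[str], repo_slug: str) -> str:
    """Full SHA1 hex digest of the last 3 event types plus the repo slug."""
    last3 = list(event_types)[-3:]
    payload = "|".join(last3 + [repo_slug])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def stuck_label(fingerprint: str) -> str:
    """GitHub label carrying the 8-char prefix used for deduplication."""
    return f"stuck-fp:{fingerprint[:8]}"


def collision_probability(n: int, hex_chars: int = 8) -> float:
    """Birthday approximation: chance that any two of n patterns collide."""
    space = 16 ** hex_chars            # 2**32 values for 8 hex chars
    x = n * (n - 1) / (2.0 * space)
    return -math.expm1(-x)             # 1 - exp(-x), accurate for tiny x


fp = stuck_fingerprint(
    ["cli_started", "agent_stuck", "envelope_timed_out"], "rig-gitops"
)
label = stuck_label(fp)                # "stuck-fp:" plus 8 hex characters
p = collision_probability(200)         # on the order of 5e-6, below 0.01%
```

Under this approximation the 0.01% bound holds up to roughly 900 distinct fingerprints, so the estimate deserves a second look if fingerprints accumulate over many months rather than being counted per month.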
Sub-issues
| Ref | Title | Repo |
|---|---|---|
| rc#232 | Stuck-watcher cron (detection + fingerprint + GH issue filing) | rig-conductor |
| rc#233 | Memory writeback hook (PR-closes-stuck-issue → write_memory) | rig-conductor |
| rd#206 | BRAIN.md ## Known stuck patterns auto-section | rig-docs |
| rc#234 | Dashboard Stuck-watch card | rig-conductor |
| rg#187 | Discord ping on new stuck-pattern (edge-triggered) | rig-gitops |
Acceptance criteria
- Manually introduce a stuck condition (e.g. re-create the merge_conflict terminal bug)
- Within 10 min, a stuck-pattern issue appears on rig-conductor with fingerprint + last events
- Fix lands, issue closes, write_memory is called with kind: gotcha
- BRAIN.md regenerates with the new pattern under ## Known stuck patterns
- Recreating the same condition later → stuck-watcher comments on existing issue instead of filing a new one
Open questions
| Question | Options | Status |
|---|---|---|
| Who runs the watcher? | Reuse Pi-E (extend existing cron persona) vs. new dedicated watchdog-e | Open |
| Fingerprint collisions | 8-char SHA1 prefix sufficient? Fallback to body-hash comparison on collision? | Tentatively resolved (see above) |
| Human override of false positives | Label stuck-fp-false-positive to suppress recurrence filing? Or close-and-lock the issue? | Open |
Related prior work
- rc#127: reconciliation loop (already drives orphans to terminal states)
- rc#226: main_ci_failed events + dashboard
- rc#227: auto-rebase on merge_conflict
- rc#228: don’t auto-cancel parent issue on stale PR close
- rig-memory-mcp: already exists, not yet wired for lifecycle events
- rig-docs BRAIN.md: already exists, needs new auto-section
Transition on accept
When all sub-issues are merged and the epic closes:
- Flip status: draft → status: accepted in this file
- Copy as-is to src/content/docs/decisions/2026-04-24-self-auditing-rig.md
- Set superseded_by: decisions/2026-04-24-self-auditing-rig on this proposal
- Set supersedes: proposals/2026-04-24-self-auditing-rig on the new decision doc