Self-auditing rig — stuck-issue detection, memory, and recognition loop

Why

Manual triage ran the rig all day 2026-04-24. The orchestrator noticed “I don’t see Review-E reviews” → investigated → filed rc#216/#217 → fix landed → verified. Every step was human-initiated. The rig should do most of this itself.

Today the rig has orphan detection (rc#127 reconciliation loop, 5-min TTL) but it only drives stuck issues to terminal states. It does not:

Investigate WHY they got stuck (what event sequence preceded the stuck state)
File issues when a new stuck pattern appears
Update agent memory or BRAIN.md so the next occurrence is handled faster
Summarise stuck-state metrics over time

Result: the same gotchas get re-discovered in every operator session.

Loop

flowchart LR
    A[DETECT] --> B[INVESTIGATE]
    B --> C[FILE]
    C --> D[FIX]
    D --> E[VERIFY]
    E --> F[MEMORISE]
    F --> G[RECOGNISE]
    G -.->|known fingerprint| A

Step	Actor	Mechanism
DETECT	stuck-watcher cron (`*/10 * * * *`)	Queries `/api/issues`; checks age thresholds
INVESTIGATE	stuck-watcher	Reads last 10 events + log snippet; computes fingerprint
FILE	stuck-watcher	Opens `stuck-pattern` GH issue if fingerprint is novel; comments if known
FIX	Dev-E or operator	Normal issue assignment via rig-conductor
VERIFY	rig-conductor	PR merged + issue closed
MEMORISE	PR merge webhook	Calls `write_memory(kind=gotcha, fingerprint=...)`
RECOGNISE	Agent session start	`read_memories` surfaces known pattern before work begins

Scope (minimum viable)

1. Stuck-watcher (detection + file)

A */10 * * * * cron persona (Pi-E reuse or a new watchdog-e) that flags any issue matching one of these checks:

Check	Threshold
`in_progress` with no `cli_progress` event	> 30 min
`in_review` with no `review_assigned` event	> 20 min
`merge_conflict` terminal state	any age
`ready_to_merge` state	> 15 min
`envelope_timed_out` count	> 0
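
The checks above could be evaluated roughly as follows. This is a minimal sketch, not the watcher's actual code: the `IssueSnapshot` shape and its field names are hypothetical, and it assumes the `in_progress` clock restarts at each `cli_progress` event while the `in_review` check clears on any `review_assigned` event seen since the issue entered review.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

# Thresholds taken from the table above.
IN_PROGRESS_LIMIT = timedelta(minutes=30)
IN_REVIEW_LIMIT = timedelta(minutes=20)
READY_TO_MERGE_LIMIT = timedelta(minutes=15)


@dataclass
class IssueSnapshot:
    """Hypothetical shape of one record returned by /api/issues."""
    number: int
    state: str
    state_entered_at: datetime
    last_cli_progress_at: Optional[datetime] = None
    last_review_assigned_at: Optional[datetime] = None
    envelope_timed_out_count: int = 0


def stuck_reason(issue: IssueSnapshot, now: datetime) -> Optional[str]:
    """Return the name of the first check the issue trips, or None."""
    age = now - issue.state_entered_at
    if issue.state == "merge_conflict":
        return "merge_conflict"  # flagged at any age
    if issue.envelope_timed_out_count > 0:
        return "envelope_timed_out"
    if issue.state == "in_progress":
        # Assumption: the clock restarts at each cli_progress event.
        last = issue.last_cli_progress_at or issue.state_entered_at
        if now - last > IN_PROGRESS_LIMIT:
            return "in_progress_no_cli_progress"
    if issue.state == "in_review":
        # Assumption: a review_assigned event since entering review clears the check.
        assigned = issue.last_review_assigned_at
        unassigned = assigned is None or assigned < issue.state_entered_at
        if unassigned and age > IN_REVIEW_LIMIT:
            return "in_review_no_review_assigned"
    if issue.state == "ready_to_merge" and age > READY_TO_MERGE_LIMIT:
        return "ready_to_merge_stale"
    return None


now = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
example = IssueSnapshot(1, "in_progress", now - timedelta(minutes=45))
print(stuck_reason(example, now))
```

Returning a single reason string keeps the filing step simple; a real watcher might prefer to collect every tripped check per issue.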

For each flagged issue:

Compute fingerprint (see Fingerprinting below)
If novel → open new stuck-pattern GH issue with investigation template body
If known → comment on existing issue with fresh occurrence timestamp + count
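
The novel-versus-known branch can be sketched as a pure decision function. The `known` index (label to existing issue number) is a hypothetical stand-in for a GitHub label search, and the comment wording is illustrative only.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping, Optional


@dataclass(frozen=True)
class FilingAction:
    kind: str                          # "open" for a novel fingerprint, "comment" for a known one
    label: str
    issue_number: Optional[int] = None
    comment: Optional[str] = None


def decide_filing(
    label: str,
    known: Mapping[str, int],          # hypothetical index: stuck-fp label -> existing issue number
    occurrence_count: int,
    seen_at: datetime,
) -> FilingAction:
    """Open a new issue for a novel label, otherwise comment on the existing one."""
    existing = known.get(label)
    if existing is None:
        return FilingAction(kind="open", label=label)
    comment = f"Seen again at {seen_at.isoformat()} (occurrence {occurrence_count})."
    return FilingAction(kind="comment", label=label, issue_number=existing, comment=comment)


seen = datetime(2026, 4, 24, 12, 0, tzinfo=timezone.utc)
print(decide_filing("stuck-fp:a3f2b1c9", {"stuck-fp:a3f2b1c9": 42}, 3, seen))
```

Keeping the decision separate from the GitHub calls makes the dedup rule easy to unit-test without network access.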

2. Investigation prompt template

When stuck-watcher opens an issue, the body includes:

Last 10 events from the event stream
Last log snippet if available
Fingerprint (SHA1 of the last 3 event types + repo slug)
Links to similar past issues sharing the same stuck-fp: label
Suggested questions for the assigned Dev-E or operator
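
One way to assemble that body, as a sketch: the section titles and the Markdown layout are assumptions, and the inputs are taken to be already-fetched strings rather than live API objects.

```python
from typing import Sequence


def render_issue_body(
    events: Sequence[str],
    log_snippet: str,
    fingerprint: str,
    similar_issue_urls: Sequence[str],
    questions: Sequence[str],
) -> str:
    """Render the investigation template as Markdown text."""

    def bullets(items: Sequence[str]) -> str:
        # Fall back to a placeholder bullet when a section has no entries.
        return "\n".join(f"- {item}" for item in items) or "- (none)"

    sections = [
        ("Last events", bullets(list(events)[-10:])),   # keep only the last 10 events
        ("Log snippet", log_snippet.strip() or "(not available)"),
        ("Fingerprint", fingerprint),
        ("Similar issues", bullets(similar_issue_urls)),
        ("Suggested questions", bullets(questions)),
    ]
    return "\n\n".join(f"### {title}\n\n{body}" for title, body in sections)


print(render_issue_body(
    ["cli_started", "agent_stuck", "envelope_timed_out"],
    "",
    "stuck-fp:a3f2b1c9",
    [],
    ["What was the agent doing when progress events stopped?"],
))
```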

3. Memory writeback after fix

When a PR closes a stuck-pattern issue, the PR merge workflow calls rig-memory-mcp/write_memory:

scope: rig
kind: gotcha
title: <stuck-pattern title>
content: |
  Pattern: <what happened>
  Root cause: <why it happened>
  Fix: <how to resolve>
fingerprint: <sha1 of event sequence>
tags: [conductor, lifecycle, stuck-pattern]
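
A sketch of how the merge workflow might build that payload before handing it to write_memory. Only the field names come from the block above; the helper name, the flat-mapping argument shape, and the empty-field check are assumptions, since the tool's real signature is not specified here.

```python
from typing import Any, Dict


def build_gotcha_memory(
    title: str, pattern: str, root_cause: str, fix: str, fingerprint: str
) -> Dict[str, Any]:
    """Build the argument mapping for a write_memory call (hypothetical helper)."""
    fields = {
        "title": title,
        "pattern": pattern,
        "root_cause": root_cause,
        "fix": fix,
        "fingerprint": fingerprint,
    }
    missing = [name for name, value in fields.items() if not value.strip()]
    if missing:
        # Refuse to store a gotcha that would not help the next session.
        raise ValueError(f"empty fields: {', '.join(missing)}")
    content = f"Pattern: {pattern}\nRoot cause: {root_cause}\nFix: {fix}\n"
    return {
        "scope": "rig",
        "kind": "gotcha",
        "title": title,
        "content": content,
        "fingerprint": fingerprint,
        "tags": ["conductor", "lifecycle", "stuck-pattern"],
    }
```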

4. BRAIN.md auto-section

A generated ## Known stuck patterns block in BRAIN.md sourced from memory tag stuck-pattern:

Column	Source
Pattern	memory title
Fingerprint	`stuck-fp:` label prefix
Fix	PR that resolved the originating issue

The block is refreshed on each write_memory call and by a nightly cron.
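
Rendering the block could look like the sketch below. The record shape is an assumption: each memory is taken to carry `title`, the full `fingerprint`, and a hypothetical `fix_pr` reference, none of which is confirmed by rig-memory-mcp.

```python
from typing import Iterable, Mapping

HEADING = "## Known stuck patterns"


def render_known_patterns(memories: Iterable[Mapping[str, str]]) -> str:
    """Render the auto-section as a Markdown table, sorted by pattern title."""
    rows = ["| Pattern | Fingerprint | Fix |", "| --- | --- | --- |"]
    for memory in sorted(memories, key=lambda m: m["title"]):
        label = f"stuck-fp:{memory['fingerprint'][:8]}"   # 8-char prefix, as on the GH label
        rows.append(f"| {memory['title']} | {label} | {memory['fix_pr']} |")
    return HEADING + "\n\n" + "\n".join(rows) + "\n"


print(render_known_patterns([
    {"title": "Envelope timeout after agent_stuck", "fingerprint": "a3f2b1c9" + "0" * 32, "fix_pr": "(PR ref)"},
]))
```

Sorting by title keeps the generated block stable between runs, so a refresh with no new memories produces no diff.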

5. Dashboard surface

Add a “Stuck watch” card to the rig overview: count of active stuck-pattern issues, link to the label filter on rig-conductor.

Fingerprinting

Why sha1(last_3_event_types + "|" + repo_slug)?

Stuck patterns in the rig are almost always caused by a specific event sequence — not random noise. The last 3 event types before the stuck state form a minimal but discriminating signature:

Long enough to distinguish cli_started → cli_completed → review_assigned (normal) from cli_started → agent_stuck → envelope_timed_out (stuck pattern A)
Short enough to remain stable across minor timing differences that share the same root cause
Repo-scoped to prevent false-positive matches across codebases with different tooling

The fingerprint is stored as a GitHub label stuck-fp:<sha1_8char_prefix> on the filed issue, enabling label-search deduplication without a separate database:

sha1("cli_started|agent_stuck|envelope_timed_out|rig-gitops")[:8]
→ stuck-fp:a3f2b1c9
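
The same computation as a small Python sketch, using the standard-library hashlib. The function names are illustrative, and UTF-8 encoding of the joined string is an assumption.

```python
import hashlib
from typing import Sequence


def stuck_fingerprint(event_types: Sequence[str], repo_slug: str) -> str:
    """Full SHA1 hex digest of the last 3 event types plus the repo slug."""
    last_three = list(event_types)[-3:]
    payload = "|".join([*last_three, repo_slug])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


def stuck_label(fingerprint: str) -> str:
    """GitHub label carrying the 8-character fingerprint prefix."""
    return f"stuck-fp:{fingerprint[:8]}"


fp = stuck_fingerprint(
    ["issue_opened", "cli_started", "agent_stuck", "envelope_timed_out"],
    "rig-gitops",
)
print(stuck_label(fp))
```

Storing the full digest in the issue body while labelling with the prefix keeps label search cheap and still leaves enough information to tell prefix collisions apart.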

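Collision handling: an 8-char SHA1 prefix spans 16^8 (about 4.3 billion) values, so at the rig’s current issue volume (~200 issues/month) the probability that any two fingerprints collide is negligible: by the birthday bound it stays below 0.01% until roughly 900 distinct fingerprints exist. If two semantically different patterns do collide, the watcher detects that the fingerprint recorded in the existing issue body does not match and falls back to opening a new issue with an incremented suffix.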
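
The collision estimate can be checked with a birthday-bound calculation. This sketch deliberately overestimates by treating every issue as a distinct fingerprint; in practice only stuck issues get one, so the real figure should be lower.

```python
import math


def prefix_collision_probability(distinct_fingerprints: int, hex_chars: int = 8) -> float:
    """Birthday-bound estimate that any two fingerprints share the same prefix."""
    space = 16 ** hex_chars                       # 8 hex chars -> 2**32 possible prefixes
    pairs = distinct_fingerprints * (distinct_fingerprints - 1) / 2
    return -math.expm1(-pairs / space)            # equals 1 - exp(-pairs / space)


# One month and one year of issues at ~200 issues/month.
for n in (200, 2400):
    print(n, f"{prefix_collision_probability(n):.5%}")
```

Under this approximation one month of issues gives roughly 0.0005%, while a full year (2,400 fingerprints) gives roughly 0.07%, so the body-comparison fallback matters more as the label set grows.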

Sub-issues

Ref	Title	Repo
rc#232	Stuck-watcher cron (detection + fingerprint + GH issue filing)	rig-conductor
rc#233	Memory writeback hook (PR-closes-stuck-issue → `write_memory`)	rig-conductor
rd#206	BRAIN.md `## Known stuck patterns` auto-section	rig-docs
rc#234	Dashboard Stuck-watch card	rig-conductor
rg#187	Discord ping on new stuck-pattern (edge-triggered)	rig-gitops

Acceptance criteria

Manually introduce a stuck condition (e.g. re-create the merge_conflict terminal bug)
Within 10 min, a stuck-pattern issue appears on rig-conductor with fingerprint + last events
Fix lands, issue closes, write_memory is called with kind: gotcha
BRAIN.md regenerates with the new pattern under ## Known stuck patterns
Recreating the same condition later → stuck-watcher comments on existing issue instead of filing a new one

Open questions

Question	Options	Status
Who runs the watcher?	Reuse Pi-E (extend existing cron persona) vs. new dedicated `watchdog-e`	Open
Fingerprint collisions	8-char SHA1 prefix sufficient? Fallback to body-hash comparison on collision?	Tentatively resolved (see above)
Human override of false positives	Label `stuck-fp-false-positive` to suppress recurrence filing? Or close-and-lock the issue?	Open

Related

rc#127: reconciliation loop (already drives orphans to terminal states)
rc#226: main_ci_failed events + dashboard
rc#227: auto-rebase on merge_conflict
rc#228: don’t auto-cancel parent issue on stale PR close
rig-memory-mcp: already exists, not yet wired for lifecycle events
rig-docs BRAIN.md: already exists, needs new auto-section

Transition on accept

When all sub-issues are merged and the epic closes:

Flip status: draft → status: accepted in this file
Copy as-is to src/content/docs/decisions/2026-04-24-self-auditing-rig.md
Set superseded_by: decisions/2026-04-24-self-auditing-rig on this proposal
Set supersedes: proposals/2026-04-24-self-auditing-rig on the new decision doc