Rig Brain

Fresh-agent entry point. Read this first. One fetch (~27 KB) gives you the repo manifest, deployed surfaces (including rig-conductor’s 61 endpoints and built-in Dashboard), agent instances, primary flows, frontmatter schema, 40+ event types (summary; full schemas at /events.md), 18-whitepaper catalog, and the current backlog with prior_art links. Every claim traces to its source file in facts/.

Compiled from facts/*.yaml + live GitHub state (gh api /orgs/dashecorp/repos for the repo list; manifest validation for agents). Do not hand-edit BRAIN.md. Regenerate with npm run brain. CI runs --check and fails on drift.

What this is

The Dashecorp rig is an autonomous coding-agent system. A human posts a user story; agents research, propose, code, review, and ship. Canonical docs live in dashecorp/rig-docs (Astro Starlight); operational memory lives in a Postgres + pgvector Memory MCP; deployments are Flux-managed on a k3s cluster running on a GCE VM (invotek-k3s in invotek-github-infra).

Published surfaces

Rig landing — discoverable index of all surfaces

URL: https://rig.dashecorp.com/
Type: html

Canonical brain entry point (this file, rendered)

URL: https://docs.rig.dashecorp.com/brain/
Raw: https://research.rig.dashecorp.com/BRAIN.md
Type: markdown

Brain map — visual architecture + doc-linkage graph

URL: https://research.rig.dashecorp.com/map/
Type: astro-starlight
Note: Two auto-derived diagrams (architecture from facts/, linkage from doc frontmatter). See the shape of what the rig knows before fetching individual pages.

LLM site map (research, proposals, user-stories)

URL: https://research.rig.dashecorp.com/llms.txt
Type: llms-txt

Full content dump (single-shot ingestion)

URL: https://research.rig.dashecorp.com/llms-full.txt
Type: llms-full-txt

Research, proposals, user-stories (rendered Starlight site)

URL: https://research.rig.dashecorp.com/
Type: astro-starlight
Source: dashecorp/rig-docs

Aggregated engineering docs (architecture, guides, whitepapers, per-repo docs)

URL: https://docs.rig.dashecorp.com/
Type: mkdocs-material
Source: dashecorp/rig-gitops (docs-site/)
Note: Built by scripts/build-docs.sh in rig-gitops on push + hourly cron. Pulls each rig repo’s docs/ via gh api. Different scope from research.rig.dashecorp.com (engineering reference vs. research).

Sitemap (XML)

URL: https://research.rig.dashecorp.com/sitemap-index.xml
Type: sitemap-xml

rig-conductor API (cluster-internal)

Type: rest-api
Visibility: cluster-internal-only
Endpoints:
- POST /api/events — Submit any of the 40+ event types — see /events.md
- GET /api/assignments/next — Claim next issue assignment. Query: agentId=dev-e-node
- GET /api/pr-reviews/next — Claim direct-PR review (no issue) for infra/tooling PRs
- GET /api/pr-reviews/item — Inspect a single PR review item. Query: repo, prNumber
- POST /api/pr-reviews/merge — Server-side merge gate for direct PR reviews (rc#1028)
- GET /api/issues — List tracked issues. Query: state=open|done|stuck
- GET /api/issues/item — Fetch a single issue projection by (repo, issueNumber)
- GET /api/issues/trace — Per-issue event trace + state transitions for debugging
- GET /api/stuck-issues — List issues in a non-terminal state for too long (stuck-watcher candidate set)
- GET /api/queue — Current dispatch queue state
- GET /api/usage — Token / cost usage by agent and/or repo. Query: agentId, repo
- GET /api/costs/issue — Cost for a specific issue. Query: repo, issueNumber
- GET /api/costs/summary — Aggregate cost. Query: days (default 7)
- GET /api/costs/daily — Daily cost time series. Query: days
- GET /api/events/live — SSE stream of live events (for Dashboard.html)
- GET /api/streams/status — Stream consumer status
- GET /api/streams/{agentId} — Per-agent stream tail (recent assignment messages). Query: count
- GET /api/agents — List registered agents (heartbeat + status). Query: archived=true
- DELETE /api/agents/{agentId} — Forcibly archive a specific agent (admin)
- DELETE /api/agents/offline — Bulk-archive all agents that are offline (no recent heartbeat)
- GET /api/agent-capacity — Per-agent capacity / quota / dispatch eligibility snapshot
- POST /api/webhook/github — GitHub webhook intake — normalizes GH events into rig-conductor stream
- POST /api/webhook/flux — Flux deploy confirmation webhook (rc#413 in_deploy → deployed)
- POST /api/merge — Server-side merge gate
- POST /api/execution-logs — Create execution log envelope
- POST /api/execution-logs/{id}/logs — Append log entries
- POST /api/execution-logs/{id}/steps — Append structured step
- POST /api/execution-logs/{id}/complete — Mark log complete
- GET /api/execution-logs/{id} — Fetch log by id
- GET /api/execution-logs/issue — Logs per issue. Query: repo, issueNumber
- GET /api/execution-logs — List logs. Query: limit, status
- POST /api/execution-logs/cleanup — Prune old logs
- GET /api/repo-learnings — Fetch learnings. Query: repo
- POST /api/repo-learnings — Upsert learning
- DELETE /api/repo-learnings — Delete learning. Query: repo, key
- GET /api/guard-blocked — Guard-block counts per agent. Query: agentId
- GET /health — Liveness probe — always 200 if the process is alive
- GET /healthz/deep — Deep readiness probe — Marten + Valkey + dependency checks (rc#1188)
- GET /api/health — Detailed health snapshot for the dashboard (component-level)
- GET /api/version — Build version + git SHA
- GET /dashboard — Built-in single-page dashboard (HTML) — Engineering Rig control plane
- GET /api/events/stream — Single event-stream tail by stream id. Query: id
- GET /api/events/recent — Recent events across all streams. Query: hours
- GET /api/main-ci — Main-branch CI status snapshot. Query: repo
- GET /api/ci-failures — List CI failures across repos. Query: repo, includeAcked
- POST /api/ci-failures/{repo}/{workflowName}/{runId:long}/ack — Ack a CI failure so it stops showing as active
- GET /api/main-guard/incidents — Main-guard incidents (rc#1226 + rc#1234). Query: repo, status
- GET /api/a11y — Accessibility scan results per repo. Query: repo
- GET /api/stuck-watch — Live stuck-watch snapshot (proxies upstream cluster check)
- GET /api/stuck-patterns — List active stuck patterns. Query: includeResolved=true for all
- POST /api/stuck-patterns/{fingerprint}/resolve — Mark a stuck pattern as resolved (writes memory)
- GET /api/stuck-patterns/brain-section — Generate the ## Known stuck patterns markdown for BRAIN.md
- GET /api/agent-logs — List recent agent log entries across all agents. Query: count
- GET /api/agent-logs/{agentId} — Tail recent log entries for one agent. Query: count
- POST /api/agent-logs — Append a batch of log entries from an agent (push from pod)
- GET /api/agent-logs/live — SSE stream of live agent log entries
- GET /api/self-improvement/signatures — Watcher signature states (rc#947): occurrences, OpenIssue, clean-tick counter
- POST /api/admin/issues/force-done — Operator force-close an issue’s read-model state to Done (admin)
- POST /api/admin/overrides — Record an operator override event (audit trail)
- GET /api/admin/overrides — List recent operator overrides for audit
- POST /api/planner/trigger — Dispatch a planner task (planner agent stream)
Note: The conductor’s in-cluster API endpoint. Reachable only from inside the cluster — exact host/port intentionally not surfaced publicly.
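The query-string endpoints above can be addressed with a small URL builder. This is a minimal sketch, not the rig's client: `BASE_URL` is a placeholder (the real in-cluster host/port is intentionally not published) and `conductor_url` is a hypothetical helper. Only the paths and query names come from the list above.

```python
from urllib.parse import urlencode

# Placeholder only: the real in-cluster host/port is intentionally not published.
BASE_URL = "http://rig-conductor.example.internal:8080"


def conductor_url(path: str, **query: object) -> str:
    """Build a rig-conductor URL, dropping query parameters that are None."""
    params = {key: value for key, value in query.items() if value is not None}
    url = BASE_URL + path
    if params:
        url += "?" + urlencode(params)
    return url


# Paths and query names are taken from the endpoint list above.
next_assignment = conductor_url("/api/assignments/next", agentId="dev-e-node")
weekly_costs = conductor_url("/api/costs/summary", days=7)
version = conductor_url("/api/version")
```

Keeping URL construction in one place means the placeholder host is swapped once, from inside the cluster, without touching call sites.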

rig-conductor Dashboard (the built-in cost/activity UI)

Type: html-dashboard
Source: dashecorp/rig-conductor (src/ConductorE.Api/Dashboard.html)
Visibility: cluster-internal-only
Note: 42 KB single-page HTML dashboard titled “Engineering Rig — Control Plane”. Has Costs, Issues, Agents, Streams tabs. Driven by the /api/costs/*, /api/usage, /api/issues, and /api/streams/* endpoints. No separate Grafana/Starlight dashboard is needed: this one already renders per-agent / per-issue / per-day cost.

Memory MCP (Postgres + pgvector)

Type: mcp-server
Package: @dashecorp/rig-memory-mcp
Tools:
- read_memories — Query prior memory by topic/repo/scope with vector similarity
- write_memory — Persist a new memory with scope/kind/importance/tags
- mark_used — Increment hit_count on a memory that informed a decision

Discord agent channels (notifications)

Type: discord
Channels: #dev-e, #review-e, #ibuild-e, #admin
Note: Agents post thread updates here; humans watch for stuck / pending state.

Repos

Live from gh api /orgs/dashecorp/repos merged with facts/repos.yaml annotations. Archived repos are dropped automatically.

Repo	Purpose	Language	Depends on	AGENTS.md
`rig-gitops`	GitOps manifests (Flux HelmReleases, Kustomize bases) and the canonical AGENTS.md shared by every rig repo via `@dashecorp/rig-gitops/AGENTS	shell	—	compiled
`rig-agent-runtime`	The AI agent runtime (Node) — one image that deploys as Dev-E, Review-E, or iBuild-E depending on character file + environment. Handles prom	javascript	rig-memory-mcp, rig-conductor	imports-rig-gitops
`rig-memory-mcp`	MCP server backing persistent agent memory with Postgres + pgvector. Exposes `read_memories` / `write_memory` / `mark_used` tools consumed b	javascript	postgres-pgvector	claude-md
`rig-conductor`	Event store + dispatch service (C# + Marten + Postgres). Receives PR/issue events, assigns work, tracks turns/cost/stuck state, serves the `	csharp	postgres, pgvector	imports-rig-gitops
`rig-docs`	Research, proposals, user-stories, and rig-wide reference (Astro Starlight). This repo — you’re reading its BRAIN.md. Deploys to research.ri	astro	—	hand
`rig-tools`	Shell scripts, Git hooks, and workflow sync for AI-assisted development. Developer tooling, not deployed. The one repo without an AGENTS.md	shell	—	none
`infra`	OpenTofu/Terraform for GitHub org settings, Cloudflare (DNS, Pages, tunnels), GCP (k3s cluster on a GCE VM (invotek-k3s) hosting the rig), a	hcl	—	imports-rig-gitops

Per-repo doc index (token-efficient discovery)

Before cloning a repo to find docs, consult this list to decide which docs are relevant to your issue. Then fetch raw markdown for only the relevant ones:

gh api repos/dashecorp/<repo>/contents/docs/<file>.md --header 'Accept: application/vnd.github.raw'

Auto-derived per compile via gh api /repos/<r>/contents/docs. Repos without a docs/ dir are omitted.

rig-gitops — architecture-current.md, architecture-proposed-v2.md, architecture-proposed.md, documentation-standard.md, onboarding.md, research-multi-agent-platforms.md, review-e-bootstrap.md, sops.md
rig-agent-runtime — architecture.md, configuration.md, dashboard.md, deployment.md, discord-setup.md, heartbeat.md, index.md, memory.md, messaging.md, observability.md, quickstart.md, usage-tracking.md
rig-memory-mcp — api.md
rig-conductor — api.md, architecture.md, deployment.md, event-store.md, index.md, principles.md
rig-tools — agent-workflow.md
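For scripted discovery, the fetch command above can be assembled per repo and file. A minimal sketch, assuming the `gh` CLI is installed and authenticated; `raw_doc_command` is a hypothetical helper name, and it only builds the argument list without running anything.

```python
import shlex


def raw_doc_command(repo: str, doc_file: str) -> list:
    """Return the argv that fetches one doc as raw markdown via the gh CLI."""
    if not doc_file.endswith(".md"):
        raise ValueError("doc_file should be a markdown file name, e.g. api.md")
    return [
        "gh", "api",
        f"repos/dashecorp/{repo}/contents/docs/{doc_file}",
        "--header", "Accept: application/vnd.github.raw",
    ]


# Repo and file names come from the per-repo index above.
argv = raw_doc_command("rig-conductor", "event-store.md")
command_line = shlex.join(argv)
```

Pass `argv` to `subprocess.run(argv, capture_output=True, text=True)` to execute it; building a list rather than a string avoids shell-quoting mistakes in the header value.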

Agents (deployment instances)

Dev-E — writes code

Runtime: dashecorp/rig-agent-runtime
Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
Manifest: dashecorp/rig-gitops/apps/dev-e/
Variants:
- node: apps/dev-e/rig-agent-helmrelease.yaml
- python: apps/dev-e/python-helmrelease.yaml
- dotnet: apps/dev-e/dotnet-helmrelease.yaml
Character: baked into HelmRelease values
Triggers: signal:dev-e-node/-python/-dotnet LIST + assignments:dev-e STREAM
Notes: Stream-consumed via Valkey, NOT REST polling — there is no issue.assigned poll loop. Each variant’s KEDA ScaledObject (apps/dev-e/scaledobject.yaml = node, dev-e-python-scaledobject.yaml, dev-e-dotnet-scaledobject.yaml) watches its signal: Redis LIST at valkey-primary.rig-conductor.svc.cluster.local:6379 to scale 0→1; the agent then drains its assignment from the assignments:dev-e Redis STREAM.
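The wake-then-drain pattern in the notes above can be modelled in memory. This is an illustrative sketch only: `FakeValkey` is not a Redis client, and who pushes the signal or when it is cleared are assumptions, not confirmed rig-conductor behaviour.

```python
from collections import deque


class FakeValkey:
    """In-memory stand-in for the two Valkey structures (not a Redis client)."""

    def __init__(self) -> None:
        self.signal = deque()       # models the signal:<agent> LIST
        self.assignments = deque()  # models the assignments:<agent> STREAM

    def dispatch(self, assignment: dict) -> None:
        # Assumption: the dispatcher writes the work item and a wake-up signal.
        self.assignments.append(assignment)
        self.signal.append("wake")


def desired_replicas(valkey: FakeValkey) -> int:
    """KEDA's role: scale 0 -> 1 while the signal list is non-empty."""
    return 1 if valkey.signal else 0


def drain(valkey: FakeValkey) -> list:
    """A woken agent clears its signal and reads every pending assignment."""
    valkey.signal.clear()
    work = list(valkey.assignments)
    valkey.assignments.clear()
    return work
```

The point of the split is that the LIST only answers "should a pod exist?", while the STREAM carries the actual work payload, so no REST poll loop is needed.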

Review-E — reviews PRs

Runtime: dashecorp/rig-agent-runtime
Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
Manifest: dashecorp/rig-gitops/apps/review-e/rig-agent-helmrelease.yaml
Triggers: signal:review-e LIST + assignments:review-e STREAM
Discord: #review-e
Notes (DISPATCH): stream-consumed via Valkey + KEDA, NOT cron/REST polling. The ScaledObject watches the signal:review-e Redis LIST at valkey-primary.rig-conductor.svc.cluster.local:6379 to scale 0→1; the work payload is drained from the assignments:review-e STREAM. The cron “*/5 * * * *” + agent-bot search_filter (author:app/dev-e-bot author:app/ibuild-e-bot -reviewed-by:app/review-e-bot) are preserved DEAD CODE in the helmrelease (cron.enabled: false); verbatim: “this cron prompt is no longer invoked. Review-E is stream-consumed via assignments:review-e (KEDA-scaled) … unreachable in the running pod.”
Notes (ROUTING): review is OPT-IN for non-agent authors. Agent-bot PRs (dev-e-bot / ibuild-e-bot / dependabot) auto-route. Human/operator-authored PRs do NOT auto-route; to request Review-E on an operator PR, apply the needs-review label (the working opt-in since 2026-06-12; see rig-conductor docs/2026-06-12-operator-review-opt-in-label.md + ReviewRoutingPolicy.cs). The legacy opt-in (requesting the review-e-dashecorp reviewer) is RETIRED: that machine user no longer exists, so GitHub returns 422; the App is now review-e-bot.
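The routing rule reduces to one predicate. A minimal sketch of the rule as described here, not the actual ReviewRoutingPolicy.cs; the exact author-login format GitHub reports for the bots (for example a `[bot]` suffix) is an assumption and may need normalising first.

```python
# Authors that auto-route, as named in the notes above (login format assumed).
AGENT_BOT_AUTHORS = {"dev-e-bot", "ibuild-e-bot", "dependabot"}
OPT_IN_LABEL = "needs-review"


def routes_to_review_e(author: str, labels: set) -> bool:
    """True when a PR is sent to Review-E: agent-bot author OR opt-in label."""
    return author in AGENT_BOT_AUTHORS or OPT_IN_LABEL in labels
```

For example, `routes_to_review_e("some-operator", {"needs-review"})` is true, while the same author without the label is not routed.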

iBuild-E — macOS / iOS builds

Runtime: dashecorp/rig-agent-runtime
Deployed in: Mac Mini (Oslo, on the operator’s Tailnet)
Manifest: not-in-cluster
Discord: #ibuild-e
Notes: Apple Silicon host, Xcode + App Store Connect. Auto-reauth cron refreshes OAuth every 5 min. Separate from the GCE-hosted agents because iOS builds require macOS.

Planner-E — plans sprints, manages backlog, assigns issues to agents

Runtime: dashecorp/rig-agent-runtime
Deployed in: k3s cluster on GCE VM (invotek-k3s, invotek-github-infra)
Manifest: dashecorp/rig-gitops/apps/rig-planner/
Triggers: signal:rig-planner LIST + assignments:rig-planner STREAM
Discord: #planner
Notes: GitHub App rig-planner-bot (App ID 3546083) handles GitHub issue intake. KEDA scales 0→1 on signal:rig-planner (Redis LIST); also reads assignments:rig-planner (Redis STREAM). Provider: claude-cli + claude-sonnet-4-6. Persona reference: /whitepaper/planner/.

Primary flows

PR lifecycle in dashecorp (orchestrator-owned — DO NOT copy legacy personal-org workflow files)

Trigger: Any PR opened in a dashecorp-org repo (dashe-*, rig-*, infra, etc.)

GitHub — Fires webhook to POST rig-conductor /api/webhook/github
rig-conductor — Normalizes the PR event, enforces gates (issue-link rule, labels), assigns review
Review-E — Consumes its review assignment from the assignments:review-e Valkey stream (woken by KEDA scaling 0→1 on the signal:review-e list), reviews, posts approval or CHANGES_REQUESTED
rig-conductor — On approval + green CI + no unresolved threads + no manual-merge label, calls POST /api/merge to merge server-side

Rules:

Do NOT copy the operator’s per-repo .github/workflows/request-review.yml or auto-merge.yml from legacy personal-org repos into dashecorp repos. Those files are the legacy pattern from before rig-conductor. The conductor endpoints above own this lifecycle for dashecorp. If a dashecorp repo isn’t getting reviewed or merged, the fix is to configure the GitHub webhook, not to add a workflow file.
The operator’s personal-org repos still use the per-repo workflow pattern because they predate rig-conductor’s scope. That pattern stays until those repos are archived post-migration.

Complete when: conductor emits PR_MERGED event and downstream consumers (CF Pages, iBuild-E, etc.) react
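The server-side merge condition in this flow (approval + green CI + no unresolved threads + no manual-merge label) can be written as a list of blockers. A hedged sketch: `PullRequestState` and `merge_blockers` are hypothetical names, and the real gate lives in rig-conductor, not in this code.

```python
from dataclasses import dataclass, field

MANUAL_MERGE_LABEL = "manual-merge"  # label name as written in the flow above


@dataclass
class PullRequestState:
    review_approved: bool
    ci_green: bool
    unresolved_threads: int = 0
    labels: set = field(default_factory=set)


def merge_blockers(pr: PullRequestState) -> list:
    """Return every reason the gate would refuse to merge (empty = mergeable)."""
    blockers = []
    if not pr.review_approved:
        blockers.append("review not approved")
    if not pr.ci_green:
        blockers.append("CI not green")
    if pr.unresolved_threads > 0:
        blockers.append("unresolved review threads")
    if MANUAL_MERGE_LABEL in pr.labels:
        blockers.append("manual-merge label present")
    return blockers
```

Returning the reasons rather than a bare boolean makes a waiting PR explainable, which is the question an agent usually has when a merge does not fire.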

Epic to merged work

Trigger: Human opens a user-story GitHub issue in dashecorp/rig-docs

rig-conductor — Scans open issues, classifies, dispatches to appropriate agent
Dev-E — Reads issue + relevant research; authors research / proposal / code PR
Review-E (stream-consumed via assignments:review-e, KEDA-scaled) — Finds PR, reviews against AGENTS.md + memory, requests changes or approves
Human — Merges (or Review-E’s approval satisfies branch protection; auto-merge fires)
Cloudflare Pages — Redeploys research.rig.dashecorp.com and docs.rig.dashecorp.com

Complete when: issue closed via `Closes

rig-conductor self-deploy (post-merge image rollout via Flux)

Trigger: A PR merges to dashecorp/rig-conductor main (the PR_MERGED step in the PR lifecycle flow)

GitHub Actions (.github/workflows/publish-image.yml — job publish) — Builds the container and pushes :latest + :sha-<commit> to GHCR and Artifact Registry (europe-north1-docker.pkg.dev/invotek-github-infra/dashecorp/rig-conductor); 3-attempt retry on transient registry / WIF / BuildKit errors
GitHub Actions (job update-gitops, needs publish) — sed-bumps the image tag in deploy/k8s/deployment.yaml to :sha-<commit>, opens a chore: pin rig-conductor image to sha-<sha> PR, validates the diff is exactly the one-line image bump, then admin-merges it
Flux source-controller (GitRepository rig-conductor) — Polls dashecorp/rig-conductor main every 5m and picks up the new deployment.yaml
Flux kustomize-controller (Kustomization rig-conductor-api) — Reconciles every 10m from path: deploy (prune true); the new image (imagePullPolicy Always) rolls the pod

Rules:

paths-ignore: deploy/k8s/deployment.yaml on the workflow stops the pin commit from re-triggering Publish Image (loop guard). The publish build+push is the real success signal — a pin PR that fails because a newer pin already landed is benign supersession (rc#1532), not a failure.
Worst-case propagation is ~15m (5m GitRepository interval + 10m Kustomization interval), not 5m.
Verify a deploy from inside the cluster: kubectl -n rig-conductor exec deploy/rig-conductor-api -- curl -s localhost:8080/api/version (returns the git SHA) and the same for /healthz/deep (200 = Marten + Valkey + deps healthy). Both endpoints are cluster-internal and already listed in surfaces.yaml.

Complete when: the rig-conductor-api pod runs the new sha- image and /healthz/deep passes
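The ~15m worst case is the sum of the two Flux intervals: a pin commit that lands just after a GitRepository poll waits up to 5m, and the resulting revision can then wait up to 10m for the next Kustomization reconcile. A small sketch of that arithmetic; it deliberately ignores image build time and pod rollout time.

```python
from datetime import timedelta

GIT_REPOSITORY_INTERVAL = timedelta(minutes=5)   # Flux source-controller poll
KUSTOMIZATION_INTERVAL = timedelta(minutes=10)   # Flux kustomize-controller reconcile


def worst_case_propagation(*intervals: timedelta) -> timedelta:
    """Upper bound when each stage has just run and must wait a full interval."""
    return sum(intervals, timedelta())


worst_case = worst_case_propagation(GIT_REPOSITORY_INTERVAL, KUSTOMIZATION_INTERVAL)
```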

Research and proposal authoring

Trigger: An Epic needs investigation before implementation

author dated research/YYYY-MM-DD-slug.md with user_story frontmatter
author proposals/YYYY-MM-DD-slug.md with source_research frontmatter
user_story file gets research_docs and proposal fields pointing back
RelatedDocs component auto-renders the graph; no manual cross-linking

Rules:

bidirectional links required
schema enforced in src/content.config.ts
CI rejects PRs missing required fields
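The bidirectional-link rule can be checked mechanically. This is a sketch of the idea only, assuming frontmatter has already been parsed into dicts keyed by doc path; the real enforcement is the schema in src/content.config.ts, and the paths in the usage note are hypothetical.

```python
def missing_backlinks(docs: dict) -> list:
    """Return (doc_path, story_path) pairs where the user story does not link back."""
    missing = []
    for path, frontmatter in docs.items():
        story = frontmatter.get("user_story")
        if not story:
            continue  # only research and proposal docs carry user_story
        story_fm = docs.get(story, {})
        if path.startswith("research/"):
            linked = path in story_fm.get("research_docs", [])
        else:
            linked = story_fm.get("proposal") == path
        if not linked:
            missing.append((path, story))
    return missing
```

Given a story `user-stories/example` that lists `research/example-a` in research_docs but has no proposal field, a proposal `proposals/example-b` pointing at that story would be reported as `("proposals/example-b", "user-stories/example")`.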

Cold-start agent session

Trigger: Fresh agent with blank memory receives an Epic or task

WebFetch https://research.rig.dashecorp.com/brain/ (or raw BRAIN.md)
Parse facts/repos.yaml equivalent in BRAIN.md — learn repo manifest
Parse facts/surfaces.yaml equivalent — learn URLs and endpoints
WebFetch https://research.rig.dashecorp.com/llms.txt for topic index
WebFetch relevant research/proposal docs directly via raw URL
For the target repo, fetch its AGENTS.md (compiled or imports-rig-gitops)
read_memories scoped to repo + topic via Memory MCP
Begin work with full context in ~15 KB total

Token budget: ~15 KB read, leaves 200K+ for actual work on Opus

Docs-memory promotion (weekly Lint)

Trigger: Weekly scheduled Lint job

Scan Memory MCP for rows with importance >= 4 AND hit_count >= 5
For each candidate, check if docs already cover the topic (BM25 sim)
If not covered, propose a docs PR with the memory content promoted
Human approves PR, merge triggers redeploy

Status: not-yet-built (design in research/2026-04-18-docs-memory-drift-lint)
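The first step of this flow is a plain threshold filter. Since the job is not built, the following is only a sketch of that selection step; the row shape (dicts with `importance` and `hit_count` keys) is an assumption about what Memory MCP returns.

```python
def promotion_candidates(memories: list, min_importance: int = 4, min_hits: int = 5) -> list:
    """Select memory rows that meet BOTH promotion thresholds."""
    return [
        row for row in memories
        if row["importance"] >= min_importance and row["hit_count"] >= min_hits
    ]
```

The docs-coverage check and PR proposal would run only on the rows this returns.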

Diagram-as-code authoring

Trigger: A research / proposal / user-story needs a diagram
Rule: Mermaid source inline in fenced code block. No PNG or SVG ever committed.
Rendering: remark-mermaid plugin wraps in <figure> with <pre class=mermaid> and <details> source; mermaid.js renders client-side; source preserved post-render for agent readers.

Frontmatter schema (for authoring rig-docs content)

type (optional): one of research | proposal | decision | postmortem | reference | user-story | runbook
audience (optional): one of human | agent | both — not a free-form array
Required: title, description
Optional linkage fields (paths are relative to src/content/docs/, no leading slash, no .md or .mdx extension):
- type — See type enum above.
- subtype — See subtype enum above (whitepapers only).
- audience — See audience enum above.
- created — ISO date string YYYY-MM-DD.
- updated — ISO date string YYYY-MM-DD.
- topic — Short slug grouping related docs.
- source_refs — Array of URLs (external sources supporting this doc).
- supersedes — Path to doc this replaces (no leading slash, no .md extension).
- superseded_by — Path to newer doc that replaces this (same format).
- user_story — (research/proposal only) Path to the user story this supports.
- research_docs — (user-story only) Array of research doc paths this story spawned.
- proposal — (user-story only) Path to the proposal answering this story.
- source_research — (proposal only) Array of research paths this proposal synthesises.
- github_issue — (user-story only) Full GitHub issue URL. Omit the field entirely if there is no issue — do NOT use empty string.
- whitepaper — (user-story, optional) Slug of the single whitepaper this story primarily supports. Matches the whitepaper filename without extension (e.g. “safety”, “memory”, “observability”). Used by the Starlight sidebar to roll up story counts next to each whitepaper link at build time.
- whitepapers — (user-story, optional) List form of whitepaper: — use when a story supports more than one whitepaper (e.g. a domain paper AND a synthesis paper). Format whitepapers: [a, b] or a block list. A story tagged for multiple papers counts on each paper’s sidebar badge and appears in each paper’s page-level Related list. Accepts both inline and block-list YAML. Mutually exclusive in spirit with whitepaper:, but supplying both is tolerated (values are merged and deduped).
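The merge-and-dedupe behaviour for `whitepaper` and `whitepapers` can be sketched as follows. The function name is hypothetical and the first-seen ordering is an assumption; the text above only states that values are merged and deduped.

```python
def story_whitepapers(frontmatter: dict) -> list:
    """Combine whitepaper: and whitepapers: into one deduplicated list."""
    single = frontmatter.get("whitepaper")
    many = frontmatter.get("whitepapers") or []
    merged = ([single] if single else []) + list(many)
    return list(dict.fromkeys(merged))  # dedupe while keeping first-seen order
```

A story with `whitepaper: safety` and `whitepapers: [memory, safety]` would count once on each of the safety and memory badges.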

Path examples: user-stories/2026-04-18-docs-memory-strategy, research/2026-04-18-docs-tools-evaluation, proposals/2026-04-18-docs-tooling-decision, decisions/2026-04-18-docs-tooling-decision.

Omit a field entirely when it has no value — do not use empty string.
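A pre-commit check for the rules in this section might look like the sketch below. It is not the schema in src/content.config.ts; it covers only the rules stated here (required fields, the two enums, no empty strings, path format), and `PATH_FIELDS` lists just the single-path fields.

```python
TYPES = {"research", "proposal", "decision", "postmortem", "reference", "user-story", "runbook"}
AUDIENCES = {"human", "agent", "both"}
PATH_FIELDS = ("supersedes", "superseded_by", "user_story", "proposal")


def frontmatter_errors(fm: dict) -> list:
    """Return human-readable problems; an empty list means the checks passed."""
    errors = []
    for key in ("title", "description"):
        if not fm.get(key):
            errors.append(f"missing required field: {key}")
    if "type" in fm and fm["type"] not in TYPES:
        errors.append(f"type must be one of {sorted(TYPES)}")
    audience = fm.get("audience")
    if "audience" in fm and (not isinstance(audience, str) or audience not in AUDIENCES):
        errors.append("audience must be a single value: human, agent, or both")
    for key, value in fm.items():
        if value == "":
            errors.append(f"{key}: omit the field instead of using an empty string")
    for key in PATH_FIELDS:
        value = fm.get(key)
        if isinstance(value, str) and value and (
            value.startswith("/") or value.endswith((".md", ".mdx"))
        ):
            errors.append(f"{key}: no leading slash and no .md/.mdx extension")
    return errors
```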

Whitepapers (private — catalog only)

These whitepapers live at dashecorp/rig-gitops/docs/whitepaper/*.md (private repo — requires gh auth to fetch). BRAIN.md surfaces their titles + 1-line summaries so agents know what exists. Full content must be fetched with: gh api /repos/dashecorp/rig-gitops/contents/docs/whitepaper/<file> --jq .download_url | xargs curl -sL.

Whitepaper index (index.md) — Entry point listing all whitepaper sections and their companion docs.
MVP scope (mvp-scope.md) — What the rig does in the minimum viable release. Gatekeeper for “is this in scope?”
Design principles (principles.md) — First principles (measurement precedes trust; honest gaps; provider portability).
Trust model (trust-model.md) — Who can approve what, which gates exist, human-in-the-loop rules.
Safety (safety.md) — Dangerous-command guards, sandboxing, blast-radius containment.
Security (security.md) — Secrets handling, attestation, audit trail, SOPS+age.
Agent secrets broker (agent-secrets-broker.md) — Capability-based secret lifecycle broker for LLM agents. Agents operate on opaque references; the broker handles plaintext across Bitwarden, GitHub, SOPS, k8s, and Cloudflare — plaintext never enters a prompt, tool argument, or log line. Covers tool surface (mint/store/deploy/rotate/ retire/verify/list/generate_and_deploy), destination ref grammar (gh:, gh-env:, sops:, k8s:, cf-worker:, bw:), policy model with hardware-key override, and append-only audit schema. Complementary to security.md (supply-chain: Sigstore/SLSA/Kyverno); covers the runtime secret-lifecycle layer.
Provider portability (provider-portability.md) — Multi-runtime (Claude Code, Codex CLI, Gemini CLI) via OTel GenAI conventions. Swap runtime without changing backend.
Observability — OTel, Langfuse, Prometheus, SLOs (observability.md) — Self-hosted Langfuse (agent traces) + Grafana Cloud (infra) + local Prometheus (SLO gates) hybrid. Native OTel via CLAUDE_CODE_ENABLE_TELEMETRY=1. OTel Collector runs per-cluster, routes LLM traces to Langfuse, infra to managed. Per implementation-status: OTel Collector “Partial” (deployed for rig-conductor, agents not yet emitting), Langfuse “Planned”, cost dashboard “Partial” (TokenUsageProjection exists, no LiteLLM proxy yet).
Cost framework (cost-framework.md) — Budget policy, per-model rate tables, cost attribution strategy. Companion to observability.
Self-healing (self-healing.md) — Automatic recovery loops, StaleHeartbeatService, escalation severity routing.
Memory architecture (memory.md) — Memory MCP scope, importance/hit_count model, promotion-to-docs threshold design.
Quality and evaluation (quality-and-evaluation.md) — How the rig evaluates its own output. Judge-agent pattern, fixed rubrics.
Drift detection (drift-detection.md) — Schema drift, docs drift, infra drift — detection thresholds and response.
Development process (development-process.md) — Issue → Epic → research → proposal → PR lifecycle, agent-human gates.
Example first story (example-first-story.md) — Worked walkthrough of one Epic end-to-end.
Glossary (glossary.md) — Rig-specific terminology (Epic, proposal, rig-conductor, Review-E, etc).
Known limitations (limitations.md) — Honest catalog of what the rig can’t do today.
Implementation status (implementation-status.md) — Single source of truth for deployed vs planned per capability. 78 tracked across 11 domains; 21 deployed/partial (27%), 44 planned/deferred (56%). Every capability named in the whitepapers gets a row with status + whitepaper section + ticket/evidence.
Tool choices (ADRs) (tool-choices.md) — Decision records for tooling. Includes rejection list with rationale.

Most agents should start with: the /implementation/ dashboard (structured per-capability status — see summary below) and whichever domain-specific whitepaper matches the Epic.

Capability status (38 in registry · full dashboard)

shipped:15 · partial:7 · planned:15 · deferred:0 (registry seed — full migration tracked in rig-docs#124)

Top blockers: default-deny-egress (dashecorp/rig-docs#57)

Multi-tenancy

Shared control plane + per-tenant siloed data plane. A single tenant_id is resolved ONCE at a trusted edge (GitHub installation_id for webhooks, Cloudflare Access identity for the dashboard, the conductor-issued session token for agents) and threaded immutably as a Marten event header — never asserted by the LLM, never read from a request body or a tool argument (“the LLM is the threat model, not the guard”).

Boundary: The database IS the tenant boundary. Marten master-table tenancy (a rig_control registry) routes each tenant to its own Postgres database (rig_t_<tenant>_evt); read models stay tenant-naive (no WHERE tenant_id) and isolation is enforced at the per-request connection, not a row filter.

Active tenants: invotek — tenant-0; the SOLE active tenant today (legacy rig_conductor database until the PR-5 cutover rename to rig_t_invotek_evt).

Target (gated, not yet onboarded): locked 3-tenant set of dashecorp (org-default host), invotek (today tenant-0), and stigjohnny (BYO, with its own org + GitHub App install), each with its own Discord channel.

Human gates:

Review — ReviewRoutingPolicy: a PR routes to Review-E iff authored by a known bot (dev-e-bot/ibuild-e-bot) or dependabot, OR it carries the needs-review operator opt-in label.
Merge — MergeGate (event-driven): merges only on review-approved + CI-passed + not-blocked; agents never self-merge (fallback is a manual operator squash).
Fail-closed write — RequireTenant throws UnattributedTenantException on a blank/invalid tenant, so an unattributed write lands in no tenant’s DB; a CI scope-guard blocks new unattributed write paths. Dashboard READS stay lenient (coalesce to invotek).
Onboarding — TenantOnboardingGate refuses activating an external tenant until its data-plane DB + erasure prerequisites exist (invotek exempt; preserves the GDPR no-window invariant).
Schema fence — env-gated (MARTEN_TENANT_SCHEMA_LOCKED): armed only at the PR-5 cutover window, then the boot assert checks EVERY registered tenant’s schema and fails closed on drift; default-off and behavior-neutral until armed.
GDPR — dashecorp/rig-conductor#1486 is the launch blocker: DPA + sub-processor disclosure + EU residency + Art.17 erasure must be signed off before ANY tenant launches (first-party included).
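The fail-closed write versus lenient read split can be illustrated in a few lines. This is a Python analogue for illustration, not the C# RequireTenant code: `UnattributedTenantError` mirrors UnattributedTenantException, and treating "invalid" as "not in a known-tenant set" is an assumption.

```python
class UnattributedTenantError(Exception):
    """Raised when a write has no valid tenant attribution."""


KNOWN_TENANTS = {"invotek"}      # sole active tenant today
DEFAULT_READ_TENANT = "invotek"  # dashboard reads coalesce to this tenant


def require_tenant(tenant_id):
    """Fail closed: a blank or unknown tenant never reaches any tenant's DB."""
    if tenant_id is None or not tenant_id.strip() or tenant_id not in KNOWN_TENANTS:
        raise UnattributedTenantError(f"unattributed write: {tenant_id!r}")
    return tenant_id


def tenant_for_dashboard_read(tenant_id):
    """Lenient: reads fall back to the default tenant instead of failing."""
    try:
        return require_tenant(tenant_id)
    except UnattributedTenantError:
        return DEFAULT_READ_TENANT
```

The asymmetry is deliberate: a misattributed write is the harmful case, so writes raise, while a dashboard read degrades to the default tenant.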

Built (merged + deployed): DB-per-tenant (rc#1515), the fail-closed write boundary (rc#1608), tenant resolution (TenantMatch), per-tenant dispatch (rc#1481) + per-installation GitHub tokens (rc#1665), the cross-tenant isolation gate (rc#1617), the onboarding gate (rc#1614), per-tenant Discord notification routing (rc#1643/#1661/#1668), and the PR-5 schema-fence mechanism (rc#1685).

Pending (gated): pgvector per-tenant memory (#1478), secrets-broker tenant-prefixed refs (#1479, partner-gated), the PR-5 cutover execution, and the #1486 GDPR pack (launch blocker). invotek runs as the sole tenant until these clear.

Launch blocker: dashecorp/rig-conductor#1486. Canonical operating guide: https://research.rig.dashecorp.com/proposals/multi-tenancy/.

rig-conductor event types (POST /api/events)

All events from dashecorp/rig-conductor/src/ConductorE.Core/UseCases/SubmitEvent.cs MapToEvent switch. Names only here — fetch /events.md for full field schemas (no auth required).

Pipeline (issue → PR → merge → deploy): ISSUE_APPROVED, ISSUE_ASSIGNED, ISSUE_UNASSIGNED, WORK_STARTED, BRANCH_CREATED, PR_CREATED, CI_PASSED, CI_FAILED, REVIEW_ASSIGNED, REVIEW_PASSED, REVIEW_DISPUTED, HUMAN_GATE_TRIGGERED, HUMAN_GATE_REMINDER, MERGED, MERGE_GATE_WAITING, MERGE_GATE_MERGED, MERGE_GATE_TIMEOUT, MAIN_CI_STARTED, MAIN_CI_PASSED, MAIN_CI_FAILED, DEPLOYED_STAGING, DEPLOYED_PRODUCTION, SMOKE_PASSED, SMOKE_FAILED, BUILD_FAILED, VERIFIED, ISSUE_DONE, ESCALATED, MILESTONE_COMPLETE, DUPLICATE_PR_CLOSED

Direct PR path (no issue): PR_OPENED, PR_REVIEW_ASSIGNED, PR_REVIEW_APPROVED, PR_REVIEW_REJECTED

Agent lifecycle: AGENT_STARTED, HEARTBEAT, AGENT_STUCK

CLI sessions: CLI_STARTED, CLI_PROGRESS, CLI_COMPLETED

Observability (cost + tooling): TOKEN_USAGE, TOOL_USED

Memory MCP: MEMORY_WRITE, MEMORY_READ, MEMORY_HIT_USED
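An agent can use the groups above as a local allow-list before calling POST /api/events. A minimal sketch: the category keys and `category_of` are hypothetical names, the event names are copied from the lists above, and payload fields are deliberately omitted because their schemas live in /events.md.

```python
EVENT_CATEGORIES = {
    "pipeline": """
        ISSUE_APPROVED ISSUE_ASSIGNED ISSUE_UNASSIGNED WORK_STARTED BRANCH_CREATED
        PR_CREATED CI_PASSED CI_FAILED REVIEW_ASSIGNED REVIEW_PASSED REVIEW_DISPUTED
        HUMAN_GATE_TRIGGERED HUMAN_GATE_REMINDER MERGED MERGE_GATE_WAITING
        MERGE_GATE_MERGED MERGE_GATE_TIMEOUT MAIN_CI_STARTED MAIN_CI_PASSED
        MAIN_CI_FAILED DEPLOYED_STAGING DEPLOYED_PRODUCTION SMOKE_PASSED SMOKE_FAILED
        BUILD_FAILED VERIFIED ISSUE_DONE ESCALATED MILESTONE_COMPLETE DUPLICATE_PR_CLOSED
    """.split(),
    "direct-pr": "PR_OPENED PR_REVIEW_ASSIGNED PR_REVIEW_APPROVED PR_REVIEW_REJECTED".split(),
    "agent-lifecycle": "AGENT_STARTED HEARTBEAT AGENT_STUCK".split(),
    "cli": "CLI_STARTED CLI_PROGRESS CLI_COMPLETED".split(),
    "observability": "TOKEN_USAGE TOOL_USED".split(),
    "memory": "MEMORY_WRITE MEMORY_READ MEMORY_HIT_USED".split(),
}


def category_of(event_type: str):
    """Return the group an event name belongs to, or None if it is not listed."""
    for category, names in EVENT_CATEGORIES.items():
        if event_type in names:
            return category
    return None


total_event_types = sum(len(names) for names in EVENT_CATEGORIES.values())
```

A `None` result is a cheap signal that an event name is misspelled or newer than this snapshot, in which case /events.md is the reference to check.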

Known gaps (rig backlog)

Cold-start agents should see these so they don’t re-discover what’s already identified. Each gap links to prior_art — existing stubs, research, or PRs that have already touched it. When a gap is being worked, linked_user_story points to the user story; when closed, the entry is removed from facts/backlog.yaml.

[observability] Cost tracking mostly deployed — LiteLLM proxy + external access are the remaining gaps

DO NOT propose “build a cost pipeline” — most of it is already shipped:

Data pipeline: TokenUsageProjection + CostProjection in rig-conductor consume TOKEN_USAGE + CLI_COMPLETED events. Read models live on Marten/Postgres.
API: GET /api/usage, /api/costs/issue, /api/costs/summary, /api/costs/daily on the rig-conductor cluster-internal URL (see BRAIN.md Published surfaces). Query by agent, repo, date range.
Dashboard: src/ConductorE.Api/Dashboard.html (~42 KB SPA, “Engineering Rig — Control Plane”). Served at / and /dashboard. Has a Costs tab driven by the /api/costs/* endpoints.

The remaining gaps:
a. LiteLLM proxy: not deployed. Blocks hard budget enforcement (agent ceiling kill-switch).
b. External access: /dashboard is cluster-internal. A human on a laptop can’t view it without kubectl port-forward or a Cloudflare tunnel. Consider publishing a read-only projection.
c. Alerting: no Discord webhook on cost threshold breach yet.

Rough current spend: ~$5-15/day fleet-wide (order-of-magnitude only).

Prior art:

rig-conductor cost endpoints and Dashboard.html — dashecorp/rig-conductor src/ConductorE.Api/
TokenUsageProjection + CostProjection source: dashecorp/rig-conductor src/ConductorE.Api/Adapters/MartenProjections.cs
TOKEN_USAGE + CLI_COMPLETED events defined and emitted — see /events.md
Cost framework design: rig-gitops/docs/whitepaper/cost-framework.md (private)
Observability whitepaper: rig-gitops/docs/whitepaper/observability.md (private; summary in facts/whitepapers.yaml)
LiteLLM proxy not yet deployed — blocks hard budget enforcement

Status: mostly-deployed

[observability] OTel collector deployed for rig-conductor only — agents not yet emitting

OpenTelemetry Collector is “Partial”: deployed for rig-conductor; agent pods (Dev-E, Review-E, iBuild-E) have not yet enabled native OTel via CLAUDE_CODE_ENABLE_TELEMETRY=1. Langfuse (self-hosted) and Grafana Cloud ingest are both “Planned”. Full design in the observability whitepaper.

Prior art:

Observability whitepaper: rig-gitops/docs/whitepaper/observability.md (private; summary in facts/whitepapers.yaml)
Implementation status: whitepaper/implementation-status.md marks OTel Collector ‘Partial’, Langfuse ‘Planned’
rig-memory-mcp/events.js FUTURE comment: migrate to OTel GenAI spans
Env var to enable native OTel: CLAUDE_CODE_ENABLE_TELEMETRY=1 + OTEL_EXPORTER_OTLP_ENDPOINT pointed at the in-cluster collector
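
A minimal sketch of the environment an agent pod would need for native OTel, using the two variables named above. The collector address is a placeholder, not the real in-cluster endpoint, and the helper name `otel_env` is hypothetical.

```python
def otel_env(collector_endpoint):
    """Return the env vars that turn on native OTel export for an agent."""
    if not collector_endpoint.startswith(("http://", "https://")):
        raise ValueError("collector endpoint must be an http(s) URL")
    return {
        "CLAUDE_CODE_ENABLE_TELEMETRY": "1",
        "OTEL_EXPORTER_OTLP_ENDPOINT": collector_endpoint,
    }


# Placeholder endpoint; substitute the in-cluster collector address.
env = otel_env("http://otel-collector.example.internal:4317")
```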

Status: partial

[docs-memory] Docs-memory drift lint not implemented

A weekly LLM-as-judge pass that promotes memory→docs (when importance≥4 AND hit_count≥5), flags stale research, and catches orphan docs. It is designed, but no runtime has been built.
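
The promotion threshold can be sketched as a plain predicate. This covers only the numeric gate, not the LLM-as-judge step. The `Memory` record shape and the sample keys are assumptions; only the two thresholds come from the design.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    # Hypothetical record shape; only the two thresholds come from the design.
    key: str
    importance: int
    hit_count: int


def should_promote(memory, min_importance=4, min_hits=5):
    """A memory qualifies for promotion to docs when both thresholds are met."""
    return memory.importance >= min_importance and memory.hit_count >= min_hits


candidates = [
    Memory("flux-reconcile-tip", importance=5, hit_count=7),
    Memory("one-off-debug-note", importance=4, hit_count=2),
    Memory("low-value-but-popular", importance=2, hit_count=9),
]
promoted = [m.key for m in candidates if should_promote(m)]
```

Only the first candidate passes: the other two each satisfy one threshold but not both.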

Prior art:

Full design in research/2026-04-18-docs-memory-drift-lint
Parent user story: user-stories/2026-04-18-docs-memory-strategy
Principles synthesis: research/2026-04-18-docs-vs-memory-principles

Linked user story: user-stories/2026-04-18-docs-memory-strategy

Status: open

[docs-surfaces] Two docs surfaces with overlapping scope

Two sites host rig docs: docs.rig.dashecorp.com (MkDocs aggregation from rig-gitops/docs-site/) and research.rig.dashecorp.com (Starlight research hub from dashecorp/rig-docs). The boundary between them is not formalised, so agents currently learn it empirically. Eventually unify the sites or formalise the split.

Prior art:

MkDocs site built by dashecorp/rig-gitops/scripts/build-docs.sh
Starlight site defined in dashecorp/rig-docs/ (this repo)
Docs tooling decision: decisions/2026-04-18-docs-tooling-decision (picked Starlight for research hub; MkDocs kept for aggregation)

Status: open

[deployment] CLOUDFLARE_API_TOKEN / CLOUDFLARE_ACCOUNT_ID not in rig-docs repo secrets

The deploy workflow gracefully skips the deploy step when the secrets are absent (it emits a notice only). Current deploys happen via a direct wrangler pages deploy from the operator’s laptop. Adding the secrets would enable per-PR preview deploys and automatic main-branch publishing.

Prior art:

.github/workflows/deploy.yml has the has_cf_secrets guard
Cloudflare Pages project already exists: rig-research (created via wrangler)

Status: open

[agents] ATL-E retired, no active coordinator agent

ATL-E (a legacy personal-org atl-agent repo) was previously deployed as a k3s CronJob on a personal host and handled handoff-stall Discord notifications. As of ~2026-03-26 it is no longer deployed (not present in the operator’s personal-org cluster GitOps manifests). The repo still exists but is dormant. If an Epic needs a coordinator/team-lead role, decide whether to redeploy ATL-E or build a replacement.

Prior art:

Dormant personal-org atl-agent repo (last push 2026-03-26)
Operator’s personal-org cluster GitOps repo — no atl-agent ArgoCD manifest

Status: open

[networking] iBuild-E cannot reach rig-conductor cluster-internal API

Empirically verified on 2026-04-19: from iBuild-E (Mac Mini, Oslo, on the operator’s Tailnet), curling the conductor’s in-cluster API endpoint (/api/health) fails with a DNS resolution timeout. The cluster-internal DNS name only resolves inside the k3s cluster via CoreDNS; Tailscale connects the host but doesn’t federate cluster DNS.
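
The failure mode can be sketched as a resolution check that separates "name does not resolve" from "name resolves". The hostname below is hypothetical, and the resolver is stubbed so the example runs without a network; it simulates a host outside the cluster rather than reproducing the measured timeout.

```python
import socket


def classify_reachability(hostname, resolver=socket.getaddrinfo):
    """Return 'resolves' or 'dns-failure' for a hostname.

    The resolver is injectable so the check can be exercised offline.
    """
    try:
        resolver(hostname, 80)
    except OSError:  # socket.gaierror and socket.timeout are OSError subclasses
        return "dns-failure"
    return "resolves"


def fake_outside_resolver(hostname, port):
    # Simulates a host outside the cluster: cluster-local names do not resolve.
    if hostname.endswith(".svc.cluster.local"):
        raise socket.gaierror("name or service not known")
    return [("stub-address", port)]


# Hypothetical service name, used only for illustration.
status = classify_reachability(
    "conductor.example.svc.cluster.local", resolver=fake_outside_resolver
)
```

Run with the default resolver from a given host, the same function would show whether that host can resolve the conductor's name at all, which is the question the measurement answered.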

Impact: iBuild-E today cannot:

Send TOKEN_USAGE / HEARTBEAT / CLI_COMPLETED events (POST /api/events)
Pick up assignments (GET /api/assignments/next)
Reach the cost Dashboard or /api/costs/*

iBuild-E is effectively disconnected from rig-conductor coordination. She operates from GitHub issues + Discord channels directly.

Fix options (none implemented):

a. Tailscale subnet router on a cluster node → expose the cluster service range
b. Ingress / GCP load balancer for the conductor API with mTLS
c. Cloudflare tunnel into the cluster
d. Accept the gap: iBuild-E never sees rig-conductor; she runs on GitHub-only flows

This has been a chronic “unknown” flagged by every cold-start test (v1 through v5). Now measured.

Prior art:

facts/agents.yaml — iBuild-E: deployed_in: Mac Mini (Oslo, on the operator’s Tailnet)
curl to the conductor’s in-cluster API endpoint → DNS resolve timeout after 3s (measured 2026-04-19)
Every cold-start test session-log flagged ‘iBuild-E routing through cluster-internal services — latency unknown’. Not latency — reachability. Zero, not high.

Status: open

[cleanup] Plane residue — uninstall GitHub App + archive workspace

Plane was retired 2026-04-18 but the makeplane GitHub App is still installed on the dashecorp org, and the Plane workspace at app.plane.so is still alive (token revoked). Manual UI action needed.

Prior art:

Retraction decision: decisions/2026-04-18-docs-tooling-decision (What retires section)
Retirement commit: dashecorp/infra PR #74

Status: open

Architecture at a glance

flowchart LR
  H[Human]

  subgraph Code["Code repos"]
    RD[rig-docs]
    RG[rig-gitops]
    RAR[rig-agent-runtime]
    CE_R[rig-conductor]
    RMM_R[rig-memory-mcp]
    RT[rig-tools]
    INF[infra]
  end

  subgraph Deployed["Deployed services + agents"]
    direction TB
    CE[rig-conductor svc]
    RMM[rig-memory-mcp svc]
    DE[Dev-E pod]
    RE[Review-E cron]
    IB[iBuild-E — Mac Mini]
  end

  subgraph Publish["Published surfaces"]
    direction TB
    S1[research.rig.dashecorp.com<br/>Astro Starlight]
    S2[docs.rig.dashecorp.com<br/>MkDocs aggregator]
    CFP[Cloudflare Pages]
  end

  %% Authoring + dispatch
  H -->|user-story issue| RD
  RD -->|dispatch| CE
  CE -->|assign issue| DE
  CE -->|assign PR review| RE
  CE -->|assign iOS build| IB
  DE -->|author PR| RD
  RD -->|PR opens| RE
  RE -->|approve / request changes| RD
  RD -->|merge| CFP
  CFP -->|publish| S1
  RG -->|docs aggregation| S2

  %% MCP + memory
  DE -->|tool use| RMM
  RE -->|tool use| RMM
  IB -->|tool use| RMM
  RMM_R -.implements.-> RMM

  %% Flux GitOps
  RG -->|Flux deploys| CE
  RG -->|Flux deploys| RMM
  RG -->|Flux deploys| DE
  RG -->|Flux deploys| RE

  %% Runtime image used by all agent deployments
  RAR -.image.-> DE
  RAR -.image.-> RE
  RAR -.image.-> IB
  CE_R -.image.-> CE

  %% Per-repo docs/ feeding into the MkDocs aggregator
  RG -.docs/.-> S2
  RAR -.docs/.-> S2
  CE_R -.docs/.-> S2
  RMM_R -.docs/.-> S2
  RT -.docs/.-> S2

  %% Infra — outside the loop but manages everything above
  INF -.OpenTofu.-> CFP

Legend: solid arrows are runtime flows (dispatch, tool calls, deploys). Dashed arrows are source-of relationships — “this repo’s image powers that pod” or “this repo’s docs/ feeds that site”. Every rig repo from facts/repos.yaml is represented.

Conventions (rig-wide)

Docs are markdown with YAML frontmatter. Required fields: title, description, type, audience, created/updated, topic. See AGENTS.md in this repo.
Bidirectional linkage. User story ↔ research ↔ proposal → decision via research_docs, proposal, user_story, source_research, supersedes/superseded_by. RelatedDocs component renders the graph.
Diagrams as code. Mermaid source inline in markdown. No PNG or SVG committed. Source preserved post-render via <details> blocks.
Per-repo CLAUDE.md auto-loads when Claude Code starts a session in that repo’s cwd (Claude Code reads CLAUDE.md, not AGENTS.md — cross-vendor standard is AGENTS.md but the loader is CLAUDE.md). Same-repo local @AGENTS.md imports work; cross-repo @owner/repo/file does not fetch from GitHub (filesystem-only, max 5 hops).
Rig-wide agent instructions live in TWO places: (1) each running agent’s HelmRelease character.personality prompt (authoritative for Dev-E, Review-E in-cluster), (2) each repo’s root CLAUDE.md (authoritative for interactive sessions). Both include the BRAIN.md fetch at session start.
Closes #N required in PR bodies. Review-E blocks on this.
Memory MCP scope: operational / ephemeral state only. Durable knowledge goes to rig-docs.
Default to a two-PR split for feature work >500 LOC. large-pr-ok is reserved for migrations, codemods, dependency bumps, and generated code — not feature work that decomposes into policy + adapter. A/B-validated 2026-05-18: same code shipped as a labelled single PR got zero code-level feedback; the disciplined split caught 3 real bugs. Rig-side enforcement in rar#492; full decision tree in research/2026-05-18-pr-size-and-large-pr-ok-semantics.
Behavior PRs ship their doc updates in the same PR. Per-file convention: when src/<X>.{cs,js,ts,go,py,...} changes, docs/<X>.md (if it exists) updates alongside. Rig-side enforcement in rar#497 (detectDocMismatches surfaces a warning in the size-gate review body).
Three-layer drift-prevention playbook. When the operator catches the orchestrator drifting on a discipline, and the drift is recurring, structurally observable, and carries a measurable cost, ship all three layers: an L1 memory rule, L2 rig-side enforcement at the trigger point, and an L3 durable artifact. Three instances were codified the week of 2026-05-18 (PR-split shortcut, doc-staleness, main-guard rig-internal dispatch). Meta-playbook in research/2026-05-18-three-layer-drift-prevention-playbook.
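
Two of the PR conventions above, the Closes #N requirement and the 500-LOC split, can be sketched as a single check. This is an illustration, not the enforcement shipped in rar#492. The `kind` parameter, its values, and the message wording are assumptions.

```python
import re

CLOSES_RE = re.compile(r"\bCloses #\d+\b")
SIZE_LIMIT_LOC = 500
# Change kinds the conventions list as legitimate uses of large-pr-ok
# (the string values themselves are hypothetical).
LARGE_PR_OK_KINDS = {"migration", "codemod", "dependency-bump", "generated"}


def check_pr(body, changed_loc, labels=(), kind="feature"):
    """Return a list of convention problems for a PR (empty means it passes)."""
    problems = []
    if not CLOSES_RE.search(body):
        problems.append("missing 'Closes #N' in PR body")
    if changed_loc > SIZE_LIMIT_LOC:
        if "large-pr-ok" not in labels:
            problems.append("over 500 LOC: split into two PRs")
        elif kind not in LARGE_PR_OK_KINDS:
            problems.append("large-pr-ok is not meant for feature work")
    return problems


ok = check_pr("Adds adapter. Closes #12", changed_loc=180)
too_big = check_pr("Closes #13", changed_loc=900, labels=("large-pr-ok",))
```

The second call shows the case the convention targets: a feature PR that carries the label but still should have been split.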

Token-efficient cold start

When you pick up a new Epic with blank memory, the cheapest order of operations:

Fetch this file (https://research.rig.dashecorp.com/BRAIN.md, public, no auth) — ~27 KB.
Fetch /llms.txt for the research hub topic index — ~2 KB.
Identify 1-3 relevant research / proposal docs, fetch raw — ~5-15 KB.
Fetch target repo’s AGENTS.md (each repo’s is ≤8 KB) — ~5 KB.
read_memories from Memory MCP scoped to repo + topic — ~2 KB.

Total cold-start context: ~40-50 KB (the sum of the estimates above). That leaves the rest of the context budget for actual work.
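
The per-step estimates can be summed to sanity-check the total. The figures below are the approximations from the list above, so treat the result as order-of-magnitude only.

```python
# (low, high) estimates in KB, copied from the steps listed above.
STEPS_KB = {
    "BRAIN.md": (27, 27),
    "llms.txt": (2, 2),
    "research/proposal docs": (5, 15),
    "AGENTS.md": (5, 5),
    "read_memories": (2, 2),
}

low = sum(lo for lo, _ in STEPS_KB.values())
high = sum(hi for _, hi in STEPS_KB.values())
```

With these figures the bounds come out at 41 and 51 KB.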

When this file needs updating

Manual fields that live in facts/*.yaml — update when the matching reality changes:

facts/repos.yaml — annotations only (purpose, depends_on, used_by, agents_md, docs_surface). The repo list itself is auto-derived from gh api on every compile. Adding a new annotation, or updating an existing one, happens here.
facts/surfaces.yaml — URLs, API endpoints, MCP tools. Update when an endpoint changes or a new surface is published.
facts/agents.yaml — agent deployment instances. Compile validates that each manifest path exists on GitHub and warns on drift (this is how the ATL-E retirement was caught).
facts/flows.yaml — documented rig processes. Update after retrospectives.
facts/schema.yaml — mirrors the Zod schema in src/content.config.ts. Keep in sync manually when the schema changes.
facts/events.yaml — rig-conductor event types. Keep in sync with MapToEvent in the C# source.
facts/backlog.yaml — known gaps. Add when identified; remove when closed.

Then run npm run brain. CI (build workflow) runs brain:check and fails on drift.
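
A drift check of this kind amounts to regenerating the file and diffing it against the committed copy. The sketch below shows that idea with Python's `difflib`. It is not the actual brain:check implementation, and the sample file contents are invented.

```python
import difflib


def check_drift(committed, regenerated):
    """Return a unified diff between committed and regenerated text.

    An empty list means no drift; a CI job would fail on anything else.
    """
    return list(
        difflib.unified_diff(
            committed.splitlines(),
            regenerated.splitlines(),
            fromfile="BRAIN.md (committed)",
            tofile="BRAIN.md (regenerated)",
            lineterm="",
        )
    )


# Invented contents standing in for the committed and freshly compiled file.
committed = "Rig Brain\nrepos: 7\n"
regenerated = "Rig Brain\nrepos: 8\n"
drift = check_drift(committed, regenerated)
exit_code = 1 if drift else 0
```

Returning the diff rather than a boolean lets the CI log show exactly which generated lines went stale.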