Multi-Tenancy — Serving Multiple Isolated Customers on The Rig
!!! abstract “TL;DR”
Run The Rig as a shared control plane + a per-tenant siloed data plane (the AWS SaaS “bridge” posture), keyed on a single tenant_id resolved once at a trusted edge — GitHub installation_id for webhooks, Cloudflare Access identity for the dashboard, the conductor-issued session token for agents — and threaded immutably through every layer. It is never asserted by the LLM, never read from a request body or a tool argument. Pool the expensive orchestration (one conductor process, one runtime image, one Postgres instance, one k3s cluster); silo only the bytes a prompt or a customer can perceive (GitHub org/App-install, Marten event database, agent namespace + session, pgvector memory database, secrets, notifications, dashboard queries). This is the cheapest design that still gives hard isolation for a solo-plus-agents team, and it answers the open questions deferred by the two-layer-brain / multi-project user story. Invotek is tenant #0; the first external partner is the real productization milestone.
Current state (as built — 2026-06-13)
The design sections that follow are the original 2026-06-07 plan. Phase-1 is now substantially built and merged. This section is the consolidated operating guide for the system as it runs today; the per-change rationale lives in the decision records `dashecorp/rig-conductor/docs/2026-06-*-multi-tenancy-*.md`.
How it works now
- `tenant_id` resolution (`TenantMatch.Resolve`, pure Core): repo → `RepoTenant` row (authoritative) → `installation_id` → org login; active-only, reject-unknown (never auto-provisions); default tenant `invotek`. Resolved once at the webhook edge and carried as a Marten event header; never LLM-asserted, never read from a payload/tool arg.
- DB-per-tenant (Marten master-table tenancy, rc#1515): a control DB `rig_control` holds the tenant→database registry (`mt_tenant_databases`) and the allowlist (`Tenant`/`RepoTenant` docs via a dedicated single-tenant `IControlStore`, PR-4L); each tenant’s events live in its own `rig_t_<id>_evt` database. tenant-0 `invotek` is the legacy `rig_conductor` DB until the PR-5 cutover renames it. The database is the tenant boundary (separate-DB tenancy, no row-filtering).
- Fail-closed write boundary (`RequireTenant`, rc#1608): the write session (`IDocumentSession`) throws `UnattributedTenantException` on a blank/invalid tenant, so an unattributed unit of work lands in no tenant’s DB. Dashboard reads (`IQuerySession`) stay lenient (coalesce to `invotek`). The CI guard `scripts/check-tenant-scope-guard.sh` fails the build on any new unattributed write path.
- Per-tenant dispatch (rc#1481: #1671/#1676/#1679/#1682): `RepoDiscoveryService` discovers each active tenant’s repos per-org (assigned via `TenantMatch.Resolve`); `ReviewScanService`/`IssueScanService` iterate tenants (`ITenantWorkRunner.ForEachTenantAsync`) using `GetRepos(tenant)` + that tenant’s token (`TenantGitHubToken`: the tenant’s GitHub-App installation token, else the global PAT for invotek; byte-identical for single-tenant).
- Per-installation GitHub tokens (rc#1665): one shared App (`dev-e-bot`) mints a short-lived installation token per tenant org. Broker-independent (the broker only later moves the PEM out of env).
- Per-tenant memory (rig-memory-mcp, rc#1478): pgvector DB-per-tenant (`rig_t_<id>_mem`), fail-closed startup, no shared table (cross-tenant RAG bleed is the #1 risk).
- The 3-tenant target (operator-locked): dashecorp (Model-B host, the dashecorp-org default: `Tenant.GithubOrg=dashecorp`); invotek (Model-B hosted-for-cost, resolved via `RepoTenant` rows; today’s tenant-0); stigjohnny (Model-A BYO: own GitHub org + own App installation). Each gets its own Discord channel.
- The cutover (PR-5): renames `rig_conductor` → `rig_t_invotek_evt`, protected by the env-gated schema fence (rc#1684/#1685). It is an operator maintenance window, not autonomous.
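The fail-closed write boundary and the lenient read path can be sketched as follows. This is a minimal Python illustration, not the conductor's actual C# code: `require_tenant`, `coalesce_for_read`, `WriteSession`, and the tenant-id pattern are hypothetical names and assumptions; only the behavior (writes throw on a blank or invalid tenant, reads coalesce to `invotek`) comes from the text above.

```python
import re
from typing import Optional

# Assumed tenant-id shape (lowercase alphanumerics); the real convention is not given here.
_TENANT_ID = re.compile(r"[a-z][a-z0-9]{1,30}")


class UnattributedTenantError(Exception):
    """Hypothetical stand-in for UnattributedTenantException."""


def require_tenant(tenant_id: Optional[str]) -> str:
    """Write path: fail closed on a blank or malformed tenant."""
    if tenant_id is None or not _TENANT_ID.fullmatch(tenant_id):
        raise UnattributedTenantError(f"refusing unattributed write: {tenant_id!r}")
    return tenant_id


def coalesce_for_read(tenant_id: Optional[str], default: str = "invotek") -> str:
    """Read path: lenient, falls back to the default tenant."""
    if tenant_id is not None and _TENANT_ID.fullmatch(tenant_id):
        return tenant_id
    return default


class WriteSession:
    """Toy unit of work that cannot be opened without a valid tenant."""

    def __init__(self, tenant_id: Optional[str]) -> None:
        self.tenant_id = require_tenant(tenant_id)
        self.pending: list = []

    def append(self, event: dict) -> None:
        # The tenant comes from the session, never from the event payload.
        self.pending.append({**event, "tenant_id": self.tenant_id})
```

Putting the check in the session constructor means no code path can queue an event before attribution has succeeded, which is the property the CI guard is described as protecting.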
Discord channels & notification routing
Rig (operator-internal) channels, unchanged by multi-tenancy:
| Channel | Purpose |
|---|---|
| `#tasks` | Conductor-E announcements of every filed issue / PR threads |
| `#admin` | Admin questions, blockers, alerts |
| `#review-e` | Review-E activity; `#review-e-logs` for its raw logs |
| `#codi-e` / `#pi-e` / `#volt-e` / `#ibuild-e` / `#iclaw-e` | Per-agent private channels |

(Full IDs are in the dashecorp infra docs, `docs/infrastructure/agents.md`.)
Per-tenant channels (multi-tenancy): each tenant gets its own channel; routing is a per-tenant decision the conductor makes, fail-closed + forge-proof:

- `NotificationRoutingPolicy` (rc#1643, pure Core): the tenant comes from the server-resolved event header ONLY (never a payload/LLM field); unattributed / no-sink / mismatch → drop + alarm, never deliver to the wrong tenant.
- `IDiscordWebhookResolver` + `EnvDiscordWebhookResolver` (rc#1661): resolves a tenant’s webhook from `TENANT_DISCORD_WEBHOOK__<id>`; disabled-until-configured; forge-proof (a ref whose `tenant-<id>` ≠ the caller’s tenant resolves to `null`, never another tenant’s URL).
- `TenantNotificationDiscordRelay` (rc#1668): the outbound consumer (per-tenant Valkey cursor → classify severity → resolve sink → POST); a double dormancy gate keeps a tenant provably silent until its webhook is configured; alarms carry tenant + reason + event-type only (no payload leak).
- The webhook URL is stored as a broker SecretRef (`sops:tenant-<id>/discord-webhook`), never raw (broker rc#1479 is partner-gated; today the env var is the seam).
- External partners do NOT join the shared Discord server: any guild member can enumerate the channel list + roster, leaking other tenants’ existence. Flip an external tenant’s sink to a per-tenant webhook / Slack Connect (structurally bilateral).
- Today: invotek’s `tenant-invotek` channel exists (Dashecorp Agents guild, `1513515150169083905`) but its webhook is not minted yet, so the relay is dormant; invotek’s notifications still flow via the legacy rig channels above. Minting the webhook + storing the SecretRef makes the per-tenant relay live.
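The forge-proof, disabled-until-configured behavior of the webhook resolver might look roughly like this. It is a sketch under assumptions: `resolve_webhook` is a hypothetical function, the environment is passed in as a plain mapping, and the tenant id is assumed to be appended to `TENANT_DISCORD_WEBHOOK__` unchanged.

```python
import re
from typing import Mapping, Optional

# Ref shape taken from the text: sops:tenant-<id>/discord-webhook
_REF = re.compile(r"sops:tenant-([a-z0-9]+)/discord-webhook")


def resolve_webhook(ref: str, caller_tenant: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the caller's own webhook URL, or None; never another tenant's URL."""
    match = _REF.fullmatch(ref)
    if match is None or match.group(1) != caller_tenant:
        # Malformed or forged ref: resolve to nothing rather than to someone else's sink.
        return None
    # The lookup key is built from the caller's tenant, not from the ref text.
    url = env.get(f"TENANT_DISCORD_WEBHOOK__{caller_tenant}", "").strip()
    return url or None  # unset or blank means the tenant stays dormant
```

Because the lookup key is derived from the server-resolved caller tenant, even a bug in the ref parser cannot select a different tenant's variable.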
Human gates (how they work now)
Every change passes these gates; each is fail-closed. None can be asserted by an LLM; they are server/operator-side.

- Review gate: a PR is routed to Review-E iff its author is a known agent bot (`dev-e-bot`/`ibuild-e-bot`) or `dependabot`, or it carries the `needs-review` label (the operator opt-in, since the `review-e-dashecorp` reviewer was retired). Operator-authored PRs MUST add `needs-review` to be reviewed. (`ReviewRoutingPolicy`; rc#1635.)
- Merge gate: `MergeGate` is event-driven. It merges only when review-approved AND CI-passed AND not-blocked (24h TTL), into the PR’s base (always `main`). If those events are missed (e.g. a degraded window), the documented fallback is a manual squash-merge by the operator. Agents never self-merge.
- Fail-closed write gate: `RequireTenant` (rc#1608) + the `check-tenant-scope-guard.sh` CI guard. An unattributed write throws and CI fails, so no event ever lands in the wrong (or no) tenant DB.
- Onboarding gate: `TenantOnboardingGate.EnsureCanActivate` (rc#1493) is the single chokepoint that seeds a `Tenant` row. It refuses to mark an external tenant `active` until its per-tenant erasure prerequisites exist (provisioned per-tenant DB + backup-erasure path + read-model `TenantId`). tenant-0 `invotek` is exempt (the controller, already live). This structurally prevents a second live tenant before its data plane exists: the “no-window” invariant.
- Schema fence (cutover): rc#1684/#1685, armed only at the PR-5 cutover window. When armed, boot fails closed (the host stops) if any registered tenant’s schema is missing/mismatched, so a misfired rename can’t silently re-create an empty schema. Default-off and behavior-neutral until armed.
- GDPR gate: rc#1486 (LAUNCH BLOCKER). No external tenant launches until the DPA, sub-processor disclosure, EU-residency, and Art.17 erasure path are signed off. The operator confirmed this applies even to the first-party tenants (dashecorp / invotek-hosted / stigjohnny). DPO question pack: `rig-conductor/docs/2026-06-08-multi-tenancy-gdpr-dpo-questions.md`.
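The review and merge gates above reduce to two small predicates. The sketch below is illustrative only: the class and function names are hypothetical, author logins are used exactly as written in the text (real GitHub logins may carry a `[bot]` suffix), and the 24h TTL is left out because the text does not say what it bounds.

```python
from dataclasses import dataclass

# Authors whose PRs are routed to review automatically, as named in the text.
REVIEWED_AUTHORS = frozenset({"dev-e-bot", "ibuild-e-bot", "dependabot"})
OPT_IN_LABEL = "needs-review"


@dataclass(frozen=True)
class PullRequest:
    author: str
    labels: frozenset = frozenset()


@dataclass(frozen=True)
class GateState:
    """Facts accumulated from server-side events, never from agent claims."""
    review_approved: bool = False
    ci_passed: bool = False
    blocked: bool = False


def routes_to_review(pr: PullRequest) -> bool:
    """Review gate: known bot author, or the operator opt-in label."""
    return pr.author in REVIEWED_AUTHORS or OPT_IN_LABEL in pr.labels


def may_merge(state: GateState) -> bool:
    """Merge gate: every condition must hold; the default state denies."""
    return state.review_approved and state.ci_passed and not state.blocked
```

Defaulting every field of `GateState` to the denying value means a missed event leaves the PR unmerged, which matches the documented manual squash-merge fallback.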
Motivation
The two-layer-brain / multi-project user story established knowledge scoping (an invariant rig brain + per-project brains) but explicitly deferred the isolation half to “when the second tenant onboards”:

- conductor-side routing + per-tenant cost attribution;
- dispatch scope beyond the hardcoded `org:dashecorp`;
- the undesigned details: access control, secret isolation, per-tenant event / notification boundaries (Discord per customer), and dashboards.
The Rig is being productized: Invotek is tenant #0, and onboarding a second, external partner is the milestone that forces these questions. The hard requirement: a client must never see another client’s code, issues, events, costs, memory, or notifications.
This is not ordinary web-SaaS multi-tenancy. The Rig’s data plane is an LLM context window, and the documented failure mode is organic cross-tenant retrieval/embedding bleed (research has reported a majority of benign queries leaking across tenants in a shared multi-tenant corpus). The governing principle of this proposal:
!!! warning “The LLM is the threat model, not the guard”
Tenant isolation must be a server-enforced data-plane boundary, never an instruction the model is trusted to honor. tenant_id is resolved by the platform and enforced in the database and the token layer. A prompt-injected agent that tries to reach another tenant must hit a 403/404, not a polite refusal.
The model: shared control plane, siloed data plane
Adopt the bridge posture, but push the silo line further down the stack than a typical web SaaS, because anything an agent’s prompt can read (or a customer can perceive) must be isolated.
```mermaid
graph TB
    classDef pool fill:#e3f2fd,color:#000
    classDef silo fill:#c8e6c9,color:#000
    classDef edge fill:#fff3cd,color:#000
    subgraph EDGE["Trusted edge — resolves tenant_id once"]
        WH[GitHub webhook → installation_id]:::edge
        CA[Cloudflare Access identity]:::edge
        SS[Conductor session token]:::edge
    end
    subgraph CP["SHARED control plane (one of each)"]
        CONDUCTOR[rig-conductor process + dispatch loop]:::pool
        RUNTIME[rig-agent-runtime image / Helm]:::pool
        INFRA[KEDA · Flux · Cloudflare · k3s · one Postgres instance]:::pool
        DASHAPP[Dashboard app]:::pool
    end
    subgraph DP["SILOED data plane (per tenant)"]
        GH[GitHub org + App installation]:::silo
        DB[Marten event database rig_t_tenant]:::silo
        NS[Agent k8s namespace + session PVC]:::silo
        MEM[pgvector memory database]:::silo
        SEC[Tenant-prefixed secret refs]:::silo
        SINK[Notification sink]:::silo
    end
    WH --> CONDUCTOR
    CA --> DASHAPP
    SS --> RUNTIME
    CONDUCTOR --> DB
    CONDUCTOR --> GH
    RUNTIME --> NS
    RUNTIME --> MEM
    RUNTIME --> SEC
    CONDUCTOR --> SINK
```
Rule of thumb: pure orchestration logic carrying no tenant payload is pooled; anything an agent prompt can read or a customer can perceive is siloed. This holds for ~2–10 tenants; revisit Marten conjoined (tenant-id-column) mode and a dedicated/silo tier only at many tenants or a contractual ask.
!!! danger “Do not build”
- Per-tenant copies of the conductor or runtime image — maintenance death for a tiny team. Pool the code, silo the data.
- Shared-schema-with-WHERE tenant_id as the only boundary — the forgotten-filter is the #1 documented leak class.
- A single Dev-E pod multiplexed across tenants — a long-lived shared context window is exactly the cross-tenant bleed surface.
The keystone: a server-resolved tenant_id
One change unblocks everything: a stable `tenant_id` resolved at a trusted source and used as the connection-resolution key for every layer.

- Webhooks → resolve from the GitHub payload `installation_id`/org before any domain logic. A misrouted webhook lands in no tenant rather than the wrong one.
- Dashboard/API → from the Cloudflare Access identity claim, mapped to a DB connection at handler entry (never a query param the caller can tamper with).
- Agents → stamped by the conductor into the dispatcher annotation + session token; the agent never resolves or carries it.

Gate auto-provisioning behind a `tenants` allowlist table so a typo can’t silently create a phantom tenant.
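A webhook-edge resolver with an allowlist and reject-unknown semantics could be sketched like this. It follows the resolution order given earlier in this document (repo row, then `installation_id`, then org login); the `Tenant` fields, `resolve_tenant`, and the choice that a repo row pointing at an inactive tenant rejects instead of falling through are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    github_org: Optional[str] = None
    installation_id: Optional[int] = None
    active: bool = True


class UnknownTenantError(Exception):
    """The webhook lands in no tenant rather than the wrong one."""


def resolve_tenant(
    repo: str,
    installation_id: Optional[int],
    org_login: Optional[str],
    tenants: Mapping[str, Tenant],      # allowlist: tenant_id -> Tenant
    repo_tenant: Mapping[str, str],     # authoritative rows: repo full name -> tenant_id
) -> Tenant:
    tenant: Optional[Tenant] = None
    mapped = repo_tenant.get(repo)
    if mapped is not None:
        # Authoritative: a repo row is never overridden by the weaker signals.
        tenant = tenants.get(mapped)
    else:
        if installation_id is not None:
            tenant = next(
                (t for t in tenants.values() if t.installation_id == installation_id), None
            )
        if tenant is None and org_login:
            tenant = next(
                (t for t in tenants.values()
                 if t.github_org and t.github_org.lower() == org_login.lower()),
                None,
            )
    if tenant is None or not tenant.active:
        # Never auto-provision: unknown or inactive means reject.
        raise UnknownTenantError(f"no active tenant for {repo!r}")
    return tenant
```

The function only reads the allowlist and has no code path that creates a tenant, so a typo in an org name ends in a rejection rather than a phantom database.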
Per-layer isolation
| Layer | Posture | Mechanism |
|---|---|---|
| GitHub | silo (free, highest-leverage) | Separate App installation per customer org; installation-scoped 1 h tokens physically cannot read another org. Conductor stores (tenant_id → org, installationId). The shared App private key becomes the top-protected secret (broker-held, rotated). |
| Event store | silo | Marten database-per-tenant on the one Postgres instance (rig_t_<tenant>); connection resolved per-request. Projections/dashboard SQL stay tenant-naive (no forgotten-WHERE). Also stamp tenant_id on the event envelope for self-describing audit + a non-breaking path to conjoined mode. |
| Dispatch / routing | pooled engine, tenant-aware loop | Replace hardcoded org:dashecorp (and Review-E’s author search filter) with the tenants allowlist the loop iterates; per iteration open that tenant’s DB + mint its installation token. |
| Agent pods | silo namespace, pool image | One k8s namespace per tenant (default-deny cross-namespace NetworkPolicy, per-tenant ServiceAccount + session PVC); KEDA scale-to-zero per persona. One pod = one tenant for its whole lifecycle; hard-wipe the session PVC at handoff. |
| Memory (pgvector) | silo | Database-per-tenant, not soft scope filters (filter-after-retrieval is the documented leak path). If a shared table is unavoidable: Postgres FORCE ROW LEVEL SECURITY + a non-superuser role + session-sourced tenant_id. Forbid cross-tenant promotion of “learnings”. |
| Secrets | silo (mostly free) | Tenant-namespaced refs (gh:tenant-b/…, sops:tenants/<id>/…, k8s:tenant-<id>/…). The broker resolves the tenant prefix from the session token and hard-rejects (403) any ref whose prefix ≠ the session’s bound tenant. Per-tenant SOPS age key (kustomize-patched per namespace). |
| Dashboard / API | shared app, tenant-scoped | tenant_id from the Access claim → DB connection at handler entry; a platform/admin scope reads aggregate for billing. Access-policy → tenant map managed in OpenTofu. |
| Two-layer brain | fits cleanly | Rig brain = pooled control-plane fact; per-project brain at <project>-docs.pages.dev/BRAIN.md = siloed, fetched only with that tenant’s installation token. Never inject tenant-A’s brain into a tenant-B session. |
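Since the `tenant_id` ends up inside database names, the connection-resolution step is also where the id has to be validated. The sketch below assumes the as-built naming from the current-state section (`rig_t_<id>_evt`, `rig_t_<id>_mem`); `tenant_database` and the id pattern are hypothetical.

```python
import re
from typing import Collection

# Assumed id shape; anything else must never reach a database identifier.
_TENANT_ID = re.compile(r"[a-z][a-z0-9]{1,30}")
_KINDS = ("evt", "mem")  # event store and pgvector memory


def tenant_database(tenant_id: str, kind: str, allowlist: Collection[str]) -> str:
    """Map an allowlisted tenant to its own database name, or refuse."""
    if kind not in _KINDS:
        raise ValueError(f"unknown database kind: {kind!r}")
    if not _TENANT_ID.fullmatch(tenant_id):
        raise LookupError(f"malformed tenant id: {tenant_id!r}")
    if tenant_id not in allowlist:
        # Reject unknown tenants instead of provisioning a database on first sight.
        raise LookupError(f"tenant not in allowlist: {tenant_id!r}")
    return f"rig_t_{tenant_id}_{kind}"
```

With the boundary expressed as a database name, projections and dashboard SQL can stay tenant-naive: there is no `WHERE tenant_id` clause to forget.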
Cross-tenant leakage & security (the non-negotiables)
These must ship before any external tenant’s data enters the system:

- `tenant_id` is forge-proof and server-resolved everywhere: derived once at the edge, passed as an immutable context object, never re-derived from a body/tool-arg/LLM output.
- Secrets-broker hard-reject (the keystone control): without it, a single poisoned issue/PR/Discord body telling an agent to fetch `gh:tenant-a/…` while serving tenant-b is game over; with it, the worst case is a `403`.
- Hard memory isolation: DB-per-tenant pgvector (or FORCE RLS), shipped before a 2nd tenant’s data lands.
- Single-tenant agent sessions: one pod = one tenant; never reuse a session/PVC across tenants; tenant-scoped system prompt + both brains.
- Per-tenant GitHub App installation replacing the hardcoded `org:dashecorp` scope; webhook resolves tenant from `installation_id`.
- A cross-tenant isolation integration test in CI that gates every PR (assert tenant A can never touch tenant B’s DB; feed the notification formatter mixed-tenant events and assert each sink sees only its own).
- Event store + merge gate are tenant-scoped: a tenant-B approval can never satisfy a tenant-A PR.
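The secrets-broker hard-reject can be illustrated with a prefix check bound to the session. This is a sketch, not the broker's implementation: `Session`, `Forbidden`, and `authorize_ref` are hypothetical, and the regex accepts both ref spellings that appear in this document (`tenant-<id>/…` and `tenants/<id>/…`) as an assumption.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Tenant binding stamped by the conductor; the agent cannot change it."""
    tenant_id: str


class Forbidden(Exception):
    status = 403


# scheme:tenant-<id>/<path> or scheme:tenants/<id>/<path>
_REF = re.compile(r"(gh|sops|k8s):(?:tenant-|tenants/)([a-z0-9]+)/(.+)")


def authorize_ref(session: Session, ref: str) -> str:
    """Return the ref if it belongs to the session's tenant, else raise a 403."""
    match = _REF.fullmatch(ref)
    if match is None:
        raise Forbidden(f"ref is not tenant-namespaced: {ref!r}")
    if match.group(2) != session.tenant_id:
        # A prompt-injected request for another tenant's secret stops here.
        raise Forbidden("ref tenant does not match session tenant")
    return ref
```

The comparison uses only the session's bound tenant, so nothing in the issue body, tool arguments, or model output can widen what the agent is allowed to fetch.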
This proposal covers the runtime tenancy layer; it complements the Agent Secrets Broker proposal (secret lifecycle) and the trust-model / security whitepapers (gates, supply chain).
Client communication & notifications (the Discord question)
Treat “where notifications land” as a per-tenant routing decision the conductor makes, not a Discord topology decision. Add a `tenant_notification_sink (tenant_id, sink_type[discord|slack_connect|webhook|email], secret_ref, dashboard_base_url, severity_filter)` table and a single background consumer on the existing event stream that filters by `event.tenant_id`, resolves that tenant’s sink, formats, and delivers (Valkey for retry/dedupe). Onboarding a tenant becomes “insert a config row.”
- The dashboard is the system of record; notifications are deep-link-only pointers (never inline another tenant’s repo names/costs).
- Invotek / internal → Discord (channel per tenant) — fine while the operator is the only human consumer.
- If an external partner’s own staff need access → do NOT invite them to the shared Discord server. Any guild member can enumerate the channel list + member roster, which leaks other tenants’ existence. Flip that tenant’s sink to a per-tenant webhook or Slack Connect channel (structurally bilateral: tenant B is never in tenant A’s channel; developer buyers expect it).
- A separate Discord server per client adds bot/role/invite plumbing for zero extra hard isolation (the bot + operator see everything anyway) — only justified by a contractual demand for an isolated space.
- Universal fallback that never blocks onboarding: an EU-region email/webhook digest rendered from the same tenant-scoped read models.
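The routing decision described in this section can be sketched as a pure function over the sink table. The column names come from the `tenant_notification_sink` definition above; the event fields (`id`, `severity`), the severity ordering, the reading of `severity_filter` as a minimum level, and the deep-link path are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

SEVERITIES = ("info", "warning", "critical")  # assumed ordering, lowest first


@dataclass(frozen=True)
class Sink:
    tenant_id: str
    sink_type: str            # discord | slack_connect | webhook | email
    secret_ref: str
    dashboard_base_url: str
    severity_filter: str = "info"  # assumed: minimum severity delivered


@dataclass(frozen=True)
class Decision:
    deliver: bool
    reason: str
    sink: Optional[Sink] = None
    link: Optional[str] = None


def route(header_tenant: Optional[str], event: Mapping, sinks: Mapping[str, Sink]) -> Decision:
    """Decide delivery from the server-resolved header tenant only."""
    if not header_tenant:
        return Decision(False, "unattributed")   # drop + alarm
    sink = sinks.get(header_tenant)
    if sink is None:
        return Decision(False, "no-sink")        # drop + alarm
    if sink.tenant_id != header_tenant:
        return Decision(False, "mismatch")       # misconfigured row: drop + alarm
    severity = event.get("severity", "info")
    if severity not in SEVERITIES:
        return Decision(False, "filtered")
    if SEVERITIES.index(severity) < SEVERITIES.index(sink.severity_filter):
        return Decision(False, "filtered")
    # Deep-link-only pointer: no repo names or costs are inlined.
    link = f"{sink.dashboard_base_url.rstrip('/')}/events/{event['id']}"
    return Decision(True, "ok", sink, link)
```

Any `tenant_id` field inside the event payload is deliberately ignored, so a forged payload cannot redirect a notification to another tenant's sink.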
Cost & metering
The dominant variable cost is LLM tokens, and the provider invoice (Anthropic) aggregates at org level and cannot be disaggregated, so per-tenant cost must be metered at the application layer. The Rig already emits `TOKEN_USAGE` / `CLI_COMPLETED` events carrying `repo` but not tenant; stamping `tenant_id` at ingest turns these into true per-tenant COGS, grouped by the cost projection and exposed via `?tenantId=` on the cost/usage endpoints. Add a `tenant=unknown` alarm so unattributed cost never silently becomes margin leak.
Economics: the pooled control plane adds ~zero marginal infra per tenant (same instance, N logical databases; scale-to-zero pods; a free App install). Price cost-plus: a base/seat fee + a metered margin on token spend (~1.3–1.8× pass-through), with per-tenant budgets (hard $ cap) and quotas enforced at the runtime layer so a runaway tenant degrades only itself. Reconcile the provider invoice monthly (caching/rounding drift); a manual GET /api/costs/summary?tenantId= is sufficient for the first few tenants.
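Per-tenant COGS and the cost-plus price are simple arithmetic once events carry a tenant. In the sketch below, the `cost_usd` field, the function names, and the example figures are hypothetical; the `unknown` bucket and the markup on token spend follow the text above.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Iterable, Mapping


def cogs_by_tenant(events: Iterable[Mapping]) -> dict:
    """Sum token cost per tenant; events without a tenant go to 'unknown'."""
    totals: dict = defaultdict(Decimal)
    for event in events:
        tenant = event.get("tenant_id") or "unknown"
        totals[tenant] += Decimal(str(event["cost_usd"]))
    return dict(totals)


def unattributed_alarm(totals: Mapping) -> bool:
    """True when any cost could not be attributed to a tenant."""
    return totals.get("unknown", Decimal("0")) > 0


def monthly_invoice(base_fee: Decimal, token_cogs: Decimal, markup: Decimal) -> Decimal:
    """Cost-plus: base fee plus metered token spend times the markup multiple."""
    return (base_fee + token_cogs * markup).quantize(Decimal("0.01"))
```

As a worked example with made-up numbers, a 500.00 base fee and 200.00 of token spend at a 1.5× markup gives 500.00 + 200.00 × 1.5 = 800.00. `Decimal` is used instead of floats so the monthly reconciliation against the provider invoice does not accumulate rounding drift of its own.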
Compliance (EU / GDPR)
A launch blocker, not a nice-to-have: sending an external tenant’s code to Anthropic without a signed DPA + sub-processor disclosure is a compliance breach before any technical leak.
- Residency: pin Postgres / pgvector / k3s to EU regions (already EU — document it).
- Paperwork: publish the sub-processor list (Anthropic as LLM processor, GitHub, Cloudflare, GCP) and sign a DPA per external tenant (Rig = processor; tenant = controller of their code/issues).
- Right to erasure: `DROP` the tenant’s event DB + memory DB + PVCs + secrets + notification space. For the immutable event store, pair with crypto-shredding (per-tenant key held outside the store; deletion destroys the key). Note that encrypted PII is still PII, so the actual DB drop is what makes erasure defensible.
Phasing
!!! note “Phase 0 — Tenant-0 retrofit (Invotek; do first; zero behavior change)”
Add tenant_id to the event envelope and backfill all existing dashecorp events as tenant_id=invotek (must precede the second tenant’s first event or projections mis-attribute). Stand up the tenants + repo_tenant registry (allowlist) and the webhook tenant-resolver. Group cost/usage projections by tenant_id and add ?tenantId= to the cost/usage endpoints. Wire Invotek as sink_type=discord (existing channels unchanged). Add the tenant=unknown cost alarm. Net: the keystone is in, nothing changes for Invotek.
!!! note “Phase 1 — First external design partner (the productization milestone)”
Marten database-per-tenant + connection resolution. Dispatch loop + Review-E filter iterate the tenants allowlist; per-tenant installation tokens; webhook → tenant by installation_id. Partner installs the agent GitHub Apps on their own org. Namespace-per-tenant pods (default-deny NetworkPolicy, per-tenant SA/PVC, hard wipe at handoff). pgvector database-per-tenant. Tenant-prefixed broker refs + hard-reject guard + per-tenant SOPS key. Dashboard tenant_id from the Access claim. Ship the cross-tenant isolation integration test. Notification sink table + outbound consumer (webhook + email-digest); flip the partner to a per-tenant webhook/Slack Connect — not the shared Discord. GDPR pack signed. Give the partner a real (even if discounted) priced contract so metering + invoicing get exercised.
!!! note “Phase 2 — GA / general onboarding”
An OpenTofu tenant module so onboarding = one tofu apply + one merge (the manual GitHub App install on the customer org is the one accepted bootstrap step). Per-tenant budgets + quotas. Slack Connect sink on demand. Tiers in the table (shared default; build the dedicated/silo tier only when a regulated/high-ACV customer pays). Later: customer-facing portal, automated invoicing + provider reconciliation, self-serve signup, Marten conjoined mode if tenant count grows large, RLS as belt-and-suspenders inside each tenant DB, formal DPIA + pen-test once 2+ external tenants are live.
Risks
- The shared conductor is now the crown-jewel boundary: a tenant-resolution bug (wrong DB connection or installation token) leaks everything; a shared event-store outage hits all tenants. Mitigate: resolve `tenant_id` once at the edge, pass it immutably, tightest review gate on the conductor, the A-cannot-touch-B test. Acceptable at 2–3 tenants; revisit before scaling.
- LLM context bleed remains the highest-severity risk even with infra isolation; it is contained only if agents stay strictly single-tenant per session and never resume across tenants.
- Soft pgvector filters in the current `rig-memory-mcp` are an active liability: ship DB-per-tenant (or FORCE RLS) before tenant #2.
- Marten auto-provisioning a DB on first-seen `tenant_id` is a footgun: gate it behind the allowlist; reject unknown `tenant_id`.
- Discord roster/channel enumeration is the structural reason to keep the shared server operator-internal and flip external partners to webhook/Slack Connect.
- GDPR exposure is a launch blocker: missing DPA/sub-processor disclosure precedes any technical leak.
- Unattributed cost / provider-invoice drift: any event without `tenant_id` is eaten margin; the `tenant=unknown` alarm + monthly reconciliation are mandatory.
Open decisions (owner)
- Per-tenant App installation (recommended default) vs one App per tenant.
- Design-partner pricing (recommend a real, even if discounted, priced contract); token markup multiple + base fee.
- External-partner notification surface (webhook / Slack Connect / email): ask the partner; default to webhook/email so onboarding never blocks.
- GDPR ownership: who signs the DPA, the published sub-processor list, whether a DPIA is needed before partner #2.
- Whether/when to offer a dedicated silo tier, and at what price.
- The immutable `tenant_id` naming convention (becomes DB names, namespace suffixes, audit keys): bless it before backfill.
Relation to existing work
- Answers: two-layer-brain / multi-project user story (its deferred isolation, cost-attribution, and dispatch-scope questions).
- Complements: Agent Secrets Broker (extended here with tenant-namespaced refs + hard-reject), plus the trust-model and security whitepapers (gates, supply chain).