Multi-Tenancy — Serving Multiple Isolated Customers on The Rig

!!! abstract “TL;DR” Run The Rig as a shared control plane + a per-tenant siloed data plane (the AWS SaaS “bridge” posture), keyed on a single tenant_id resolved once at a trusted edge — GitHub installation_id for webhooks, Cloudflare Access identity for the dashboard, the conductor-issued session token for agents — and threaded immutably through every layer. It is never asserted by the LLM, never read from a request body or a tool argument. Pool the expensive orchestration (one conductor process, one runtime image, one Postgres instance, one k3s cluster); silo only the bytes a prompt or a customer can perceive (GitHub org/App-install, Marten event database, agent namespace + session, pgvector memory database, secrets, notifications, dashboard queries). This is the cheapest design that still gives hard isolation for a solo-plus-agents team, and it answers the open questions deferred by the two-layer-brain / multi-project user story. Invotek is tenant #0; the first external partner is the real productization milestone.

Current state (as built — 2026-06-13)

The design sections that follow are the original 2026-06-07 plan. Phase-1 is now substantially built and merged. This section is the consolidated operating guide for the system as it runs today; the per-change rationale lives in the decision records dashecorp/rig-conductor/docs/2026-06-*-multi-tenancy-*.md.

How it works now

tenant_id resolution — TenantMatch.Resolve (pure Core): repo → RepoTenant row (authoritative) → installation_id → org login; active-only, reject-unknown (never auto-provisions); default tenant invotek. Resolved once at the webhook edge and carried as a Marten event header — never LLM-asserted, never read from a payload/tool arg.
DB-per-tenant (Marten master-table tenancy, rc#1515) — a control DB rig_control holds the tenant→database registry (mt_tenant_databases) and the allowlist (Tenant/RepoTenant docs via a dedicated single-tenant IControlStore, PR-4L); each tenant’s events live in its own rig_t_<id>_evt database. tenant-0 invotek is the legacy rig_conductor DB until the PR-5 cutover renames it. The database is the tenant boundary (separate-DB tenancy — no row-filtering).
Fail-closed write boundary — RequireTenant (rc#1608): the write session (IDocumentSession) throws UnattributedTenantException on a blank/invalid tenant, so an unattributed unit of work lands in no tenant’s DB. Dashboard reads (IQuerySession) stay lenient (coalesce to invotek). The CI guard scripts/check-tenant-scope-guard.sh fails the build on any new unattributed write path.
Per-tenant dispatch (rc#1481: #1671/#1676/#1679/#1682) — RepoDiscoveryService discovers each active tenant’s repos per-org (assigned via TenantMatch.Resolve); ReviewScanService/IssueScanService iterate tenants (ITenantWorkRunner.ForEachTenantAsync) using GetRepos(tenant) + that tenant’s token (TenantGitHubToken: the tenant’s GitHub-App installation token, else the global PAT for invotek — byte-identical for single-tenant).
Per-installation GitHub tokens (rc#1665) — one shared App (dev-e-bot) mints a short-lived installation token per tenant org. Broker-independent (the broker only later moves the PEM out of env).
Per-tenant memory (rig-memory-mcp, rc#1478) — pgvector DB-per-tenant (rig_t_<id>_mem), fail-closed startup, no shared table (cross-tenant RAG bleed is the #1 risk).
The 3-tenant target (operator-locked) — dashecorp (Model-B host, the dashecorp-org default: Tenant.GithubOrg=dashecorp); invotek (Model-B hosted-for-cost, resolved via RepoTenant rows — today’s tenant-0); stigjohnny (Model-A BYO: own GitHub org + own App installation). Each gets its own Discord channel.
The cutover (PR-5) — renames rig_conductor→rig_t_invotek_evt, protected by the env-gated schema fence (rc#1684/#1685). It is an operator maintenance window, not autonomous.
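
The resolution order described above (RepoTenant row, then installation_id, then org login, active-only, reject-unknown) can be sketched as follows. This is a minimal illustration, not the TenantMatch.Resolve source: the `Tenant` fields, `UnknownTenantError`, the repo name and the installation id are assumptions, and the default-tenant behaviour is left out.

```python
from dataclasses import dataclass
from typing import Iterable, Mapping, Optional


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    github_org: Optional[str] = None       # org login owned by this tenant, if any
    installation_id: Optional[int] = None  # GitHub App installation id, if any
    active: bool = True


class UnknownTenantError(Exception):
    """Raised when nothing on the allowlist matches; nothing is auto-provisioned."""


def resolve_tenant(
    repo: str,
    installation_id: Optional[int],
    org_login: Optional[str],
    tenants: Iterable[Tenant],
    repo_tenant: Mapping[str, str],
) -> str:
    active = {t.tenant_id: t for t in tenants if t.active}  # active-only

    # 1. An explicit repo -> tenant row is authoritative.
    mapped = repo_tenant.get(repo)
    if mapped is not None:
        if mapped in active:
            return mapped
        # Assumption: a row pointing at an inactive tenant rejects, it does not fall through.
        raise UnknownTenantError(f"{repo} maps to an inactive or unknown tenant")

    # 2. Otherwise match the GitHub App installation.
    if installation_id is not None:
        for tenant in active.values():
            if tenant.installation_id == installation_id:
                return tenant.tenant_id

    # 3. Finally fall back to the org login (GitHub logins are case-insensitive).
    if org_login:
        for tenant in active.values():
            if tenant.github_org and tenant.github_org.lower() == org_login.lower():
                return tenant.tenant_id

    raise UnknownTenantError(f"no active tenant for {repo}")


TENANTS = [
    Tenant("dashecorp", github_org="dashecorp"),
    Tenant("invotek"),  # hosted inside the dashecorp org, so resolved via repo rows only
    Tenant("stigjohnny", github_org="stigjohnny", installation_id=4242),  # hypothetical id
]
REPO_TENANT = {"dashecorp/invotek-app": "invotek"}  # hypothetical repo name
```

With this shape, a repo that lives in the dashecorp org but has a RepoTenant row resolves to invotek, because the row outranks the org login.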

Discord channels & notification routing

Rig (operator-internal) channels — unchanged by multi-tenancy:

| Channel | Purpose |
| --- | --- |
| `#tasks` | Conductor-E announcements of every filed issue / PR threads |
| `#admin` | Admin questions, blockers, alerts |
| `#review-e` | Review-E activity; `#review-e-logs` for its raw logs |
| `#codi-e` / `#pi-e` / `#volt-e` / `#ibuild-e` / `#iclaw-e` | Per-agent private channels |

(Full IDs in the dashecorp infra docs docs/infrastructure/agents.md.)

Per-tenant channels (multi-tenancy) — each tenant gets its own channel; routing is a per-tenant decision the conductor makes, fail-closed + forge-proof:

NotificationRoutingPolicy (rc#1643, pure Core) — the tenant comes from the server-resolved event header ONLY (never a payload/LLM field); unattributed / no-sink / mismatch → drop + alarm, never deliver to the wrong tenant.
IDiscordWebhookResolver + EnvDiscordWebhookResolver (rc#1661) — resolves a tenant’s webhook from TENANT_DISCORD_WEBHOOK__<id>; disabled-until-configured; forge-proof (a ref whose tenant-<id> ≠ the caller’s tenant resolves to null, never another tenant’s URL).
TenantNotificationDiscordRelay (rc#1668) — the outbound consumer (per-tenant Valkey cursor → classify severity → resolve sink → POST); a double dormancy gate keeps a tenant provably silent until its webhook is configured; alarms carry tenant + reason + event-type only (no payload leak).
The webhook URL is stored as a broker SecretRef (sops:tenant-<id>/discord-webhook), never raw (broker rc#1479 is partner-gated; today the env var is the seam).
External partners do NOT join the shared Discord server — any guild member can enumerate the channel list + roster, leaking other tenants’ existence. Flip an external tenant’s sink to a per-tenant webhook / Slack Connect (structurally bilateral).
Today: invotek’s tenant-invotek channel exists (Dashecorp Agents guild, 1513515150169083905) but its webhook is not minted yet, so the relay is dormant — invotek’s notifications still flow via the legacy rig channels above. Minting the webhook + storing the SecretRef makes the per-tenant relay live.
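
A rough sketch of the forge-proof, disabled-until-configured resolution described above. The env var name and the `sops:tenant-<id>/discord-webhook` ref shape come from the text. The function names, the return shapes, the tenant-id character set and the assumption that the env var suffix is the tenant id verbatim are illustrative only.

```python
import re
from typing import Mapping, Optional

# Ref shape taken from the text: sops:tenant-<id>/discord-webhook
_REF = re.compile(r"sops:tenant-(?P<tenant>[a-z0-9]+)/discord-webhook")


def resolve_webhook(caller_tenant: str, secret_ref: str, env: Mapping[str, str]) -> Optional[str]:
    """Return the caller's own webhook URL, or None. Never another tenant's URL."""
    match = _REF.fullmatch(secret_ref)
    if match is None or match.group("tenant") != caller_tenant:
        return None  # forged or malformed ref resolves to nothing
    url = env.get(f"TENANT_DISCORD_WEBHOOK__{caller_tenant}", "").strip()
    return url or None  # dormant until the operator configures it


def route(event_tenant: Optional[str], secret_ref: str, env: Mapping[str, str]) -> dict:
    """Fail closed: deliver only to the tenant named in the server-resolved header."""
    if not event_tenant:
        return {"action": "drop", "alarm": {"tenant": None, "reason": "unattributed"}}
    url = resolve_webhook(event_tenant, secret_ref, env)
    if url is None:
        # The alarm carries tenant + reason only, never the event payload.
        return {"action": "drop", "alarm": {"tenant": event_tenant, "reason": "no-sink"}}
    return {"action": "deliver", "url": url}


ENV = {"TENANT_DISCORD_WEBHOOK__stigjohnny": "https://example.invalid/hook"}  # placeholder URL
```

Here invotek stays silent because no env var is set for it, and a stigjohnny event that carries an invotek ref is dropped rather than delivered elsewhere.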

Human gates (how they work now)

Every change passes these gates; each is fail-closed. None can be asserted by an LLM — they are server/operator-side.

Review gate — a PR is routed to Review-E iff its author is a known agent bot (dev-e-bot/ibuild-e-bot) or dependabot, or it carries the needs-review label (the operator opt-in, since the review-e-dashecorp reviewer was retired). Operator-authored PRs MUST add needs-review to be reviewed. (ReviewRoutingPolicy; rc#1635.)
Merge gate — MergeGate is event-driven: it merges only when review-approved AND CI-passed AND not-blocked (24h TTL), into the PR’s base (always main). If those events are missed (e.g. a degraded window), the documented fallback is a manual squash-merge by the operator. Agents never self-merge.
Fail-closed write gate — RequireTenant (rc#1608) + the check-tenant-scope-guard.sh CI guard: an unattributed write throws and CI fails — no event ever lands in the wrong (or no) tenant DB.
Onboarding gate: TenantOnboardingGate.EnsureCanActivate (rc#1493), the single chokepoint that seeds a Tenant row, refuses to mark an external tenant active until its per-tenant erasure prerequisites exist (provisioned per-tenant DB + backup-erasure path + read-model TenantId). tenant-0 invotek is exempt (the controller, already live). This structurally prevents a second live tenant before its data plane exists: the “no-window” invariant.
Schema fence (cutover) — rc#1684/#1685: armed only at the PR-5 cutover window. Armed, boot fails closed (the host stops) if any registered tenant’s schema is missing/mismatched, so a misfired rename can’t silently re-create an empty schema. Default-off and behavior-neutral until armed.
GDPR gate — rc#1486 (LAUNCH BLOCKER): no external tenant launches until the DPA, sub-processor disclosure, EU-residency, and Art.17 erasure path are signed off. The operator confirmed this applies even to the first-party tenants (dashecorp / invotek-hosted / stigjohnny). DPO question pack: rig-conductor/docs/2026-06-08-multi-tenancy-gdpr-dpo-questions.md.
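
The asymmetry of the fail-closed write gate (writes throw, dashboard reads coalesce) can be illustrated with a small sketch. It assumes a lowercase alphanumeric tenant id format and hypothetical function names; it is not the RequireTenant implementation.

```python
import re
from typing import Optional

DEFAULT_READ_TENANT = "invotek"             # reads coalesce here, per the text
_TENANT_ID = re.compile(r"[a-z][a-z0-9]*")  # assumed id format


class UnattributedTenantError(Exception):
    """A write arrived without a valid, server-resolved tenant."""


def _is_valid(tenant_id: Optional[str]) -> bool:
    return isinstance(tenant_id, str) and _TENANT_ID.fullmatch(tenant_id) is not None


def tenant_for_write(tenant_id: Optional[str]) -> str:
    """Write path: throw rather than guess, so the unit of work lands in no database."""
    if not _is_valid(tenant_id):
        raise UnattributedTenantError(f"refusing write for tenant {tenant_id!r}")
    return tenant_id


def tenant_for_read(tenant_id: Optional[str]) -> str:
    """Dashboard read path: lenient, falls back to the default tenant.

    Assumption: anything that fails validation coalesces, not only blank values.
    """
    return tenant_id if _is_valid(tenant_id) else DEFAULT_READ_TENANT
```

The write helper does no trimming or normalisation on purpose: a value that needs repair is treated as unattributed.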

Motivation

The two-layer-brain / multi-project user story established knowledge scoping (an invariant rig brain + per-project brains) but explicitly deferred the isolation half to “when the second tenant onboards”:

conductor-side routing + per-tenant cost attribution;
dispatch scope beyond the hardcoded org:dashecorp;
the undesigned details: access control, secret isolation, per-tenant event / notification boundaries (Discord per customer), and dashboards.

The Rig is being productized: Invotek is tenant #0, and onboarding a second, external partner is the milestone that forces these questions. The hard requirement: a client must never see another client’s code, issues, events, costs, memory, or notifications.

This is not ordinary web-SaaS multi-tenancy. The Rig’s data plane is an LLM context window, and the documented failure mode is organic cross-tenant retrieval/embedding bleed: research on shared multi-tenant corpora has reported that a majority of benign queries can leak content across tenants. The governing principle of this proposal:

!!! warning “The LLM is the threat model, not the guard” Tenant isolation must be a server-enforced data-plane boundary, never an instruction the model is trusted to honor. tenant_id is resolved by the platform and enforced in the database and the token layer. A prompt-injected agent that tries to reach another tenant must hit a 403/404, not a polite refusal.

The model: shared control plane, siloed data plane

Adopt the bridge posture, but push the silo line further down the stack than a typical web SaaS, because anything an agent’s prompt can read — or a customer can perceive — must be isolated.

graph TB
    classDef pool fill:#e3f2fd,color:#000
    classDef silo fill:#c8e6c9,color:#000
    classDef edge fill:#fff3cd,color:#000

    subgraph EDGE["Trusted edge — resolves tenant_id once"]
        WH[GitHub webhook → installation_id]:::edge
        CA[Cloudflare Access identity]:::edge
        SS[Conductor session token]:::edge
    end

    subgraph CP["SHARED control plane (one of each)"]
        CONDUCTOR[rig-conductor process + dispatch loop]:::pool
        RUNTIME[rig-agent-runtime image / Helm]:::pool
        INFRA[KEDA · Flux · Cloudflare · k3s · one Postgres instance]:::pool
        DASHAPP[Dashboard app]:::pool
    end

    subgraph DP["SILOED data plane (per tenant)"]
        GH[GitHub org + App installation]:::silo
        DB[Marten event database rig_t_tenant]:::silo
        NS[Agent k8s namespace + session PVC]:::silo
        MEM[pgvector memory database]:::silo
        SEC[Tenant-prefixed secret refs]:::silo
        SINK[Notification sink]:::silo
    end

    WH --> CONDUCTOR
    CA --> DASHAPP
    SS --> RUNTIME
    CONDUCTOR --> DB
    CONDUCTOR --> GH
    RUNTIME --> NS
    RUNTIME --> MEM
    RUNTIME --> SEC
    CONDUCTOR --> SINK

Rule of thumb: pure orchestration logic carrying no tenant payload is pooled; anything an agent prompt can read or a customer can perceive is siloed. This is correct for ~2–10 tenants; revisit Marten conjoined (tenant-id-column) mode and a dedicated/silo tier only at many tenants or a contractual ask.

!!! danger “Do not build”
    - Per-tenant copies of the conductor or runtime image: maintenance death for a tiny team. Pool the code, silo the data.
    - Shared-schema-with-WHERE tenant_id as the only boundary: the forgotten-filter is the #1 documented leak class.
    - A single Dev-E pod multiplexed across tenants: a long-lived shared context window is exactly the cross-tenant bleed surface.

The keystone: a server-resolved `tenant_id`

One change unblocks everything: a stable tenant_id resolved at a trusted source and used as the connection-resolution key for every layer.

Webhooks → resolve from the GitHub payload installation_id/org before any domain logic. A misrouted webhook lands in no tenant rather than the wrong one.
Dashboard/API → from the Cloudflare Access identity claim, mapped to a DB connection at handler entry (never a query param the caller can tamper with).
Agents → stamped by the conductor into the dispatcher annotation + session token; the agent never resolves or carries it.

Gate auto-provisioning behind a tenants allowlist table so a typo can’t silently create a phantom tenant.
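
One way to picture "resolved once at the edge, threaded immutably" is a frozen context object built from the webhook's `installation.id`, with any tenant value found in the body or in tool arguments ignored. The sketch below is illustrative: `TenantContext`, the installation ids and the helper names are assumptions, not the conductor's types.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Any, Mapping, Optional

# Allowlist keyed by GitHub App installation id (hypothetical ids).
INSTALLATION_TO_TENANT = {1001: "invotek", 2002: "stigjohnny"}


@dataclass(frozen=True)  # frozen: later layers cannot reassign the tenant
class TenantContext:
    tenant_id: str
    source: str  # which trusted edge resolved it


def context_from_webhook(payload: Mapping[str, Any]) -> Optional[TenantContext]:
    """Resolve from installation.id only; a 'tenant_id' key in the body is never read."""
    installation = payload.get("installation")
    installation_id = installation.get("id") if isinstance(installation, Mapping) else None
    if not isinstance(installation_id, int):
        return None
    tenant_id = INSTALLATION_TO_TENANT.get(installation_id)
    if tenant_id is None:
        return None  # unknown installation lands in no tenant, not the wrong one
    return TenantContext(tenant_id=tenant_id, source="github-webhook")


def database_for_tool_call(ctx: TenantContext, tool_args: Mapping[str, Any]) -> str:
    """tool_args may hold a model-written 'tenant_id'; only ctx decides the database."""
    return f"rig_t_{ctx.tenant_id}_evt"
```

A payload that claims `"tenant_id": "stigjohnny"` but arrives on installation 1001 still resolves to invotek, and attempting to reassign `ctx.tenant_id` raises `FrozenInstanceError`.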

Per-layer isolation

| Layer | Posture | Mechanism |
| --- | --- | --- |
| GitHub | silo (free, highest-leverage) | Separate App installation per customer org; installation-scoped 1 h tokens physically cannot read another org. Conductor stores `(tenant_id → org, installationId)`. The shared App private key becomes the top-protected secret (broker-held, rotated). |
| Event store | silo | Marten database-per-tenant on the one Postgres instance (`rig_t_<tenant>`); connection resolved per-request. Projections/dashboard SQL stay tenant-naive (no forgotten-`WHERE`). Also stamp `tenant_id` on the event envelope for self-describing audit + a non-breaking path to conjoined mode. |
| Dispatch / routing | pooled engine, tenant-aware loop | Replace hardcoded `org:dashecorp` (and Review-E’s author search filter) with the `tenants` allowlist the loop iterates; per iteration open that tenant’s DB + mint its installation token. |
| Agent pods | silo namespace, pool image | One k8s namespace per tenant (default-deny cross-namespace NetworkPolicy, per-tenant ServiceAccount + session PVC); KEDA scale-to-zero per persona. One pod = one tenant for its whole lifecycle; hard-wipe the session PVC at handoff. |
| Memory (pgvector) | silo | Database-per-tenant, not soft scope filters (filter-after-retrieval is the documented leak path). If a shared table is unavoidable: Postgres FORCE ROW LEVEL SECURITY + a non-superuser role + session-sourced `tenant_id`. Forbid cross-tenant promotion of “learnings”. |
| Secrets | silo (mostly free) | Tenant-namespaced refs (`gh:tenant-b/…`, `sops:tenants/<id>/…`, `k8s:tenant-<id>/…`). The broker resolves the tenant prefix from the session token and hard-rejects (403) any ref whose prefix ≠ the session’s bound tenant. Per-tenant SOPS age key (kustomize-patched per namespace). |
| Dashboard / API | shared app, tenant-scoped | `tenant_id` from the Access claim → DB connection at handler entry; a platform/admin scope reads aggregate for billing. Access-policy → tenant map managed in OpenTofu. |
| Two-layer brain | fits cleanly | Rig brain = pooled control-plane fact; per-project brain at `<project>-docs.pages.dev/BRAIN.md` = siloed, fetched only with that tenant’s installation token. Never inject tenant-A’s brain into a tenant-B session. |
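
Because the tenant id becomes part of a database name, it helps to validate it strictly before it reaches any connection string. The sketch below assumes the `rig_t_<id>_evt` / `rig_t_<id>_mem` naming used by the as-built system and a lowercase alphanumeric id format; the function name is hypothetical.

```python
import re

_TENANT_ID = re.compile(r"[a-z][a-z0-9]*")  # assumed id format
_KINDS = ("evt", "mem")                     # event store and pgvector memory
_MAX_IDENTIFIER = 63                        # PostgreSQL truncates longer identifiers


def tenant_database(tenant_id: str, kind: str) -> str:
    """Map a validated tenant id to its per-tenant database name."""
    if kind not in _KINDS:
        raise ValueError(f"unknown database kind: {kind!r}")
    if not isinstance(tenant_id, str) or _TENANT_ID.fullmatch(tenant_id) is None:
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    name = f"rig_t_{tenant_id}_{kind}"
    if len(name) > _MAX_IDENTIFIER:
        # Reject instead of letting two long ids truncate to the same database.
        raise ValueError(f"database name too long: {name!r}")
    return name
```

Keeping this mapping in one pure function means projections and dashboard SQL never see a tenant id at all; they only receive a connection that is already scoped.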

Cross-tenant leakage & security (the non-negotiables)

These must ship before any external tenant’s data enters the system:

tenant_id is forge-proof and server-resolved everywhere — derived once at the edge, passed as an immutable context object, never re-derived from a body/tool-arg/LLM output.
Secrets-broker hard-reject (the keystone control) — without it, a single poisoned issue/PR/Discord body telling an agent to fetch gh:tenant-a/… while serving tenant-b is game over; with it, the worst case is a 403.
Hard memory isolation — DB-per-tenant pgvector (or FORCE RLS), shipped before a 2nd tenant’s data lands.
Single-tenant agent sessions — one pod = one tenant; never reuse a session/PVC across tenants; tenant-scoped system prompt + both brains.
Per-tenant GitHub App installation replacing the hardcoded org:dashecorp scope; webhook resolves tenant from installation_id.
A cross-tenant isolation integration test in CI that gates every PR (assert tenant A can never touch tenant B’s DB; feed the notification formatter mixed-tenant events and assert each sink sees only its own).
Event store + merge gate are tenant-scoped — a tenant-B approval can never satisfy a tenant-A PR.
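
The broker hard-reject can be sketched as a pure check of the ref's tenant prefix against the tenant bound to the session token. The ref shapes follow the per-layer table; the function name, the status-code return and the path-traversal rule are assumptions for illustration.

```python
import re

# Ref shapes from the per-layer table: gh:tenant-<id>/..., sops:tenants/<id>/..., k8s:tenant-<id>/...
_PREFIX = {
    "gh": re.compile(r"tenant-([a-z0-9]+)/.+"),
    "k8s": re.compile(r"tenant-([a-z0-9]+)/.+"),
    "sops": re.compile(r"tenants/([a-z0-9]+)/.+"),
}


def authorize_ref(session_tenant: str, ref: str) -> int:
    """Return 200 only when the ref's tenant prefix equals the session's tenant."""
    scheme, sep, path = ref.partition(":")
    pattern = _PREFIX.get(scheme)
    if not sep or pattern is None:
        return 403  # unknown scheme: fail closed
    if any(segment in ("", ".", "..") for segment in path.split("/")):
        return 403  # no traversal out of the tenant prefix
    match = pattern.fullmatch(path)
    if match is None or match.group(1) != session_tenant:
        return 403  # prefix does not match the session-bound tenant
    return 200
```

The session tenant comes from the token, never from the request, so a prompt-injected agent serving tenant b that asks for `gh:tenant-a/...` gets a 403. A CI isolation test could assert exactly that for every ref scheme.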

This proposal covers the runtime tenancy layer; it complements the Agent Secrets Broker proposal (secret lifecycle) and the trust-model / security whitepapers (gates, supply chain).

Client communication & notifications (the Discord question)

Treat “where notifications land” as a per-tenant routing decision the conductor makes, not a Discord topology decision. Add a tenant_notification_sink (tenant_id, sink_type[discord|slack_connect|webhook|email], secret_ref, dashboard_base_url, severity_filter) table and a single background consumer on the existing event stream that filters by event.tenant_id, resolves that tenant’s sink, formats, and delivers (Valkey for retry/dedupe). Onboarding a tenant becomes “insert a config row.”

The dashboard is the system of record; notifications are deep-link-only pointers (never inline another tenant’s repo names/costs).
Invotek / internal → Discord (channel per tenant) — fine while the operator is the only human consumer.
If an external partner’s own staff need access → do NOT invite them to the shared Discord server. Any guild member can enumerate the channel list + member roster, which leaks other tenants’ existence. Flip that tenant’s sink to a per-tenant webhook or Slack Connect channel (structurally bilateral: tenant B is never in tenant A’s channel; developer buyers expect it).
A separate Discord server per client adds bot/role/invite plumbing for zero extra hard isolation (the bot + operator see everything anyway) — only justified by a contractual demand for an isolated space.
Universal fallback that never blocks onboarding: an EU-region email/webhook digest rendered from the same tenant-scoped read models.
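
A minimal sketch of the consumer step: look up the tenant's sink row, apply its severity filter, and render a deep-link-only message. The `Sink` fields mirror the proposed tenant_notification_sink columns. Treating severity_filter as a minimum level, the `/events/<id>` path and the event field names are assumptions.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Tuple

SEVERITIES = ("info", "warning", "critical")  # assumed ordering, lowest first


@dataclass(frozen=True)
class Sink:
    tenant_id: str
    sink_type: str           # discord | slack_connect | webhook | email
    secret_ref: str          # broker ref, never the raw URL
    dashboard_base_url: str
    severity_filter: str     # assumption: minimum severity that gets delivered


def render(event: Mapping[str, str], sinks: Mapping[str, Sink]) -> Optional[Tuple[str, str]]:
    """Return (secret_ref, message) for the event's own tenant, or None to skip."""
    sink = sinks.get(event.get("tenant_id", ""))
    if sink is None:
        return None  # no sink configured for this tenant: nothing is delivered
    if SEVERITIES.index(event["severity"]) < SEVERITIES.index(sink.severity_filter):
        return None
    # Deep link only: the message names the event type, never repo names or costs.
    link = f"{sink.dashboard_base_url.rstrip('/')}/events/{event['id']}"
    return sink.secret_ref, f"[{event['severity']}] {event['type']}: {link}"


SINKS = {
    "invotek": Sink("invotek", "discord", "sops:tenant-invotek/discord-webhook",
                    "https://dashboard.example.invalid/", "warning"),
}
```

Onboarding a tenant then amounts to adding one entry to `SINKS`, which corresponds to the "insert a config row" step in the text.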

Cost & metering

The dominant variable cost is LLM tokens, and the provider invoice (Anthropic) aggregates at org level and cannot be disaggregated — so per-tenant cost must be metered at the application layer. The Rig already emits TOKEN_USAGE / CLI_COMPLETED events carrying repo but not tenant; stamping tenant_id at ingest turns these into true per-tenant COGS, grouped by the cost projection and exposed via ?tenantId= on the cost/usage endpoints. Add a tenant=unknown alarm so unattributed cost never silently becomes margin leak.

Economics: the pooled control plane adds ~zero marginal infra per tenant (same instance, N logical databases; scale-to-zero pods; a free App install). Price cost-plus: a base/seat fee + a metered margin on token spend (~1.3–1.8× pass-through), with per-tenant budgets (hard $ cap) and quotas enforced at the runtime layer so a runaway tenant degrades only itself. Reconcile the provider invoice monthly (caching/rounding drift); a manual GET /api/costs/summary?tenantId= is sufficient for the first few tenants.
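
The metering and pricing arithmetic can be sketched in a few lines. This assumes each usage event carries a hypothetical `cost_usd` field and an optional `tenant_id`; the base fee and token spend in the usage note are illustrative figures, not proposed prices.

```python
from collections import defaultdict
from decimal import Decimal
from typing import Dict, Iterable, Mapping

UNKNOWN = "unknown"


def cost_by_tenant(events: Iterable[Mapping[str, str]]) -> Dict[str, Decimal]:
    """Group usage cost by tenant; events without a tenant land in 'unknown'."""
    totals: Dict[str, Decimal] = defaultdict(Decimal)  # Decimal() is zero
    for event in events:
        tenant = event.get("tenant_id") or UNKNOWN
        totals[tenant] += Decimal(str(event["cost_usd"]))
    return dict(totals)


def has_unattributed_cost(totals: Mapping[str, Decimal]) -> bool:
    """The tenant=unknown alarm: any unattributed spend should page the operator."""
    return totals.get(UNKNOWN, Decimal(0)) > 0


def monthly_invoice(base_fee: Decimal, token_cost: Decimal, markup: Decimal) -> Decimal:
    """Cost-plus pricing: base fee plus marked-up token pass-through."""
    return base_fee + token_cost * markup
```

For example, a base fee of 500, a metered token spend of 200 and a 1.5× markup give `monthly_invoice(Decimal("500"), Decimal("200"), Decimal("1.5"))`, which is 800. `Decimal` is used so that monthly reconciliation against the provider invoice is not skewed by float rounding.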

Compliance (EU / GDPR)

A launch blocker, not a nice-to-have: sending an external tenant’s code to Anthropic without a signed DPA + sub-processor disclosure is a compliance breach before any technical leak.

Residency: pin Postgres / pgvector / k3s to EU regions (already EU — document it).
Paperwork: publish the sub-processor list (Anthropic as LLM processor, GitHub, Cloudflare, GCP) and sign a DPA per external tenant (Rig = processor; tenant = controller of their code/issues).
Right to erasure: DROP the tenant’s event DB + memory DB + PVCs + secrets + notification space. For the immutable event store, pair with crypto-shredding (per-tenant key held outside the store; deletion destroys the key) — but note encrypted PII is still PII, so the actual DB drop is what makes erasure defensible.

Phasing

!!! note “Phase 0 — Tenant-0 retrofit (Invotek; do first; zero behavior change)” Add tenant_id to the event envelope and backfill all existing dashecorp events as tenant_id=invotek (must precede the second tenant’s first event or projections mis-attribute). Stand up the tenants + repo_tenant registry (allowlist) and the webhook tenant-resolver. Group cost/usage projections by tenant_id and add ?tenantId= to the cost/usage endpoints. Wire Invotek as sink_type=discord (existing channels unchanged). Add the tenant=unknown cost alarm. Net: the keystone is in, nothing changes for Invotek.

!!! note “Phase 1 — First external design partner (the productization milestone)” Marten database-per-tenant + connection resolution. Dispatch loop + Review-E filter iterate the tenants allowlist; per-tenant installation tokens; webhook → tenant by installation_id. Partner installs the agent GitHub Apps on their own org. Namespace-per-tenant pods (default-deny NetworkPolicy, per-tenant SA/PVC, hard wipe at handoff). pgvector database-per-tenant. Tenant-prefixed broker refs + hard-reject guard + per-tenant SOPS key. Dashboard tenant_id from the Access claim. Ship the cross-tenant isolation integration test. Notification sink table + outbound consumer (webhook + email-digest); flip the partner to a per-tenant webhook/Slack Connect — not the shared Discord. GDPR pack signed. Give the partner a real (even if discounted) priced contract so metering + invoicing get exercised.

!!! note “Phase 2 — GA / general onboarding” An OpenTofu tenant module so onboarding = one tofu apply + one merge (the manual GitHub App install on the customer org is the one accepted bootstrap step). Per-tenant budgets + quotas. Slack Connect sink on demand. Tiers in the table (shared default; build the dedicated/silo tier only when a regulated/high-ACV customer pays). Later: customer-facing portal, automated invoicing + provider reconciliation, self-serve signup, Marten conjoined mode if tenant count grows large, RLS as belt-and-suspenders inside each tenant DB, formal DPIA + pen-test once 2+ external tenants are live.

Risks

The shared conductor is now the crown-jewel boundary — a tenant-resolution bug (wrong DB connection or installation token) leaks everything; a shared event-store outage hits all tenants. Mitigate: resolve tenant_id once at the edge, pass it immutably, tightest review gate on the conductor, the A-cannot-touch-B test. Acceptable at 2–3 tenants; revisit before scaling.
LLM context bleed remains the highest-severity risk even with infra isolation; it is contained only if agents stay strictly single-tenant per session and never resume across tenants.
Soft pgvector filters in the current rig-memory-mcp are an active liability — ship DB-per-tenant (or FORCE RLS) before tenant #2.
Marten auto-provisioning a DB on first-seen tenant_id is a footgun — gate it behind the allowlist; reject unknown tenant_id.
Discord roster/channel enumeration is the structural reason to keep the shared server operator-internal and flip external partners to webhook/Slack Connect.
GDPR exposure is a launch blocker — missing DPA/sub-processor disclosure precedes any technical leak.
Unattributed cost / provider-invoice drift — any event without tenant_id is eaten margin; the tenant=unknown alarm + monthly reconciliation are mandatory.

Open decisions (owner)

One shared App with a separate installation per tenant (recommended default) vs a separate App per tenant.
Design-partner pricing (recommend a real, even if discounted, priced contract); token markup multiple + base fee.
External-partner notification surface (webhook / Slack Connect / email) — ask the partner; default to webhook/email so onboarding never blocks.
GDPR ownership — who signs the DPA, the published sub-processor list, whether a DPIA is needed before partner #2.
Whether/when to offer a dedicated silo tier, and at what price.
The immutable tenant_id naming convention (becomes DB names, namespace suffixes, audit keys) — bless it before backfill.

Relation to existing work

Answers: two-layer-brain / multi-project user story (its deferred isolation, cost-attribution, and dispatch-scope questions).
Complements: Agent Secrets Broker (extended here with tenant-namespaced refs + hard-reject), plus the trust-model and security whitepapers (gates, supply chain).
