LLM Wiki pattern — Karpathy analysis

What Karpathy proposes

The “LLM Wiki” gist is not a documentation style guide — it’s an architecture pattern where the LLM maintains a knowledge base rather than RAG-retrieving from raw docs.

Concrete techniques

Three-layer architecture:
- raw/ — immutable sources, LLM-read-only
- wiki/ — LLM-owned markdown synthesis
- Schema file (CLAUDE.md or AGENTS.md) defining conventions and workflows
Schema file is the key config. Quoted from the gist:

“a document (e.g. CLAUDE.md for Claude Code or AGENTS.md for Codex) that tells the LLM how the wiki is structured, what the conventions are, and what workflows to follow when ingesting sources, answering questions, or maintaining the wiki.”
Three named operations:
- Ingest: a single new source can touch 10-15 wiki pages
- Query: “good answers can be filed back into the wiki as new pages”
- Lint: health-check for contradictions, stale claims, orphans, missing cross-refs
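The structural half of Lint (orphans and missing cross-refs) needs no LLM call. A minimal sketch, assuming pages cross-reference each other with Obsidian-style [[Page Name]] wikilinks and that a page's name is its file stem; the function names and the set of root pages are hypothetical, not something the gist specifies:

```python
import re
from pathlib import Path

# Assumption: cross-references are [[Page]], [[Page|alias]] or [[Page#section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def load_pages(wiki_dir: str) -> dict[str, str]:
    """Map page name (file stem) to page text for every .md file under wiki_dir."""
    return {p.stem: p.read_text(encoding="utf-8") for p in Path(wiki_dir).rglob("*.md")}


def find_orphans(pages: dict[str, str], roots=frozenset({"index", "log"})) -> list[str]:
    """Pages that no other page links to. Root pages are exempt by assumption."""
    linked = set()
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not rescue a page from being an orphan
                linked.add(target)
    return sorted(p for p in pages if p not in linked and p not in roots)


def find_broken_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """(source page, target) pairs where the link target does not exist."""
    broken = []
    for name, text in pages.items():
        for target in WIKILINK.findall(text):
            if target.strip() not in pages:
                broken.append((name, target.strip()))
    return sorted(broken)
```

Contradictions and stale claims still need a model in the loop; this only covers the checks that are pure graph questions.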
Two special files:
- index.md: content-oriented catalog that the LLM reads first. Karpathy claims this works up to “~100 sources, ~hundreds of pages” and avoids embedding-based RAG entirely.
- log.md: append-only, chronological. Each entry starts with a greppable prefix such as ## [2026-04-02] ingest | Article Title, so grep "^## \[" log.md | tail -5 lists the five most recent entries.
Optional CLI tool escalation: qmd (hybrid BM25/vector + LLM rerank, CLI + MCP) when wiki outgrows index.md.
Frontmatter for Dataview-style queries: “tags, dates, source counts” → dynamic tables.
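Dataview itself is an Obsidian plugin, but the same idea works anywhere frontmatter can be read. A rough sketch, assuming flat key: value frontmatter between --- delimiters; the field names updated, sources, and tags are illustrative guesses at “tags, dates, source counts”, and the parser is deliberately minimal rather than a real YAML parser:

```python
def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse flat `key: value` lines between leading --- delimiters.

    Not a YAML parser: nested structures and multi-line values are kept as raw strings
    or ignored. Returns {} when there is no well-formed frontmatter block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter found


def pages_table(pages: dict[str, str]) -> str:
    """Render a markdown table of pages, most-sourced first (ties broken by name)."""
    rows = []
    for name, text in pages.items():
        meta = parse_frontmatter(text)
        # Assumes `sources`, when present, is a plain integer.
        sources = int(meta.get("sources") or 0)
        rows.append((name, meta.get("updated", ""), sources, meta.get("tags", "")))
    rows.sort(key=lambda r: (-r[2], r[0]))
    out = ["| page | updated | sources | tags |", "| --- | --- | --- | --- |"]
    out += [f"| {n} | {u} | {s} | {t} |" for n, u, s, t in rows]
    return "\n".join(out)
```

A table like this could be regenerated on every ingest and written into index.md, which keeps the dynamic view without depending on Obsidian.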
Wiki is a git repo. Version history is free.

How this applies to the dashecorp rig

Strengthens

Root CLAUDE.md / AGENTS.md as schema file. Karpathy explicitly names these as the config entry point.
Memory MCP ↔ docs split. Karpathy’s raw/ vs wiki/ split is the same shape the rig’s memory-store-vs-git-docs boundary wants to take.

Contradicts (and Karpathy’s call is better for agent rigs)

llms.txt (per llmstxt.org). Karpathy doesn’t use it. His equivalent is index.md (LLM-maintained, categorized summaries). For an agent rig that maintains its own docs, index.md is the better fit because it is LLM-maintained and stays current as the wiki changes.
“One canonical docs/ directory”. Karpathy implies three layers (raw/, wiki/, and a schema file). Collapsing raw and wiki into one tree loses the provenance layer.

Net-new ideas the dashecorp rig is missing

log.md with greppable prefix — cheap audit trail, no DB needed. Dashecorp has none.
Lint as a scheduled operation — contradictions, stale claims, orphans. Dashecorp has supersedes: as a passive field with no process consuming it.
“Query outputs get filed back” — compounding knowledge. Currently Dev-E’s good analyses die in chat history.
Avoid embedding RAG at small scale. Per Karpathy, index.md + grep is enough below a few hundred pages, with no vector DB required. Dashecorp already runs rig-memory-mcp with pgvector, which may be premature.
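Of the ideas above, the supersedes: gap is the easiest to close mechanically. A hedged sketch of a lint pass that consumes the field, assuming supersedes: holds the id of a single replaced doc and that each doc's outgoing links have already been extracted; the Doc shape and function name are hypothetical, not the rig's actual schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    supersedes: Optional[str] = None                 # assumed: id of the doc this one replaces
    links: list[str] = field(default_factory=list)   # ids of docs this one references


def stale_references(docs: list[Doc]) -> list[tuple[str, str, str]]:
    """Return (referrer, superseded_id, replacement_id) for links that still
    point at a doc some other doc has superseded."""
    replaced_by = {d.supersedes: d.doc_id for d in docs if d.supersedes}
    findings = []
    for d in docs:
        for target in d.links:
            replacement = replaced_by.get(target)
            # The replacement doc may legitimately link back to what it replaces.
            if replacement and replacement != d.doc_id:
                findings.append((d.doc_id, target, replacement))
    return sorted(findings)
```

Each finding could be appended to log.md as a lint entry, so the scheduled pass leaves an audit trail even when nothing is fixed automatically.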

Application to this proposal tree

This research/ directory IS the Karpathy raw/ + partial wiki/ pattern:

research/*.md — agent-authored synthesis of external sources (wiki-layer)
External URLs in source_refs: — the raw/ pointer (until we add a raw/ archive layer)
proposals/*.md — decisions informed by research

Diagrams live as Mermaid source inline in markdown (Karpathy’s “assets” treated as code). The Mermaid source is the canonical artifact; no PNG/SVG is committed.

Open questions

Do we mirror Karpathy’s three-layer structure strictly, or keep the simpler two-layer one (research/ + proposals/)? The current rig settled on two-layer plus a separate user-stories/; see proposals/2026-04-18-docs-tooling-decision.
How aggressive should Lint be? LLM-as-judge is expensive, so a weekly schedule is probably right.
How do agents decide when to “file back”? Needs an explicit rule in character prompts.