
Production agent docs patterns — Vercel, Cloudflare, HumanLayer, Anthropic

Vercel ran a formal eval measuring agent success rates across documentation strategies on hardened Build/Lint/Test workflows:

| Strategy | Pass rate |
| --- | --- |
| Baseline (no AGENTS.md) | 53% |
| Skills, default loaded | 53% (no improvement) |
| Skills with explicit invocation instructions | 79% |
| AGENTS.md with embedded 8 KB compressed docs index | 100% |

Vercel’s explanation: “Prefer retrieval-led reasoning over pre-training-led reasoning.” The embedded index puts the retrieved facts directly in context: passive context (facts the agent reads without deciding to retrieve) beats active retrieval steps where the agent might mis-route.

Vercel’s own AGENTS.md

  • ~140 lines, 11 sections. Not aspirational prose — a rulebook.
  • Sections: Repository Structure, Essential Commands, Changesets (with mandatory rules), Code Style, Testing Patterns, Package Development, Runtime Packages, CLI Development, Common Pitfalls.
  • Ends with a numbered “Common Pitfalls” list — the literal anti-patterns agents hit in this repo.

Cloudflare (cloudflare/cloudflare-docs/AGENTS.md)

  • 733 lines — an outlier driven by MDX rendering quirks.
  • Verbatim “MDX gotchas — the #1 cause of build failures” table mapping {, }, <, > to specific fixes.
  • 12-item “Common mistakes to avoid” numbered list.
  • CI-vs-local split: “npm run build will time out in CI environments… use npm run check and linters only — do not run a full build.”
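The MDX-gotchas table lends itself to a mechanical pre-commit check. A minimal sketch, assuming HTML-entity replacements for the four characters the table maps ({, }, <, >) — the specific escape choices are placeholders, not Cloudflare’s exact fixes:

```python
# Hypothetical pre-commit helper in the spirit of Cloudflare's "MDX gotchas"
# table: bare {, }, <, > in prose are the #1 cause of MDX build failures,
# so escape them before the MDX compiler parses them as JSX.
MDX_ESCAPES = {
    "{": "&#123;",  # bare brace otherwise starts a JSX expression
    "}": "&#125;",
    "<": "&lt;",    # bare angle bracket otherwise starts a JSX tag
    ">": "&gt;",
}

def escape_mdx_prose(text: str) -> str:
    """Escape characters that MDX would otherwise parse as JSX."""
    return "".join(MDX_ESCAPES.get(ch, ch) for ch in text)
```

Run only on prose regions, not code fences, where the literal characters are intended.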

Vercel Labs (vercel-labs/agent-skills/AGENTS.md)

  • Compiled, not hand-written. Source is individual rule files in the repo; CI composes AGENTS.md from them.
  • Hallucinated facts fail npm run check, not human review.
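The compile step can be sketched in a few lines. The rules/*.md layout and the generated-file header are assumptions about the pattern, not the actual vercel-labs/agent-skills build script:

```python
# Minimal sketch of "compiled, not hand-written": individual rule files are
# the source of truth and CI regenerates AGENTS.md on every PR, so edits to
# the compiled file are overwritten and drift is structurally impossible.
from pathlib import Path

def compile_agents_md(rules_dir: str, out_path: str) -> int:
    parts = ["<!-- GENERATED - edit rules/*.md, not this file -->\n"]
    rule_files = sorted(Path(rules_dir).glob("*.md"))  # sorted = stable order
    for rule in rule_files:
        parts.append(rule.read_text().strip() + "\n")
    Path(out_path).write_text("\n".join(parts))
    return len(rule_files)
```

A CI job would follow this with a `git diff --exit-code AGENTS.md` so any hand edit to the compiled file fails the build.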

CLAUDE.md specifics (HumanLayer, followed by Anthropic internal)

  • ≤ 60 lines. Claude Code’s system prompt already ships ~50 instructions; frontier models track ~150–200 reliably.
  • Everything else goes in agent_docs/*.md loaded on-demand.
  • Claude Code auto-reads CLAUDE.md + .claude/skills/*/SKILL.md. It does NOT auto-read llms.txt, docs/index.md, or frontmatter queries: fields — those are inert to the CLI.
  • Quote: “Never send an LLM to do a linter’s job.” Don’t put code-style rules in CLAUDE.md.
Patterns that won

  1. Embedded docs index in AGENTS.md, not link-out. Skills-style “go fetch X” loses because it adds a decision point agents frequently mis-route.
  2. Numbered “Common Pitfalls” lists at the end — both Vercel and Cloudflare do this; it’s the section agents hit when blocked.
  3. Split CI-vs-local validation matrices.
  4. Schema-validated YAML frontmatter (Cloudflare’s pcx_content_type enum, tag allowlist in src/schemas/tags.ts) — build fails on drift, so drift is prevented by the compiler, not by goodwill.
  5. Two-tier lint in CI — Vale + markdownlint + lychee (Datadog, GitLab, Fern all published this stack).
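Pattern 4 (schema-validated frontmatter) reduces to a hard-fail check before anything renders. A sketch, where the enum values and tag allowlist are placeholders, not Cloudflare’s actual pcx_content_type values or the contents of src/schemas/tags.ts:

```python
# Sketch of "build fails on drift": frontmatter is validated against an
# enum and a tag allowlist, and CI treats any error as a hard build failure,
# so drift is prevented by the compiler, not by goodwill.
CONTENT_TYPES = {"concept", "how-to", "reference", "tutorial"}  # placeholder enum
TAG_ALLOWLIST = {"workers", "pages", "dns"}                     # placeholder allowlist

def validate_frontmatter(fm: dict) -> list[str]:
    errors = []
    if fm.get("pcx_content_type") not in CONTENT_TYPES:
        errors.append(f"invalid pcx_content_type: {fm.get('pcx_content_type')!r}")
    for tag in fm.get("tags", []):
        if tag not in TAG_ALLOWLIST:
            errors.append(f"unknown tag: {tag!r}")
    return errors  # non-empty list => fail the build
```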
Approaches that lost

  1. llmstxt.org — no production rig in the sample adopted it. AGENTS.md won.
  2. Code-style rules in CLAUDE.md: “Never send an LLM to do a linter’s job” (HumanLayer).
  3. Auto-generated /init CLAUDE.md — unanimously treated as a starting draft, not a keeper.
  4. OpenAPI specs as general agent docs — useful for tool surfaces, overkill for project conventions.
Failure modes to watch

  1. Silent rate-limit cascades → hallucinated gap-filling. Petrenko’s 16-agent refactor: “2 of 9 interview agents hit API rate limits and failed silently.” The dashecorp rig’s 529-overloaded rule addresses part of this; extend it: any agent that skipped a doc update must write an explicit “skipped: reason” line.
  2. Schema drift becoming canon. Industry data: ~7 engineer-days/month lost. Cloudflare prevents this by making invalid pcx_content_type or tags hard-fail the build.
  3. Doc index bloat (“context rot”). Past ~8 KB, agent success rate measurably regresses (Vercel’s data, Morph’s research). Surge HQ: 693 lines of hallucinations from uncompressed context.
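The ~8 KB regression point can be turned into a CI guard. A sketch, where the comment markers delimiting the embedded index are an assumed convention, not part of any published spec:

```python
# Sketch of a context-rot guard: fail CI when the embedded docs index in
# AGENTS.md grows past the ~8 KB region where Vercel's eval saw agent
# success rates regress.
def check_index_budget(agents_md: str, budget_bytes: int = 8 * 1024) -> bool:
    start = agents_md.find("<!-- docs-index -->")
    end = agents_md.find("<!-- /docs-index -->")
    if start == -1 or end == -1:
        return True  # no embedded index to police
    index = agents_md[start:end]
    return len(index.encode("utf-8")) <= budget_bytes  # False => fail the build
```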

Three concrete recommendations beyond the public spec docs

  1. Compile, don’t hand-write, AGENTS.md. Vercel’s pattern: canonical facts in facts/*.yaml with JSON-Schema validation, compiled into AGENTS.md on each PR. Hallucinated facts fail npm run check.
  2. Split into three files with hard size budgets. AGENTS.md ≤ 150 lines; CLAUDE.md ≤ 60 lines; docs/agent-runbooks/*.md for task-specific, referenced by file:line. Enforce with CI. Memory MCP stays separate.
  3. Two-tier lint: deterministic first, LLM-as-judge second, temperature=0. markdownlint + vale + lychee every commit; scheduled job runs a different Claude instance over docs with a fixed rubric (contradictions, stale claims, orphans, factual drift against git log since last lint). Evidently/Arize guidance: “strict separation between generation and evaluation.”
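Recommendation 2’s “enforce with CI” is a one-function check. A minimal sketch with the budgets from the text (AGENTS.md ≤ 150 lines, CLAUDE.md ≤ 60); the function name and failure format are assumptions:

```python
# Sketch of hard per-file line budgets enforced in CI: a non-empty return
# value fails the job, so size budgets hold by construction rather than
# by review-time vigilance.
from pathlib import Path

BUDGETS = {"AGENTS.md": 150, "CLAUDE.md": 60}

def check_line_budgets(repo_root: str) -> list[str]:
    failures = []
    for name, limit in BUDGETS.items():
        path = Path(repo_root) / name
        if not path.exists():
            continue  # file absent => nothing to enforce
        lines = path.read_text().count("\n") + 1
        if lines > limit:
            failures.append(f"{name}: {lines} lines (budget {limit})")
    return failures
```

Anything that would push a file over budget gets moved to docs/agent-runbooks/*.md instead of negotiated past the limit.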