I got tired of hearing a familiar sentence: “the model forgot.”

Sometimes that is true. Sometimes the benchmark is bad.

So I rebuilt the experiment from scratch to answer one concrete repo question:

If you care about policy recall in real code tasks, is a long single AGENTS.md/CLAUDE.md worse than a TOC + modular docs architecture?

This post replaces my earlier draft with the stronger run.

The practical pain

In production repos, we add instructions like:

  • use testify/require
  • migrate touched assertions
  • this folder is an exception

Then the agent edits tests in a different folder and drifts toward old local style.

That drift is exactly what this benchmark tries to measure.

Background context

Claude Code memory and instruction handling

Claude Code has layered memory and instructions:

  • project instructions (for example CLAUDE.md, .claude/rules/*)
  • optional user/project memory stores
  • runtime context and compaction behavior as conversations grow

This means instruction-following depends on what is active and retrieved at task time, not just what exists somewhere in the repo.

Codex memory and instruction handling

Codex discovers instructions through AGENTS.md files and directory scoping, then relies on thread context for tool calls, file reads, and edits.

In practice, this has two implications for benchmarking:

  • instruction architecture can change what gets read early
  • context growth and tool behavior can dominate outcomes if isolation is weak

So the experiment must control for both document content and execution isolation.

Hypothesis

I tested three hypotheses:

  1. Any explicit docs architecture (mono or toc) beats control on policy recall.
  2. toc and mono may tie on recall, but differ in cost (latency/tokens).
  3. If failures happen, they should concentrate on policy-sensitive tasks, not basic functional correctness.

Experiment design

Scenario

I used the repo_policy_conflict benchmark scenario in benchmark-memory with realistic code-edit tasks in a Go fixture repo.

Task families (a short Go sketch of the target assertion style follows the list):

  • T401: new tests should use require/error assertions
  • T402: legacy folder exception (no testify)
  • T403: touched-file migration to require
  • T404: adapter folder exception (package http_test + require)
  • T405: multi-phase change with conflicting defaults/exceptions
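
To make the target style concrete, here is a minimal sketch of the assertion pattern T401 and T403 grade toward. The package name, the Charge function, and the file contents are illustrative, not the fixture repo's actual code:

package payments

import (
	"testing"

	"github.com/stretchr/testify/require"
)

// Charge is a stand-in for whatever function the fixture test exercises.
func Charge(amount int) (string, error) { return "charged", nil }

func TestCharge(t *testing.T) {
	got, err := Charge(100)
	require.NoError(t, err)          // T401/T403: error assertions use require
	require.Equal(t, "charged", got) // replaces bare if-checks with t.Errorf
}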

Architectures compared

  • control: no policy docs
  • mono: one root doc (CLAUDE.md or AGENTS.md)
  • toc: index file + modular docs (same facts split by topic)

Fairness rule: mono and toc contained the same core facts; only the structure changed.
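
As a concrete (hypothetical) illustration of the two documented architectures, with file names chosen for this sketch rather than taken from the fixture:

mono:
  CLAUDE.md (or AGENTS.md)        all defaults and folder exceptions in one root doc

toc:
  CLAUDE.md (or AGENTS.md)        short index, one line per topic, pointing to the docs below
  docs/policies/testing.md        require/testify defaults for new and touched tests
  docs/policies/exceptions.md     legacy and adapter folder exceptions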

Models and sample size

  • Claude: claude-sonnet-4-5-20250929
  • Codex: gpt-5.3-codex
  • Strong run size: 120 attempts total
    • 60/model
    • 20 attempts per architecture per model (5 tasks x 4 repeats)

Scoring and isolation updates (critical)

I fixed several flaws from earlier runs:

  • added semantic checks to reduce brittle regex-only grading (see the sketch after this list)
  • repaired task patterns that over-penalized equivalent code forms
  • enforced per-attempt isolated workdirs
  • verified no outside-root reads in this final run
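
For illustration, here is a minimal sketch of what a semantic check can look like: parse the edited Go test file and confirm it imports testify/require, instead of regex-matching the raw diff. The path and the exact criterion are assumptions; the benchmark's real grader may check more than imports:

package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"strconv"
)

// usesRequire reports whether a Go file imports testify/require.
func usesRequire(path string) (bool, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, path, nil, parser.ImportsOnly)
	if err != nil {
		return false, err
	}
	for _, imp := range f.Imports {
		p, _ := strconv.Unquote(imp.Path.Value)
		if p == "github.com/stretchr/testify/require" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	ok, err := usesRequire("internal/server/server_test.go") // hypothetical path
	fmt.Println(ok, err)
}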

Observed run health:

  • Claude raw: 60/60 ok
  • Codex raw: 60/60 ok
  • outside-root reads: 0 in both models

Results

Figure 1: success by architecture

Main result table (strong run)

Model   Architecture   Task success   Policy pass   Mean latency   Mean input tokens
Claude  control          5.0%           5.0%        58.2s               58.4
Claude  mono           100.0%         100.0%        72.7s               72.8
Claude  toc            100.0%         100.0%        71.2s               70.0
Codex   control         25.0%          25.0%        69.1s          206,481.9
Codex   mono           100.0%         100.0%        77.7s          235,672.7
Codex   toc            100.0%         100.0%        86.8s          289,331.5

Confidence intervals (Wilson, 95%)

Model   Architecture   Success (k/n)   95% CI
Claude  control          1/20          [0.9%, 23.6%]
Claude  mono            20/20          [83.9%, 100%]
Claude  toc             20/20          [83.9%, 100%]
Codex   control          5/20          [11.2%, 46.9%]
Codex   mono            20/20          [83.9%, 100%]
Codex   toc             20/20          [83.9%, 100%]
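
For reference, the Wilson interval can be reproduced in a few lines; this is a standalone sketch, not the benchmark's analysis code:

package main

import (
	"fmt"
	"math"
)

// wilson returns the 95% Wilson score interval for k successes in n trials.
func wilson(k, n int) (lo, hi float64) {
	const z = 1.959964 // two-sided 95% normal quantile
	p := float64(k) / float64(n)
	nf := float64(n)
	denom := 1 + z*z/nf
	center := (p + z*z/(2*nf)) / denom
	half := z * math.Sqrt(p*(1-p)/nf+z*z/(4*nf*nf)) / denom
	return math.Max(0, center-half), math.Min(1, center+half)
}

func main() {
	lo, hi := wilson(1, 20) // Claude control
	fmt.Printf("[%.1f%%, %.1f%%]\n", 100*lo, 100*hi) // prints [0.9%, 23.6%]
}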

Figure 2: efficiency tradeoff

Takeaway: on this task set, mono and toc tie on recall, but toc is costlier for Codex.

Figure 3: task-level heatmap

Control failures cluster where policy conflicts are explicit (framework choice, folder exceptions, touched-file migration). Functional checks still passed.

Discussion

What this supports strongly

  • Documented policy architecture matters. control underperformed sharply in both models.
  • In this scenario, both mono and toc are highly effective for recall/adherence.
  • For Codex, mono appears more efficient than toc at equal quality.

What this does not prove yet

  • That toc is universally better or worse than mono for recall.
  • That long docs always create harmful noise.

Current interpretation: with a clean policy set and this task complexity, both structured approaches saturate quality. Differences become mostly economic (latency/tokens), especially for Codex.

Remaining validity limits

  • still one repo fixture (Go testing policy domain)
  • all runs are single-turn tasks, even for the multi-phase prompt
  • perfect recall in mono/toc suggests we now need harder discrimination tasks, not only larger N

Next steps

  1. Add longer multi-turn task chains with delayed policy recall.
  2. Add contradictory local-code priors (legacy style nearby) to increase interference pressure.
  3. Include mixed-language repos to test portability.
  4. Add ablations for file naming/discovery (CLAUDE.md, AGENTS.md, both, nested overrides).
  5. Publish preregistered rubric + holdout tasks before next run.

Reproducibility

Primary artifacts for this post (a line-count sanity check follows the list):

  • benchmark-memory/results/run_plan_claude_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/run_plan_codex_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/raw_results_claude_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/raw_results_codex_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/graded_results_claude_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/graded_results_codex_repo_policy_v2_strong_r1.jsonl
  • benchmark-memory/results/summary_claude_repo_policy_v2_strong_r1/summary_by_architecture.csv
  • benchmark-memory/results/summary_codex_repo_policy_v2_strong_r1/summary_by_architecture.csv
  • benchmark-memory/results/summary_claude_repo_policy_v2_strong_r1/summary_by_task.csv
  • benchmark-memory/results/summary_codex_repo_policy_v2_strong_r1/summary_by_task.csv
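
A quick sanity check on run size, assuming each raw results file is newline-delimited JSON with one attempt per line (which is what the .jsonl extension and the 60/60 counts above suggest):

package main

import (
	"bufio"
	"fmt"
	"os"
)

func main() {
	for _, p := range []string{
		"results/raw_results_claude_repo_policy_v2_strong_r1.jsonl",
		"results/raw_results_codex_repo_policy_v2_strong_r1.jsonl",
	} {
		f, err := os.Open(p)
		if err != nil {
			fmt.Println(p, err)
			continue
		}
		sc := bufio.NewScanner(f)
		sc.Buffer(make([]byte, 0, 1<<20), 1<<26) // attempt records can be very long lines
		n := 0
		for sc.Scan() {
			n++
		}
		f.Close()
		fmt.Printf("%s: %d attempts (expected 60)\n", p, n)
	}
}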

Figure generation command:

cd benchmark-memory
python3 scripts/plot_repo_policy_strong_results.py

Sources

  • Claude Code memory docs: https://code.claude.com/docs/en/memory
  • Anthropic context windows: https://docs.anthropic.com/en/docs/build-with-claude/context-windows
  • Codex prompting guide: https://developers.openai.com/codex/prompting/
  • Codex AGENTS.md guide: https://developers.openai.com/codex/guides/agents-md