Agent Docs Architecture Benchmark

Strong run (R1): 120 attempts total, 60/model, n=20 per architecture.

Data source: benchmark-memory/results/summary_*_repo_policy_v2_strong_r1/*.csv
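The sketch below gathers the CSVs matched by this glob and tallies attempts, so the counts can be checked against the totals stated for the strong run. It is a minimal illustration only. It assumes that each CSV row is one attempt and that the files have columns named `model` and `architecture`. Neither the row layout nor the column names is confirmed by the benchmark output, and the helper names `load_attempts` and `tally` are hypothetical.

```python
import csv
import glob
from collections import Counter

# Glob taken from the data-source line; run from the repository root.
PATTERN = "benchmark-memory/results/summary_*_repo_policy_v2_strong_r1/*.csv"


def load_attempts(pattern: str = PATTERN) -> list[dict[str, str]]:
    """Read every CSV matching `pattern` and return all rows as dicts.

    Assumes (hypothetically) one row per attempt.
    """
    rows: list[dict[str, str]] = []
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as fh:
            rows.extend(csv.DictReader(fh))
    return rows


def tally(rows: list[dict[str, str]]) -> tuple[int, Counter, Counter]:
    """Count attempts overall, per model, and per (model, architecture).

    The `model` and `architecture` column names are assumptions.
    """
    per_model = Counter(r["model"] for r in rows)
    per_cell = Counter((r["model"], r["architecture"]) for r in rows)
    return len(rows), per_model, per_cell


if __name__ == "__main__":
    total, per_model, per_cell = tally(load_attempts())
    print(f"total attempts: {total}")  # the strong run reports 120
    for model, n in sorted(per_model.items()):
        print(f"{model}: {n}")  # the strong run reports 60 per model
    for (model, arch), n in sorted(per_cell.items()):
        print(f"{model} / {arch}: {n}")
```

If the real summary files use different column names, only `tally` needs to change. The loader does not depend on the CSV schema.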