Do AGENTS.md/CLAUDE.md Files Help Coding Agents? A New Paper Challenges This

· Source: To Data & Beyond · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

A new paper from ETH Zurich and LogicStar.ai, "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", rigorously tests the common practice of adding AGENTS.md or CLAUDE.md files to repositories to guide AI coding agents. The study evaluated four coding agents (Claude Code with Sonnet-4.5, Codex with GPT-5.2 and with GPT-5.1 Mini, and Qwen Code with Qwen3-30B) on two benchmarks, SWE-Bench Lite and a new AGENTBENCH, under three conditions: no context file, an LLM-generated context file, and a human-written context file. Contrary to intuition, LLM-generated context files decreased task success by 2-3%, and human-written files offered only a marginal 4% improvement. Both types of context file increased inference cost by over 20% and increased the number of steps needed to complete tasks. The research found that agents follow the instructions in these files to a fault, using more tools and reasoning tokens without improving outcomes, which suggests that context files mainly help as compensation for missing documentation.

Key takeaway

AI architects and MLOps engineers optimizing coding-agent performance and cost should re-evaluate whether AGENTS.md or CLAUDE.md files are needed. If your repository is already well-documented, skip auto-generated context files to avoid higher inference costs and reduced efficiency. Instead, consider a concise, human-written file that covers only non-obvious instructions, such as specific tool usage (e.g., "use uv, not pip"), or generate one only for undocumented projects, because redundant information hinders agent performance.

Key insights

Context files for coding agents often increase cost and steps without improving task success in well-documented repositories.

Method

The study evaluated four coding agents across two benchmarks (SWE-Bench Lite, AGENTBENCH) and three context conditions: no file, LLM-generated file, and human-written file, measuring task success, inference cost, and tool use.
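The comparison described above can be illustrated with a minimal sketch of how per-run results might be grouped by context condition and compared against the no-file baseline. This is not the paper's evaluation harness: the `Run` record layout, the condition labels ("none", "llm", "human"), and the helper names are assumptions made for illustration, and the sample values are placeholders, not the study's data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One agent attempt at one benchmark task (hypothetical record layout)."""
    condition: str   # "none", "llm", or "human" (assumed labels)
    solved: bool     # whether the agent's change resolved the task
    cost_usd: float  # inference cost of the run
    steps: int       # number of agent steps taken


def summarize(runs):
    """Group runs by context condition and average each metric."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[run.condition].append(run)
    summary = {}
    for condition, items in grouped.items():
        n = len(items)
        summary[condition] = {
            "success_rate": sum(r.solved for r in items) / n,
            "mean_cost_usd": sum(r.cost_usd for r in items) / n,
            "mean_steps": sum(r.steps for r in items) / n,
        }
    return summary


def relative_change(summary, condition, metric, baseline="none"):
    """Relative change of a metric versus the no-context-file baseline."""
    base = summary[baseline][metric]
    return (summary[condition][metric] - base) / base


# Placeholder data for illustration only; these are not the paper's numbers.
runs = [
    Run("none", True, 1.00, 10),
    Run("none", False, 1.00, 10),
    Run("llm", True, 1.25, 12),
    Run("llm", False, 1.25, 14),
]
summary = summarize(runs)
cost_delta = relative_change(summary, "llm", "mean_cost_usd")
```

Reporting each condition relative to the no-file baseline keeps the question focused on what the context file adds: a file that raises cost and steps while leaving the success rate flat shows up directly as positive cost and step deltas with no success delta.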

Best for: AI Architect, MLOps Engineer, Machine Learning Engineer, AI Engineer, Software Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by To Data & Beyond.