Does AGENTS.md Actually Help Coding Agents?

2025-07-05 · Source: AI Newsletter · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, medium

Summary

A new paper from ETH Zurich's SRI Lab, "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?", tests whether repository-level context files (such as CLAUDE.md or AGENTS.md) actually help coding agents. The study evaluated Claude Code (Sonnet-4.5), Codex (GPT-5.2 and GPT-5.1 mini), and Qwen Code (Qwen3-30b-coder) on hundreds of real GitHub issues, using both the standard SWE-bench Lite and a new benchmark, AGENTbench, which comprises 138 instances from 12 less-popular Python repositories that already contain developer-written context files.

The key findings: LLM-generated context files reduced task success rates by 0.5% on SWE-bench Lite and 2% on AGENTbench while increasing inference cost by over 20%. Human-written context files, by contrast, improved success rates by 4% on average across both benchmarks, but still cost 14-22% more reasoning tokens and 2-4 additional steps per task. The core difference lies in redundancy: LLM-generated files often duplicate existing documentation, whereas human-written files provide unique, non-obvious information.

Key takeaway

AI scientists and machine learning engineers who design or implement coding agents should critically evaluate what goes into their context files. Prioritize human-written files that provide specific, non-redundant information about project quirks or non-default tooling. Avoid LLM-generated context files that merely rehash existing documentation: these can lower success rates and raise inference cost by over 20% without any tangible benefit.

Key insights

Context files for coding agents are beneficial only when they provide unique, non-redundant information.

Principles

Instruction-following does not guarantee task success.
Redundant context files increase cost without improving performance.

Method

The study introduced AGENTbench, a benchmark of 138 real-world instances drawn from 12 Python repositories with developer-written context files, and used it to compare coding agents running with and without context files.
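
The with/without comparison comes down to a difference in resolved-task rates between conditions. A minimal sketch, assuming each run is logged as a (condition, resolved) pair. The condition names, function names, and toy outcomes below are hypothetical and are not data or code from the paper.

```python
from collections import defaultdict


def success_rates(runs):
    """Return {condition: fraction of resolved tasks} from (condition, resolved) pairs."""
    totals = defaultdict(int)
    resolved = defaultdict(int)
    for condition, ok in runs:
        totals[condition] += 1
        resolved[condition] += int(ok)
    return {c: resolved[c] / totals[c] for c in totals}


def delta_vs_baseline(rates, baseline="no_context"):
    """Percentage-point change of each condition relative to the baseline condition."""
    base = rates[baseline]
    return {c: round((r - base) * 100, 1) for c, r in rates.items() if c != baseline}


# Hypothetical toy outcomes for illustration only (not results from the paper).
runs = (
    [("no_context", True)] * 5 + [("no_context", False)] * 5
    + [("llm_generated", True)] * 4 + [("llm_generated", False)] * 6
    + [("human_written", True)] * 6 + [("human_written", False)] * 4
)
rates = success_rates(runs)
deltas = delta_vs_baseline(rates)
print(deltas)
```

A real evaluation would also track token counts and step counts per run, since the reported trade-off is between success rate and inference cost.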

In practice

Focus context files on non-obvious tooling and conventions.
Avoid LLM-generated context files that duplicate existing documentation.
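
One rough way to spot a redundant context file is to measure how many of its lines already appear in the existing documentation. The sketch below is an illustrative heuristic, not the paper's methodology. The function names and the sample README and AGENTS.md contents are assumptions made up for the example.

```python
import re


def _normalize(line):
    """Lowercase and collapse non-alphanumeric runs so near-identical lines compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", line.lower()).strip()


def redundant_fraction(context_text, docs_text):
    """Fraction of non-empty context-file lines that already appear in the existing docs."""
    doc_lines = {_normalize(line) for line in docs_text.splitlines()}
    doc_lines.discard("")
    ctx_lines = [n for n in map(_normalize, context_text.splitlines()) if n]
    if not ctx_lines:
        return 0.0
    return sum(n in doc_lines for n in ctx_lines) / len(ctx_lines)


# Hypothetical file contents for illustration only.
readme = "Install with pip install -e .\nRun tests with pytest\n"
agents_md = (
    "Run tests with pytest\n"                  # duplicates the README
    "Use uv run, not pip, for all commands\n"  # non-obvious, project-specific
)
score = redundant_fraction(agents_md, readme)
print(score)
```

Exact line matching only catches verbatim duplication; a paraphrased rehash of the README would need a fuzzier comparison. A high score still suggests the file adds cost without adding information.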

Topics

Coding Agents
Context Files
LLM Evaluation
Software Engineering Benchmarks
Instruction Following

Code references

eth-sri/agentbench

Best for: Machine Learning Engineer, AI Scientist, Research Scientist, AI Engineer, Software Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Newsletter.