CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies
Summary
CoffeeBench is a new benchmark designed to evaluate Large Language Model (LLM) agents in long-horizon, heterogeneous multi-agent economic systems. Unlike traditional benchmarks that focus on single agents in passive environments, CoffeeBench simulates a 90-day economy involving two farmers, two roasters, and two retailers. Agents autonomously operate their businesses, aiming to maximize cumulative net income through communication and transactions, while managing cash, inventory, and pricing. The benchmark evaluates one LLM-controlled coffee roaster against fixed reference agents. Initial evaluations of recent open-weight and proprietary LLMs show all models surpass a passive baseline, with most achieving positive net income. Analysis revealed higher-performing models engage in more active communication, whereas Claude Haiku 4.5 exhibited an "idle-drift" failure mode, characterized by inaction despite generating coherent assessments. The code and agent trajectories are publicly released to support further research.
Key takeaway
AI engineers deploying LLM agents in multi-agent economic simulations should prioritize designs that sustain active communication and long-horizon decision-making. The results suggest that passivity can hold back performance even when an agent's reasoning is sound: Claude Haiku 4.5 drifted into inaction ("idle-drift") despite generating coherent assessments. Evaluation frameworks should therefore test for sustained interaction and follow-through over extended periods, so that agents prone to inaction are caught before deployment.
Key insights
In the initial evaluations, higher-performing models communicated more actively with the other agents, while coherent assessments alone did not guarantee action over the 90-day horizon.
Principles
- Multi-agent economic systems demand communication and negotiation.
- Long-horizon tasks reveal agent failure modes like "idle-drift."
- Heterogeneous agent roles add complexity to economic interactions.
Method
CoffeeBench simulates a 90-day multi-agent economy with two farmers, two roasters, and two retailers. The LLM under evaluation controls one roaster and interacts with fixed reference agents, communicating and transacting to maximize cumulative net income while managing cash, inventory, and pricing.
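To make the scoring idea concrete, here is a minimal sketch of a 90-day roaster ledger that tracks cash, inventory, and cumulative net income. It is not CoffeeBench's actual code or API: the names (RoasterLedger, run_episode, passive_policy, restock_policy), the prices, the daily overhead, and the cash-basis accounting are all assumptions chosen for illustration, and the real benchmark also involves negotiation with other agents.

```python
from dataclasses import dataclass


@dataclass
class RoasterLedger:
    """Hypothetical ledger for one roaster (not the benchmark's data model)."""
    cash: float
    inventory_kg: float = 0.0
    revenue: float = 0.0
    costs: float = 0.0

    @property
    def net_income(self) -> float:
        # Cash-basis simplification: purchases are expensed when bought.
        return self.revenue - self.costs


def passive_policy(day, ledger):
    """Baseline that never trades."""
    return {"buy_kg": 0.0, "sell_kg": 0.0}


def restock_policy(day, ledger):
    """Sells everything in stock, then buys a fixed 10 kg."""
    return {"buy_kg": 10.0, "sell_kg": ledger.inventory_kg}


def run_episode(policy, days=90, buy_price=4.0, sell_price=9.0,
                daily_overhead=5.0, starting_cash=1000.0):
    """Run one episode; all prices and costs are made-up constants."""
    ledger = RoasterLedger(cash=starting_cash)
    for day in range(days):
        action = policy(day, ledger)

        # Sell first: cannot sell more than is in stock.
        sold = min(action["sell_kg"], ledger.inventory_kg)
        ledger.inventory_kg -= sold
        ledger.cash += sold * sell_price
        ledger.revenue += sold * sell_price

        # Then buy: cannot spend more cash than is available.
        bought = min(action["buy_kg"], max(ledger.cash, 0.0) / buy_price)
        ledger.inventory_kg += bought
        ledger.cash -= bought * buy_price
        ledger.costs += bought * buy_price

        # Overhead accrues every day, even when the agent does nothing.
        ledger.cash -= daily_overhead
        ledger.costs += daily_overhead
    return ledger


passive = run_episode(passive_policy)
active = run_episode(restock_policy)
print(passive.net_income, active.net_income)
```

Under these assumed constants, the passive policy loses money purely through overhead, which illustrates why an idle agent can fall behind even a simple trading rule.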
In practice
- Benchmark LLM agents in multi-party negotiation.
- Prioritize active communication in agent design.
- Test for "idle-drift" in autonomous agent systems.
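One way to test for "idle-drift" is to scan an agent trajectory for long runs of days with no outgoing messages and no transactions. The sketch below assumes a hypothetical per-day record with "messages_sent" and "transactions" counts; the format of the released CoffeeBench trajectories may differ, and the 7-day threshold is an arbitrary choice, not a value from the paper.

```python
def longest_idle_streak(days):
    """Return the longest run of consecutive days with no actions.

    `days` is a sequence of dicts with hypothetical keys
    "messages_sent" and "transactions" (missing keys count as 0).
    """
    longest = 0
    current = 0
    for day in days:
        acted = day.get("messages_sent", 0) > 0 or day.get("transactions", 0) > 0
        current = 0 if acted else current + 1
        longest = max(longest, current)
    return longest


def has_idle_drift(days, min_streak=7):
    """Flag a trajectory whose idle streak reaches `min_streak` days."""
    return longest_idle_streak(days) >= min_streak


# Example trajectory: one active day, three idle days, one trade, two idle days.
trajectory = (
    [{"messages_sent": 1, "transactions": 0}]
    + [{"messages_sent": 0, "transactions": 0}] * 3
    + [{"messages_sent": 0, "transactions": 1}]
    + [{"messages_sent": 0, "transactions": 0}] * 2
)
print(longest_idle_streak(trajectory), has_idle_drift(trajectory))
```

A check like this only looks at actions taken, so it will flag an agent that keeps producing coherent assessments but never sends a message or completes a trade.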
Topics
- CoffeeBench
- LLM Agents
- Multi-Agent Systems
- Economic Simulation
- Long-Horizon Planning
- Agent Communication
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.