CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

· Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

CoffeeBench is a new benchmark for evaluating Large Language Model (LLM) agents in long-horizon, heterogeneous multi-agent economic systems. Unlike traditional benchmarks that focus on single agents in passive environments, CoffeeBench simulates a 90-day economy involving two farmers, two roasters, and two retailers. Agents operate their businesses autonomously and aim to maximize cumulative net income through communication and transactions while managing cash, inventory, and pricing. The benchmark evaluates one LLM-controlled coffee roaster against fixed reference agents. Initial evaluations of recent open-weight and proprietary LLMs show that all models surpass a passive baseline and that most achieve positive net income. The analysis shows that higher-performing models communicate more actively, whereas Claude Haiku 4.5 exhibits an "idle-drift" failure mode: it generates coherent assessments but fails to act on them. The code and agent trajectories are publicly released to support further research.

Key takeaway

AI engineers deploying LLM agents in complex multi-agent economic simulations should prioritize agent designs that sustain active communication and robust long-horizon decision-making. The research indicates that passive or "idle-drift" behavior hinders performance even when the agent's own assessments are coherent. Evaluation frameworks should therefore test for sustained interaction and strategic execution over extended periods, so that agents prone to inaction are caught before they are deployed in dynamic environments.
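One way to test for sustained interaction is to measure inactivity directly from an agent's trajectory. The sketch below is illustrative only and does not reflect CoffeeBench's released code or log format: the `DayLog` record, its fields, and both metric functions are hypothetical names, and it assumes one log entry per simulated day.

```python
from dataclasses import dataclass


@dataclass
class DayLog:
    """One simulated day of a hypothetical agent trajectory (assumed format)."""
    day: int
    messages_sent: int  # outbound messages to other agents that day
    transactions: int   # completed purchases or sales that day


def is_idle(log: DayLog) -> bool:
    """A day counts as idle when the agent neither messaged nor transacted."""
    return log.messages_sent == 0 and log.transactions == 0


def longest_idle_streak(logs: list[DayLog]) -> int:
    """Longest run of consecutive idle days (assumes one entry per day)."""
    longest = current = 0
    for log in sorted(logs, key=lambda entry: entry.day):
        if is_idle(log):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def active_day_fraction(logs: list[DayLog]) -> float:
    """Share of days on which the agent took at least one action."""
    if not logs:
        return 0.0
    return sum(1 for log in logs if not is_idle(log)) / len(logs)


# Made-up example trajectory: active on days 1 and 5, idle otherwise.
example = [
    DayLog(day=1, messages_sent=2, transactions=1),
    DayLog(day=2, messages_sent=0, transactions=0),
    DayLog(day=3, messages_sent=0, transactions=0),
    DayLog(day=4, messages_sent=0, transactions=0),
    DayLog(day=5, messages_sent=1, transactions=0),
    DayLog(day=6, messages_sent=0, transactions=0),
]

print(longest_idle_streak(example))
print(round(active_day_fraction(example), 3))
```

A long idle streak alongside otherwise coherent reasoning output would be one possible signal of the inaction pattern described above. Any threshold for flagging it is a judgment call that the summary does not specify.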

Key insights

Active communication is crucial for LLM agents to succeed in long-horizon, multi-agent economic simulations.

Principles

Method

CoffeeBench simulates a 90-day multi-agent economy with two farmers, two roasters, and two retailers. An LLM agent controls one roaster and interacts with fixed reference agents, aiming to maximize cumulative net income by managing cash, inventory, and pricing.
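The setup can be pictured as a daily loop in which a roaster's policy decides what to buy and sell while the simulator tracks cash, inventory, and cumulative net income. The following is a toy sketch under stated assumptions, not the benchmark's implementation: the prices, roasting yield, overhead, demand cap, and both policies are invented for illustration, and the farmers, retailers, and inter-agent communication are collapsed into fixed parameters.

```python
from dataclasses import dataclass

DAYS = 90  # horizon stated in the summary; everything else below is assumed


@dataclass
class Roaster:
    cash: float = 1000.0      # assumed starting cash
    roasted_kg: float = 0.0   # sellable inventory
    net_income: float = 0.0   # cumulative, cash-basis


def passive_policy(state: Roaster, day: int) -> dict:
    """Baseline that never trades."""
    return {"buy_kg": 0.0, "sell_kg": 0.0}


def steady_policy(state: Roaster, day: int) -> dict:
    """Buy a fixed amount daily and offer all roasted stock for sale."""
    return {"buy_kg": 10.0, "sell_kg": state.roasted_kg}


def run_economy(policy, buy_price=4.0, sell_price=9.0,
                roast_yield=0.8, overhead=5.0, daily_demand=20.0) -> float:
    """Simulate one roaster for DAYS days and return cumulative net income."""
    s = Roaster()
    for day in range(1, DAYS + 1):
        action = policy(s, day)  # decision uses start-of-day state

        # Fixed daily operating cost (assumed).
        s.cash -= overhead
        s.net_income -= overhead

        # Sell roasted stock, capped by inventory and retailer demand.
        sold = max(0.0, min(action["sell_kg"], s.roasted_kg, daily_demand))
        s.roasted_kg -= sold
        s.cash += sold * sell_price
        s.net_income += sold * sell_price

        # Buy green beans, capped by available cash, then roast them.
        bought = max(0.0, min(action["buy_kg"], s.cash / buy_price))
        s.cash -= bought * buy_price
        s.net_income -= bought * buy_price
        s.roasted_kg += bought * roast_yield
    return s.net_income


passive_income = run_economy(passive_policy)
steady_income = run_economy(steady_policy)
print(f"passive: {passive_income:.2f}  steady: {steady_income:.2f}")
```

In this toy version the passive policy simply accumulates overhead losses, which gives a reference point for comparing more active policies. How CoffeeBench itself defines its passive baseline is not stated in the summary.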

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.