CoffeeBench: Benchmarking Long-Horizon LLM Agents in Heterogeneous Multi-Agent Economies

2026-06-15 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems · Depth: Advanced, quick

Summary

CoffeeBench is a new benchmark designed to evaluate Large Language Model (LLM) agents in long-horizon, heterogeneous multi-agent economic systems. Unlike traditional benchmarks that focus on single agents in passive environments, CoffeeBench simulates a 90-day economy involving two farmers, two roasters, and two retailers. Agents autonomously operate their businesses, aiming to maximize cumulative net income through communication and transactions, while managing cash, inventory, and pricing. The benchmark evaluates one LLM-controlled coffee roaster against fixed reference agents. Initial evaluations of recent open-weight and proprietary LLMs show all models surpass a passive baseline, with most achieving positive net income. Analysis revealed higher-performing models engage in more active communication, whereas Claude Haiku 4.5 exhibited an "idle-drift" failure mode, characterized by inaction despite generating coherent assessments. The code and agent trajectories are publicly released to support further research.

Key takeaway

AI engineers deploying LLM agents in complex, multi-agent economic simulations should prioritize agent designs that sustain active communication and follow through on long-horizon decisions. The research suggests that passive or "idle-drift" behavior, where an agent produces coherent assessments but fails to act on them, hinders performance. Evaluation frameworks should specifically test for sustained interaction and strategic execution over extended periods, so that agents prone to inaction in dynamic environments are caught before deployment.

Key insights

Active communication is crucial for LLM agents to succeed in long-horizon, multi-agent economic simulations.

Principles

Multi-agent economic systems demand communication and negotiation.
Long-horizon tasks reveal agent failure modes like "idle-drift."
Heterogeneous agent roles add complexity to economic interactions.

Method

CoffeeBench simulates a 90-day multi-agent economy with farmers, roasters, and retailers. An LLM agent controls one roaster, interacting with fixed agents to maximize net income by managing cash, inventory, and pricing.
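To make the bookkeeping in this setup concrete, here is a minimal Python sketch of a single roaster tracked over the 90-day horizon. It is not the CoffeeBench code: the class, the method names, the prices, and both policies are hypothetical and chosen only for illustration, and the farmers and retailers are reduced to fixed prices rather than negotiating counterparts.

```python
from dataclasses import dataclass
from typing import Callable

HORIZON_DAYS = 90  # horizon stated in the summary; everything else is assumed


@dataclass
class RoasterLedger:
    """Hypothetical bookkeeping for one roaster (field names are illustrative)."""
    cash: float
    green_kg: float = 0.0    # unroasted beans bought from farmers
    roasted_kg: float = 0.0  # roasted beans available to sell to retailers
    net_income: float = 0.0  # cumulative revenue minus cumulative purchase cost

    def buy_green(self, kg: float, unit_price: float) -> bool:
        """Buy green beans if the roaster can afford them."""
        cost = kg * unit_price
        if kg <= 0 or cost > self.cash:
            return False
        self.cash -= cost
        self.green_kg += kg
        self.net_income -= cost
        return True

    def roast(self, kg: float) -> float:
        """Convert green beans to roasted beans (assumed 1:1, no roasting cost)."""
        kg = min(kg, self.green_kg)
        self.green_kg -= kg
        self.roasted_kg += kg
        return kg

    def sell_roasted(self, kg: float, unit_price: float) -> float:
        """Sell up to `kg` of roasted beans and return the revenue."""
        kg = min(kg, self.roasted_kg)
        revenue = kg * unit_price
        self.roasted_kg -= kg
        self.cash += revenue
        self.net_income += revenue
        return revenue


Policy = Callable[[int, RoasterLedger], None]


def passive_policy(day: int, ledger: RoasterLedger) -> None:
    """Do-nothing baseline: never buys, roasts, or sells."""
    return None


def active_policy(day: int, ledger: RoasterLedger) -> None:
    """Toy fixed policy: buy 10 kg at 4.0, roast it, sell everything at 9.0."""
    if ledger.buy_green(kg=10.0, unit_price=4.0):
        ledger.roast(10.0)
    ledger.sell_roasted(ledger.roasted_kg, unit_price=9.0)


def run_episode(policy: Policy, start_cash: float = 1000.0) -> RoasterLedger:
    """Step the policy once per simulated day and return the final ledger."""
    ledger = RoasterLedger(cash=start_cash)
    for day in range(1, HORIZON_DAYS + 1):
        policy(day, ledger)
    return ledger


if __name__ == "__main__":
    print(run_episode(passive_policy).net_income)
    print(run_episode(active_policy).net_income)
```

In this toy setup the do-nothing policy ends with zero net income and the fixed buy-roast-sell policy ends with 4500.0. Those figures follow only from the made-up prices and are not CoffeeBench results; in the benchmark itself the roaster's decisions come from an LLM that communicates and transacts with the fixed reference agents.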

In practice

Benchmark LLM agents in multi-party negotiation.
Prioritize active communication in agent design.
Test for "idle-drift" in autonomous agent systems.
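One way to test for "idle-drift" is to scan an agent trajectory for long runs of days on which the agent wrote an assessment but took no action. The Python sketch below assumes a hypothetical per-day record with an assessment string and a list of actions; the record layout, the function names, and the 7-day threshold are assumptions for illustration and do not reflect the format of the released CoffeeBench trajectories.

```python
from dataclasses import dataclass, field


@dataclass
class DayRecord:
    """Hypothetical per-day log entry; not the CoffeeBench trajectory schema."""
    day: int
    assessment: str                                   # text the agent generated
    actions: list[str] = field(default_factory=list)  # messages or trades issued


def longest_idle_streak(records: list[DayRecord]) -> int:
    """Longest run of consecutive days with an assessment but no action."""
    longest = current = 0
    for rec in records:
        idle = bool(rec.assessment.strip()) and not rec.actions
        current = current + 1 if idle else 0
        longest = max(longest, current)
    return longest


def shows_idle_drift(records: list[DayRecord], min_streak: int = 7) -> bool:
    """Flag a trajectory whose longest idle streak reaches `min_streak` days.

    The 7-day default is an arbitrary illustrative threshold.
    """
    return longest_idle_streak(records) >= min_streak


if __name__ == "__main__":
    # Three active days followed by ten days of assessment without action.
    trajectory = [
        DayRecord(d, "Margins look healthy.", ["offer:retailer_1"])
        for d in range(1, 4)
    ]
    trajectory += [
        DayRecord(d, "Inventory is low; I should restock soon.")
        for d in range(4, 14)
    ]
    print(longest_idle_streak(trajectory), shows_idle_drift(trajectory))
```

Counting only days that contain an assessment keeps the check aligned with the failure mode as described, which is inaction despite coherent reasoning, rather than flagging days on which the agent produced no output at all.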

Topics

CoffeeBench
LLM Agents
Multi-Agent Systems
Economic Simulation
Long-Horizon Planning
Agent Communication

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.