CoffeeBench: A Long-Term Task Benchmark for LLM Agents in a Multi-Agent Economic Environment

Source: Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems) · Depth: Advanced, medium

Summary

On June 26, 2026, Sakana AI and Azusa Audit Corporation released CoffeeBench, a benchmark for evaluating the long-term management capabilities of LLM agents in a multi-agent economic environment. In the simulation, agents run a coffee roasting business for 90 days and try to turn a profit by trading with virtual farmers and retailers. Experiments with models including GPT-5.5 and Claude Opus 4.7 showed large performance gaps between models, and the high-performing agents were the ones that negotiated and communicated actively. Claude Haiku 4.5, by contrast, exhibited a "stagnation phenomenon": it analyzed the situation but failed to act. CoffeeBench is also intended as a framework for studying agent behavior, including potentially fraudulent activity, in anticipation of LLM agents taking on roles in corporate management. The research is scheduled for presentation at the ICML 2026 workshop "Failure Modes in Agentic AI".

Key takeaway

Directors of AI/ML evaluating LLM agents for corporate management should expect current models to vary widely in long-term decision-making and proactive engagement. Multi-agent economic simulations such as CoffeeBench can surface model-specific behavioral patterns, including "thought-action" discrepancies in which an agent analyzes a situation but does not act on it. When comparing agents, look for consistent proactive communication and transaction execution, the behaviors that characterized the high performers in the CoffeeBench experiments.

Key insights

The benchmark reveals LLM agents' long-term decision-making capabilities and behavioral patterns in complex economic simulations.

Method

CoffeeBench simulates a 90-day coffee supply chain with six LLM agents (farmers, roasters, retailers) interacting via tools for transactions, negotiations, and inventory management to maximize net profit.
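
The setup described above can be illustrated with a toy loop. This is a minimal sketch under stated assumptions, not the CoffeeBench implementation: the split of the six agents into two farmers, two roasters, and two retailers, the fixed prices, the starting cash, the daily harvest, and the names `trade`, `eager`, `stalled`, and `simulate` are all hypothetical, and a hard-coded policy function stands in for each LLM roaster. It shows only how a roaster that analyzes but never transacts ends the 90 days with zero net profit.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

DAYS = 90            # simulation length given in the summary
START_CASH = 1000.0  # assumed starting capital per agent
HARVEST = 10         # assumed daily harvest per farmer, in units
BEAN_PRICE, ROAST_PRICE, SHELF_PRICE = 2.0, 5.0, 8.0  # assumed fixed prices


@dataclass
class Agent:
    name: str
    role: str  # "farmer", "roaster", or "retailer"
    cash: float = START_CASH
    stock: int = 0

    @property
    def net_profit(self) -> float:
        return self.cash - START_CASH


def trade(seller: Agent, buyer: Agent, qty: int, price: float) -> bool:
    """Hypothetical transaction tool: stock moves one way, cash the other."""
    cost = qty * price
    if qty <= 0 or seller.stock < qty or buyer.cash < cost:
        return False  # rejected: nothing requested, no stock, or no funds
    seller.stock -= qty
    buyer.stock += qty
    seller.cash += cost
    buyer.cash -= cost
    return True


# A policy stands in for the LLM roaster: it returns how much to buy today.
Policy = Callable[[Agent, Agent], int]


def eager(roaster: Agent, farmer: Agent) -> int:
    return farmer.stock  # buys everything on offer


def stalled(roaster: Agent, farmer: Agent) -> int:
    return 0  # never places an order, mimicking "stagnation"


def simulate(policies: List[Policy]) -> Dict[str, float]:
    """Run one farmer -> roaster -> retailer chain per roaster policy."""
    chains = [
        (Agent(f"farmer_{i}", "farmer"),
         Agent(f"roaster_{i}", "roaster"),
         Agent(f"retailer_{i}", "retailer"),
         policy)
        for i, policy in enumerate(policies)
    ]
    for _ in range(DAYS):
        for farmer, roaster, retailer, policy in chains:
            farmer.stock += HARVEST
            qty = policy(roaster, farmer)
            if trade(farmer, roaster, qty, BEAN_PRICE):
                trade(roaster, retailer, qty, ROAST_PRICE)
            # Assumption: the retailer sells all stock to end customers daily.
            retailer.cash += retailer.stock * SHELF_PRICE
            retailer.stock = 0
    return {a.name: a.net_profit for chain in chains for a in chain[:3]}


profits = simulate([eager, stalled])  # two chains, six agents in total
for name, profit in profits.items():
    print(f"{name}: {profit:.0f}")
```

In the real benchmark, prices and counterparties are presumably negotiated rather than fixed; the sketch keeps them constant so that the only difference between the two chains is whether the roaster acts.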

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by Blog.