KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Summary
KWBench (Knowledge Work Bench) is a new benchmark that evaluates whether large language models (LLMs) can recognize the underlying problem in a professional scenario without being told what it is, a critical step often missed by existing benchmarks, which focus on task execution. It comprises 223 tasks derived from real-world incidents in domains such as acquisitions, contract negotiations, and clinical pharmacy, each encoding a formal game-theoretic pattern such as principal-agent conflict or a signaling game. Models receive raw data and a task prompt with no hint about the problem type, and a code interpreter is always available. Scoring uses a three-tier rubric with a mandatory gate: the core criteria encode predicted wrong paths, and failing any one of them results in a score of zero. Across 16 models from 10 organizations, the best model passes only 27.9% of tasks, and the top two models agree on just 31.7% of their passes. This indicates a significant gap in unprompted problem recognition: models often produce polished, confident output that addresses the wrong problem.
Key takeaway
AI engineers building autonomous agents for knowledge work should recognize that current frontier LLMs exhibit a "cooperative default" and struggle with unprompted adversarial reasoning. To avoid confident but fundamentally misframed analyses in critical professional contexts, systems should incorporate dynamic routing across diverse models or explicit training signals that reward game-theoretic thinking, rather than relying solely on instruction-following or larger models.
Key insights
LLMs excel at execution but largely fail at unprompted problem recognition in complex, imperfect-information professional scenarios.
Principles
- Knowledge work involves imperfect information games.
- Problem recognition is distinct from execution quality.
- LLM capabilities are distributed across models, not concentrated in a single one.
Method
KWBench evaluates LLMs on 223 real-world tasks, requiring unprompted recognition of game-theoretic patterns from raw inputs. A mandatory gate zeroes scores if core problem framing is missed, even with otherwise polished output.
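The gating rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's scoring code: the summary only states that there are three tiers and that failing any core criterion zeroes the score, so the tier names `secondary` and `polish`, the weights, and the choice to score only the two non-core tiers once the gate is passed are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RubricResult:
    """Per-task rubric outcome. Tier names other than `core` are hypothetical."""
    core: list[bool]       # gate criteria: every one must pass
    secondary: list[bool]  # assumed second tier
    polish: list[bool]     # assumed third tier


def fraction_passed(flags: list[bool]) -> float:
    """Share of criteria passed; an empty tier contributes 0.0."""
    return sum(flags) / len(flags) if flags else 0.0


def gated_score(result: RubricResult,
                w_secondary: float = 0.75,
                w_polish: float = 0.25) -> float:
    """Return 0.0 if any core criterion fails, else a weighted tier score.

    The weights are illustrative assumptions, not values from the benchmark.
    """
    if not all(result.core):
        return 0.0  # mandatory gate: the problem framing was missed
    return (w_secondary * fraction_passed(result.secondary)
            + w_polish * fraction_passed(result.polish))


# Polished output that misses one core criterion still scores zero.
misframed = RubricResult(core=[True, False], secondary=[True, True], polish=[True])
# Correct framing with partial polish earns partial credit.
well_framed = RubricResult(core=[True, True], secondary=[True, True], polish=[True, False])
```

The point of the gate is visible in `misframed`: it passes every non-core criterion, yet `gated_score` returns zero because one core criterion failed.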
In practice
- Relying on a single LLM for complex knowledge work is inadequate.
- Ensemble LLM architectures can significantly improve task coverage.
- Explicitly train LLMs for adversarial counterparty modeling.
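To make the ensemble point concrete, the sketch below compares per-model pass sets. It is a toy illustration under stated assumptions: the task IDs and pass sets are invented, and overlap is measured as intersection over union, which may not match how the paper defines agreement between models.

```python
def pass_rate(passed: set[str], n_tasks: int) -> float:
    """Fraction of all tasks that a single model passes."""
    return len(passed) / n_tasks


def ensemble_coverage(pass_sets: list[set[str]], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model (an oracle-routing upper bound)."""
    covered = set().union(*pass_sets) if pass_sets else set()
    return len(covered) / n_tasks


def pass_overlap(a: set[str], b: set[str]) -> float:
    """Shared passes divided by combined passes (Jaccard index, an assumed metric)."""
    combined = a | b
    return len(a & b) / len(combined) if combined else 0.0


# Hypothetical data: 10 tasks, two models that each pass 3 of them.
N_TASKS = 10
model_a = {"t1", "t2", "t3"}
model_b = {"t3", "t4", "t5"}

rate_a = pass_rate(model_a, N_TASKS)                       # 3 of 10 tasks
coverage = ensemble_coverage([model_a, model_b], N_TASKS)  # 5 of 10 tasks
overlap = pass_overlap(model_a, model_b)                   # 1 shared of 5 combined
```

When overlap is low, as in this toy data, the union of passes is much larger than either model's own pass set, which is why routing across diverse models can raise coverage. The union is an upper bound: a real router would need to pick the right model for each task without knowing the answer in advance.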
Topics
- KWBench
- Problem Recognition
- Game Theory
- Language Model Evaluation
- Adversarial Reasoning
Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.