KWBench: Measuring Unprompted Problem Recognition in Knowledge Work
Summary
KWBench (Knowledge Work Bench) is a benchmark that evaluates whether large language models (LLMs) can recognize a problem unprompted in professional scenarios. Unlike existing benchmarks that focus on extraction or task completion, KWBench tests whether an LLM can identify the underlying structure of a situation from raw inputs when the problem type is not stated. The benchmark comprises 223 tasks sourced from practitioners in fields such as acquisitions, clinical pharmacy, and fraud analysis. Each task encodes a formal game-theoretic pattern, such as principal-agent conflict or strategic omission, and its ground truth records the expert interpretation and the anticipated failure modes. Models are scored with a three-tier rubric that includes mandatory conjunctive checks for predicted wrong paths. In initial evaluations of 16 models, the best model passes only 27.9% of tasks, and the top two models agree on just 31.7% of their successful passes. Routing across the top 8 models covers 50.7% of the benchmark, well above the 27.9% achieved by the best single model.
Key takeaway
Research scientists developing or deploying LLMs for knowledge work should evaluate models on whether they recognize problems unprompted, not only on how well they execute a task once the problem has been framed. The low pass rates on KWBench (27.9% for the best of 16 models) indicate a significant gap in current frontier models and suggest that this pre-solution recognition phase deserves direct attention. Consider adding benchmarks like KWBench to your evaluation pipeline to identify and address these limitations.
Key insights
LLMs struggle with unprompted problem recognition in knowledge work, even when they understand underlying concepts.
Principles
- Problem recognition precedes solution.
- Unprompted recognition is distinct from prompted application.
Method
KWBench evaluates LLMs on unprompted problem recognition using 223 practitioner-sourced tasks, each encoding a game-theoretic pattern, scored via a three-tier rubric with mandatory failure-path checks.
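To make the idea of mandatory conjunctive checks concrete, here is a minimal sketch of how such a gate might sit in front of a graded rubric. The summary does not say how the three tiers are combined, so the tier names, weights, pass threshold, and the `RubricResult` and `score_task` names are all assumptions made for illustration, not the benchmark's actual scoring code.

```python
from dataclasses import dataclass, field


@dataclass
class RubricResult:
    # One flag per predicted wrong path: True means the response avoided it.
    mandatory: list[bool]
    # Hypothetical graded tiers: fraction of criteria met per tier (0.0 to 1.0).
    tier_scores: dict[str, float] = field(default_factory=dict)


def score_task(
    result: RubricResult,
    tier_weights: dict[str, float],
    pass_threshold: float = 0.5,  # assumed value, not from the source
) -> tuple[float, bool]:
    """Return (score, passed) for one task."""
    # Conjunctive gate: a single failed mandatory check fails the task,
    # no matter how well the response does on the graded tiers.
    if not all(result.mandatory):
        return 0.0, False

    # Weighted average over the tiers (assumed aggregation).
    total_weight = sum(tier_weights.values())
    score = sum(
        weight * result.tier_scores.get(tier, 0.0)
        for tier, weight in tier_weights.items()
    ) / total_weight
    return score, score >= pass_threshold


# Illustrative weights only.
WEIGHTS = {"tier1": 3.0, "tier2": 2.0, "tier3": 1.0}

fell_for_wrong_path = RubricResult(
    mandatory=[True, False],
    tier_scores={"tier1": 1.0, "tier2": 1.0, "tier3": 1.0},
)
avoided_wrong_paths = RubricResult(
    mandatory=[True, True],
    tier_scores={"tier1": 1.0, "tier2": 0.5, "tier3": 0.0},
)

gated = score_task(fell_for_wrong_path, WEIGHTS)
graded = score_task(avoided_wrong_paths, WEIGHTS)
```

The point of the gate is that a response scoring perfectly on every graded criterion still fails if it follows one predicted wrong path, which is what distinguishes a conjunctive check from an averaged one.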
In practice
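- Evaluate LLMs on problem framing, not only on execution after the problem has been framed.
- Consider routing or ensembling across models: routing across the top 8 covers 50.7% of KWBench, versus 27.9% for the best single model.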
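The routing and agreement figures reported for KWBench can be reproduced in spirit from per-model pass sets. The sketch below assumes that "coverage" means the share of tasks passed by at least one model (an oracle router) and that "agreement" is the overlap of two models' passes divided by their union. Both definitions, the function names, and the toy data are assumptions; the summary does not state the exact formulas.

```python
def pass_rate(passed: set[int], n_tasks: int) -> float:
    """Fraction of tasks a single model passes."""
    return len(passed) / n_tasks


def oracle_coverage(passes_by_model: dict[str, set[int]], n_tasks: int) -> float:
    """Fraction of tasks passed by at least one model in the pool."""
    covered = set().union(*passes_by_model.values())
    return len(covered) / n_tasks


def pass_agreement(a: set[int], b: set[int]) -> float:
    """Shared passes divided by all passes of either model (assumed definition)."""
    either = a | b
    return len(a & b) / len(either) if either else 0.0


# Toy data: IDs of the tasks each hypothetical model passed, out of 10 tasks.
N_TASKS = 10
passes = {
    "model_a": {1, 2, 3},
    "model_b": {3, 4},
    "model_c": {5},
}

best_single = max(pass_rate(p, N_TASKS) for p in passes.values())
coverage = oracle_coverage(passes, N_TASKS)
agreement = pass_agreement(passes["model_a"], passes["model_b"])
```

When agreement between strong models is low, their pass sets are largely disjoint, so the union grows much faster than any single model's pass rate. An oracle union is an upper bound; a real router has to pick a model without knowing in advance which one will pass.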
Topics
- KWBench
- Unprompted Problem Recognition
- Large Language Models
- Game-Theoretic Patterns
- Knowledge Work Evaluation
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.