KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

2026-04-17 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, quick

Summary

KWBench (Knowledge Work Bench) is a new benchmark that evaluates whether large language models (LLMs) can recognize problems unprompted in professional scenarios. Unlike existing benchmarks that focus on extraction or task completion, KWBench tests whether an LLM can identify the underlying structure of a situation from raw inputs, without being told what type of problem it faces. The benchmark comprises 223 tasks sourced from practitioners in fields such as acquisitions, clinical pharmacy, and fraud analysis. Each task encodes a formal game-theoretic pattern, such as principal-agent conflict or strategic omission, and its ground truth records the expert interpretation and the anticipated failure modes. Models are scored with a three-tier rubric that includes mandatory conjunctive checks for predicted wrong paths. In initial evaluations of 16 models, the best model passes only 27.9% of tasks, and the top two models agree on just 31.7% of their successful passes. Routing across the top 8 models covers 50.7% of the benchmark, well above the 27.9% achieved by the best single model.

Key takeaway

Research scientists developing or deploying LLMs for knowledge work should evaluate whether models recognize problems unprompted, not only whether they execute a task once the problem has been framed. The low pass rates on KWBench point to a substantial gap in current frontier models, suggesting that this pre-solution recognition stage deserves direct attention when building assistants. Consider adding a benchmark like KWBench to your evaluation pipeline to identify and address these foundational limitations.

Key insights

LLMs struggle with unprompted problem recognition in knowledge work, even when they understand the underlying concepts.

Principles

Problem recognition precedes solution.
Unprompted recognition is distinct from prompted application.

Method

KWBench evaluates LLMs on unprompted problem recognition using 223 practitioner-sourced tasks, each encoding a game-theoretic pattern, scored via a three-tier rubric with mandatory failure-path checks.
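
The conjunctive scoring described above can be illustrated with a minimal Python sketch. The summary does not specify the rubric's internals, so everything here is an assumption for illustration: the tier names (mandatory, core, bonus), the 0.5 threshold on the core tier, and the class and function names are hypothetical and are not KWBench's actual schema. The one property taken from the source is that the mandatory checks are conjunctive, so a response that falls into any predicted wrong path fails the task regardless of its other scores.

```python
from dataclasses import dataclass


@dataclass
class RubricChecks:
    """Per-task grading outcome. Field names are hypothetical, not KWBench's."""
    mandatory: list[bool]  # one entry per predicted wrong path; True = path avoided
    core: list[bool]       # assumed second tier: expert-interpretation criteria
    bonus: list[bool]      # assumed third tier: recorded here, not used for pass/fail


def task_passes(checks: RubricChecks, core_threshold: float = 0.5) -> bool:
    """Apply the conjunctive gate first, then an assumed threshold on the core tier."""
    # Conjunctive: a single failed mandatory check fails the whole task.
    if not all(checks.mandatory):
        return False
    if not checks.core:
        return True
    return sum(checks.core) / len(checks.core) >= core_threshold


def pass_rate(results: list[RubricChecks]) -> float:
    """Fraction of tasks passed under the gate above."""
    if not results:
        return 0.0
    return sum(task_passes(r) for r in results) / len(results)


# Made-up grading outcomes for two tasks.
recognized = RubricChecks(mandatory=[True, True], core=[True, True, False], bonus=[False])
trapped = RubricChecks(mandatory=[True, False], core=[True, True, True], bonus=[True])
print(task_passes(recognized), task_passes(trapped), pass_rate([recognized, trapped]))
```

In this sketch the second response satisfies every core criterion yet still fails, because it took one of the predicted wrong paths. That is the behavior a conjunctive gate is meant to enforce.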

In practice

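Evaluate LLMs on problem framing, not only on executing a task once the problem is stated.
Consider ensemble or routing approaches for complex tasks: in the reported results, the top 8 models together cover 50.7% of the benchmark, versus 27.9% for the best single model.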
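
To gauge whether combining models is worthwhile on your own evaluation set, one option is to compare each model's set of passed tasks. The sketch below is a rough illustration under stated assumptions: coverage is computed as an oracle upper bound (a task counts as covered if any model passes it), and overlap is computed as intersection over union of two models' passed tasks. The summary does not say how KWBench computes its routing or agreement figures, so these definitions, the function names, and the task IDs are all hypothetical.

```python
def routing_coverage(pass_sets: list[set[str]], n_tasks: int) -> float:
    """Oracle coverage: share of tasks passed by at least one model."""
    covered = set().union(*pass_sets)
    return len(covered) / n_tasks


def pass_overlap(a: set[str], b: set[str]) -> float:
    """Assumed agreement measure: shared passes over all passes by either model."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up results on a 4-task toy benchmark.
n_tasks = 4
model_a = {"t1", "t2"}
model_b = {"t2", "t3"}

best_single = max(len(model_a), len(model_b)) / n_tasks
print(best_single)                                    # best individual pass rate
print(routing_coverage([model_a, model_b], n_tasks))  # combined oracle coverage
print(pass_overlap(model_a, model_b))                 # how much the passes overlap
```

Low overlap together with a large gap between combined coverage and the best single pass rate indicates that the models succeed on different tasks, which is when routing has the most to offer. An oracle bound assumes a perfect router, so a real routing policy would land somewhere below it.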

Topics

KWBench
Unprompted Problem Recognition
Large Language Models
Game-Theoretic Patterns
Knowledge Work Evaluation

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.