KWBench: Measuring Unprompted Problem Recognition in Knowledge Work

2026-04-20 · Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Robotics & Autonomous Systems, Data Science & Analytics · Depth: Expert, extended

Summary

KWBench (Knowledge Work Bench) is a new benchmark that evaluates whether large language models (LLMs) can recognize the real problem in a professional scenario without being told what kind of problem it is. Existing benchmarks focus on task execution and often skip this step. KWBench comprises 223 tasks derived from real-world incidents across domains such as acquisitions, contract negotiations, and clinical pharmacy, each encoding a formal game-theoretic pattern such as a principal-agent conflict or a signaling game. Models receive raw data and a task prompt with no hint about the problem type, and a code interpreter is available on every task. Scoring uses a three-tier rubric with a mandatory gate: the core criteria encode predicted wrong paths, and failing any one of them zeroes the task score. Across 16 models from 10 organizations, the best model passes only 27.9% of tasks, and the top two models agree on just 31.7% of their passes. The results point to a significant gap in unprompted problem recognition: models often produce polished, confident output that addresses the wrong problem.

Key takeaway

AI engineers building autonomous agents for knowledge work should recognize that current frontier LLMs exhibit a "cooperative default" and struggle with unprompted adversarial reasoning. Relying solely on instruction-following or larger models is not enough. Systems should incorporate dynamic routing across diverse models, or explicit training signals that reward game-theoretic thinking, to avoid confident but fundamentally misframed analyses in critical professional contexts.

Key insights

LLMs excel at execution but largely fail at unprompted problem recognition in complex, imperfect-information professional scenarios.

Principles

Knowledge work involves imperfect-information games.
Problem recognition is distinct from execution quality.
Capability is distributed across models rather than concentrated in one: the top two models agree on only 31.7% of their passes.

Method

KWBench evaluates LLMs on 223 real-world tasks, requiring unprompted recognition of game-theoretic patterns from raw inputs. A mandatory gate zeroes scores if core problem framing is missed, even with otherwise polished output.
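
The gating rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the benchmark's actual scorer: the tier names ("core", "important", "bonus"), the weights, and the `Criterion` and `gated_score` names are all assumptions made for the example. The source states only that the rubric has three tiers and that failing any core criterion zeroes the score.

```python
from dataclasses import dataclass

# Hypothetical tier weights; the source does not specify how tiers are weighted.
TIER_WEIGHTS = {"core": 3.0, "important": 2.0, "bonus": 1.0}


@dataclass
class Criterion:
    name: str
    tier: str      # one of the assumed tier names in TIER_WEIGHTS
    passed: bool


def gated_score(criteria):
    """Return a score in [0, 1]; any failed core criterion forces 0.0."""
    # Mandatory gate: a single missed core criterion zeroes the task.
    if any(c.tier == "core" and not c.passed for c in criteria):
        return 0.0
    total = sum(TIER_WEIGHTS[c.tier] for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(TIER_WEIGHTS[c.tier] for c in criteria if c.passed)
    return earned / total


# A polished answer that misses the core framing still scores zero.
misframed = [
    Criterion("identifies the principal-agent conflict", "core", False),
    Criterion("analysis is well structured", "important", True),
    Criterion("quantifies the exposure", "bonus", True),
]
print(gated_score(misframed))  # 0.0: the gate fires
```

The design point is that the gate is checked before any partial credit is computed, so execution quality cannot compensate for a missed problem framing.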

In practice

Relying on a single LLM for complex knowledge work is inadequate.
Ensemble LLM architectures can significantly improve task coverage.
Explicitly train LLMs for adversarial counterparty modeling.
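
To make the ensemble point concrete, the sketch below computes how much two models' passes overlap and what coverage a perfect router could reach. It rests on assumptions: agreement is measured here as Jaccard overlap of pass sets, which may differ from the paper's definition, the function names are hypothetical, and the task IDs are toy data rather than KWBench results.

```python
def pass_rate(passed, n_tasks):
    """Fraction of all tasks that a single model passes."""
    return len(passed) / n_tasks


def jaccard_agreement(a, b):
    """Shared passes divided by tasks passed by either model (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def oracle_coverage(pass_sets, n_tasks):
    """Upper bound on coverage: a task counts if any model passes it."""
    covered = set().union(*pass_sets)
    return len(covered) / n_tasks


# Toy data: task IDs passed by two hypothetical models on a 10-task benchmark.
model_a = {1, 2, 3, 4}
model_b = {3, 4, 5, 6, 7}
n_tasks = 10

print(pass_rate(model_a, n_tasks))
print(pass_rate(model_b, n_tasks))
print(jaccard_agreement(model_a, model_b))
print(oracle_coverage([model_a, model_b], n_tasks))
```

Low agreement combined with a higher oracle coverage is what makes routing attractive. The oracle figure assumes the router always picks a model that passes, so a real router would land somewhere between the best single model and that bound.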

Topics

KWBench
Problem Recognition
Game Theory
Language Model Evaluation
Adversarial Reasoning

Code references

ankitmaloo/fasteval

Best for: Research Scientist, AI Engineer, NLP Engineer, AI Scientist, Machine Learning Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.