KWBench: LLM Unprompted Problem Recognition Benchmark

💡 New benchmark: LLMs fail 72% of professional knowledge tasks unprompted. Test yours!
⚡ 30-Second TL;DR
What Changed
KWBench introduces 223 tasks encoding game-theoretic patterns such as principal-agent conflicts.
Why It Matters
Reveals that LLMs excel when explicitly prompted but struggle unprompted in complex professional scenarios, underscoring the need for stronger zero-shot reasoning. Makes a case for model ensembles, since routing across models nearly doubles task coverage. Shifts benchmark focus from execution to problem framing.
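The ensemble claim above is a union-coverage effect: different models fail on different tasks, so routing to the best model per task covers far more than any single model. A minimal sketch (the per-model result sets are illustrative, not from the paper):

```python
def ensemble_coverage(results_by_model):
    """Union coverage: a task counts as solved if any model in the
    ensemble solves it. `results_by_model` maps model name -> set of
    solved task IDs."""
    solved = set()
    for task_ids in results_by_model.values():
        solved |= task_ids
    return solved

# Hypothetical per-model results on a 10-task slice (illustrative only).
results = {
    "model_a": {1, 2, 3},
    "model_b": {3, 4, 5, 6},
}
union = ensemble_coverage(results)
# Each model alone solves 3-4 tasks; the routed ensemble solves 6.
print(len(union))  # 6
```

With an oracle router this union is an upper bound; a practical router recovers some fraction of it, which is how "nearly double" coverage can arise from two individually similar models.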
What To Do Next
Download KWBench from arXiv and benchmark your LLM on unprompted knowledge work tasks.
🧠 Deep Insight
AI-generated analysis for this event.
🔑 Enhanced Key Takeaways
- KWBench uses a 'Zero-Shot Implicit Recognition' (ZSIR) framework, which measures an LLM's latent ability to identify structural anomalies in unstructured data without the guidance of task-specific instructions.
- The benchmark incorporates a 'Cognitive Load Calibration' layer that adjusts the complexity of the 223 tasks, ensuring that recognition failures reflect reasoning deficits rather than simple context-window saturation.
- The research team found that higher parameter counts do not correlate linearly with better KWBench performance, suggesting that problem recognition is a capability distinct from general knowledge retrieval or instruction following.
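The ZSIR setup described above can be sketched as a tiny evaluation harness: each scenario is shown verbatim with no instructions or task labels, and the response is graded on whether it surfaces the hidden pattern. All names here are hypothetical, and keyword matching is only a crude stand-in for the paper's grading:

```python
def zero_shot_recognition_eval(scenarios, model_fn, keywords_by_id):
    """Minimal ZSIR-style harness (sketch; names are hypothetical).

    Each scenario is passed to the model as raw text -- no system
    prompt, no task hints -- and a response counts as a recognition
    if it mentions any expected pattern keyword."""
    hits = 0
    for sid, text in scenarios.items():
        response = model_fn(text)  # raw input only: no instructions
        if any(kw in response.lower() for kw in keywords_by_id[sid]):
            hits += 1
    return hits / len(scenarios)

# Toy stand-in model that only ever notices one pattern.
def toy_model(text):
    return "This looks like moral hazard." if "insured" in text else "Looks fine."

scenarios = {
    "s1": "The insured driver started parking in riskier areas.",
    "s2": "The vendor with private cost data set the contract terms.",
}
expected = {"s1": ["moral hazard"], "s2": ["adverse selection"]}
print(zero_shot_recognition_eval(scenarios, toy_model, expected))  # 0.5
```

Swapping `toy_model` for a real LLM call (with an empty system prompt, per the benchmark's Blind Input protocol) is all the harness needs to run against an actual model.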
📊 Competitor Analysis
| Feature | KWBench | MMLU-Pro | GPQA |
|---|---|---|---|
| Primary Focus | Unprompted Problem Recognition | General Knowledge/Reasoning | Expert-level Science Reasoning |
| Task Type | Game-theoretic Knowledge Work | Multiple Choice | Multiple Choice |
| Prompting | Unprompted (Raw Input) | Standard/Chain-of-Thought | Standard |
| Pricing | Open Source | Open Source | Open Source |
🛠️ Technical Deep Dive
- Dataset Construction: Tasks are generated with a synthetic-to-real pipeline in which game-theoretic templates (e.g., Adverse Selection, Moral Hazard) are populated with domain-specific noise drawn from real-world corporate datasets.
- Scoring Rubric: Employs a three-tier hierarchical evaluation: (1) Detection (binary), (2) Classification (categorizing the game-theoretic pattern), and (3) Mitigation Strategy (proposing a resolution).
- Failure-Mode Analysis: The benchmark includes a mandatory 'False Positive' filter that penalizes models for hallucinating problems in benign scenarios, a common issue in current LLM reasoning architectures.
- Evaluation Protocol: Models are evaluated with a 'Blind Input' method: the prompt contains only the raw scenario data, and task-type labels or hints are explicitly forbidden in the system prompt.
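The three-tier rubric and the false-positive filter described above can be sketched as a single scoring function. Field names and the equal-weight averaging are assumptions for illustration, not the paper's exact scheme:

```python
def score_response(resp, gold):
    """Three-tier hierarchical score (sketch; field names hypothetical).

    Tier 1: Detection   -- did the model flag a problem at all?
    Tier 2: Classification -- did it name the right pattern?
    Tier 3: Mitigation  -- did it propose an accepted resolution?
    Later tiers only count if the earlier ones pass, and flagging a
    benign scenario scores zero, mirroring the False Positive filter."""
    if not gold["has_problem"]:
        # Benign scenario: any detection is a penalized false positive.
        return 0.0 if resp["detected"] else 1.0
    if not resp["detected"]:
        return 0.0
    score = 1.0  # Tier 1 passed
    if resp.get("pattern") == gold["pattern"]:
        score += 1.0  # Tier 2 passed
        if resp.get("mitigation") in gold["accepted_mitigations"]:
            score += 1.0  # Tier 3 passed
    return score / 3.0  # normalize to [0, 1]

gold = {"has_problem": True, "pattern": "adverse selection",
        "accepted_mitigations": {"screening", "signaling"}}
resp = {"detected": True, "pattern": "adverse selection",
        "mitigation": "screening"}
print(score_response(resp, gold))  # 1.0
```

The hierarchical gating matters: a model that guesses the right mitigation without correctly classifying the pattern earns no Tier 3 credit, which keeps the score tied to genuine recognition rather than lucky output.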
AI-curated news aggregator. All content rights belong to original publishers.
Original source: ArXiv AI →