Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic
Summary
A new benchmark, the Novel Operator Test, shows that large language models (LLMs) can execute every step of chain-of-thought (CoT) reasoning correctly and still declare an incorrect final answer. The test evaluates Boolean operators presented under unfamiliar names at chain depths 1-10, across five models including GPT-4o, Claude Sonnet 4, and Llama 3.1 70B Instruct. At depth 7, Claude Sonnet 4 made 31 errors in which the reasoning was verifiably correct but the declared answer was wrong; the same pattern appeared in 17 of 19 errors on mixed-operator chains. This "reasoning-output dissociation" differs from CoT unfaithfulness, in which the stated reasoning does not reflect how the model actually reached its answer. The benchmark identifies two failure types: "strategy failures" at depth 2, where models attempt terse retrieval instead of reasoning step by step, and "content failures" at depth 7, where models reason fully but err systematically. A Trojan operator (XOR's truth table under a novel name) confirmed that name unfamiliarity alone does not gate reasoning, while Llama's novelty gap widened to 28pp at depths 8-9, isolating genuine difficulty with novel logic.
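The Trojan-operator idea can be illustrated with a minimal sketch: an operator counts as a Trojan when its truth table matches a familiar operator and only its name is new. The names `blick` and `zorp`, the helper `familiar_twin`, and the short list of familiar operators are hypothetical and chosen for illustration; they are not taken from the benchmark.

```python
from typing import Dict, Optional, Tuple

TruthTable = Dict[Tuple[int, int], int]

# A small, illustrative set of familiar operators (assumption, not the paper's list).
FAMILIAR = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}

def familiar_twin(table: TruthTable) -> Optional[str]:
    """Return the familiar operator whose truth table equals `table`, if any."""
    for name, fn in FAMILIAR.items():
        if all(table[(a, b)] == fn(a, b) for a in (0, 1) for b in (0, 1)):
            return name
    return None

# Hypothetical Trojan: a new name attached to XOR's truth table.
blick: TruthTable = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
# Hypothetical operator whose table matches none of the familiar ones above.
zorp: TruthTable = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 0}

trojan_match = familiar_twin(blick)
novel_match = familiar_twin(zorp)
```

Comparing model accuracy on an operator like `blick` against an operator like `zorp` is one way to separate name unfamiliarity from unfamiliar logic.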
Key takeaway
AI engineers and research scientists developing or deploying LLMs for logical tasks should validate more than the chain-of-thought steps. Because of the reasoning-output dissociation, a model can reason correctly and still declare the wrong answer. Consider explicit truth-table tracing (ETT) or similar structured scaffolding, especially for novel or complex logical operations, and always verify the final output independently of the reasoning trace.
Key insights
LLMs can produce correct reasoning steps but still output wrong final answers, revealing a reasoning-output dissociation.
Principles
- Verifying CoT correctness is insufficient to guarantee answer correctness.
- Familiar operators are internalized; novel ones require explicit reasoning.
- Restrictive token limits can create phantom performance gaps, that is, gaps caused by the limit rather than by the model's reasoning ability.
Method
The Novel Operator Test evaluates LLMs on Boolean operators with unfamiliar names across varying chain depths, using truth tables for definition and explicit truth-table tracing (ETT) for scaffolding.
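To make the setup concrete, here is a minimal sketch of an operator defined only by its truth table, applied along a chain, with one explicit table lookup recorded per step in the spirit of ETT. The operator name `zorp`, its table, the left-to-right chain structure, the reading of "depth" as the number of operator applications, and the trace wording are all assumptions for illustration, not the benchmark's actual format.

```python
from typing import Dict, List, Tuple

TruthTable = Dict[Tuple[int, int], int]

# Hypothetical novel operator: the name and table are illustrative only.
ZORP: TruthTable = {(0, 0): 0, (0, 1): 1, (1, 0): 0, (1, 1): 0}

def trace_chain(table: TruthTable, name: str, inputs: List[int]) -> Tuple[int, List[str]]:
    """Apply the operator left to right, recording one truth-table lookup per step."""
    if len(inputs) < 2:
        raise ValueError("need at least two inputs")
    acc = inputs[0]
    steps: List[str] = []
    for i, x in enumerate(inputs[1:], start=1):
        out = table[(acc, x)]
        steps.append(f"Step {i}: {name}({acc}, {x}) -> look up row ({acc}, {x}) -> {out}")
        acc = out
    return acc, steps

# A chain with three operator applications.
answer, steps = trace_chain(ZORP, "zorp", [1, 0, 0, 1])
for line in steps:
    print(line)
print("Answer:", answer)
```

A trace of this shape can serve both as ground truth for scoring and as a template for the step-by-step format a scaffolded prompt asks the model to follow.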
In practice
- Scaffold LLMs with ETT for novel logical tasks.
- Ensure sufficient token limits for complex reasoning.
- Do not assume correct CoT guarantees correct final output.
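The last point can be turned into a simple check. The sketch below labels a response by comparing the declared answer with ground truth and, separately, with the value the reasoning trace ends on. The function names, the label strings, and the assumption that each trace line ends in a 0 or 1 are hypothetical. It inspects only the final traced value, which is a simplification: a full check would verify every step.

```python
import re
from typing import List, Optional

def last_traced_value(trace: List[str]) -> Optional[int]:
    """Return the 0/1 value at the end of the final trace line, or None if absent."""
    if not trace:
        return None
    match = re.search(r"([01])\s*$", trace[-1])
    return int(match.group(1)) if match else None

def classify(trace: List[str], declared: int, expected: int) -> str:
    """Label a response as correct, a reasoning-output dissociation, or a reasoning error."""
    if declared == expected:
        return "correct"
    if last_traced_value(trace) == expected:
        # The trace reaches the right value, but the declared answer does not match it.
        return "dissociation"
    return "reasoning_error"

example_trace = ["Step 1: zorp(1, 0) -> 0", "Step 2: zorp(0, 1) -> 1"]
label = classify(example_trace, declared=0, expected=1)
```

Scoring the declared answer against independently computed ground truth, rather than trusting the trace, is what surfaces the dissociation cases.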
Topics
- Novel Operator Test
- Reasoning-Output Dissociation
- Chain-of-Thought Reasoning
- LLM Logical Reasoning
- Strategy Failure Mode
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.