Correct Chains, Wrong Answers: Dissociating Reasoning from Output in LLM Logic

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, extended

Summary

A new benchmark, the Novel Operator Test, shows that large language models (LLMs) can execute every step of a chain-of-thought (CoT) trace correctly and still declare an incorrect final answer. The test evaluates Boolean operators presented under unfamiliar names at chain depths 1-10 on five models, including GPT-4o, Claude Sonnet 4, and Llama 3.1 70B Instruct. At depth 7, Claude Sonnet 4 made 31 errors in which the reasoning was verifiably correct but the declared answer was wrong, and 17 of 19 errors on mixed-operator chains showed the same pattern. This "reasoning-output dissociation" differs from CoT unfaithfulness: here the written reasoning is correct and only the declared answer is wrong. The benchmark identified two failure types: "strategy failures" at depth 2, where models attempt terse retrieval, and "content failures" at depth 7, where models reason fully but err systematically. A Trojan operator (XOR under a novel name) confirmed that an unfamiliar name alone does not gate reasoning, while Llama's novelty gap widened to 28 percentage points (pp) at depths 8-9, isolating genuine difficulty with novel logic.

Key takeaway

AI engineers and research scientists who develop or deploy LLMs for logical tasks should validate more than the chain-of-thought steps. Because of the observed reasoning-output dissociation, a model can reason correctly and still declare the wrong answer. Consider adding explicit truth-table tracing (ETT) or similar structured scaffolding, especially for novel or complex logical operations, and always verify the final output independently of the reasoning trace.
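
One way to act on this is to check the declared answer and the written steps separately. The sketch below is illustrative only: the response format (one `name(a, b) = v` line per step, followed by `Final answer: v`), the operator name `zorp`, and the function `classify` are assumptions made for this example, not part of the benchmark or of any model's API.

```python
import re

# Assumed response format (hypothetical): one "name(a, b) = v" line per step,
# then a line "Final answer: v".
STEP_RE = re.compile(r"(\w+)\((\d),\s*(\d)\)\s*=\s*(\d)")
ANSWER_RE = re.compile(r"Final answer:\s*(\d)", re.IGNORECASE)


def classify(response, table, start, inputs):
    """Label a response as 'correct', 'dissociation', 'reasoning_error',
    or 'unparseable' by checking the answer and the trace independently."""
    # Recompute the ground-truth trace from the operator's truth table.
    expected_steps = []
    acc = start
    for b in inputs:
        out = table[(acc, b)]
        expected_steps.append((acc, b, out))
        acc = out
    truth = acc

    answer = ANSWER_RE.search(response)
    if answer is None:
        return "unparseable"
    if int(answer.group(1)) == truth:
        return "correct"  # judged on the declared answer only

    # The declared answer is wrong: was the written trace right anyway?
    written = [(int(a), int(b), int(v)) for _, a, b, v in STEP_RE.findall(response)]
    if written == expected_steps:
        return "dissociation"  # correct chain, wrong declared answer
    return "reasoning_error"


# Made-up operator table and made-up model reply, for illustration.
ZORP = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 0}
reply = "zorp(1, 0) = 0\nzorp(0, 1) = 0\nzorp(0, 0) = 1\nFinal answer: 0"
print(classify(reply, ZORP, 1, [0, 1, 0]))  # dissociation
```

The regex match is deliberately strict: a response that restates the truth table before reasoning, or that uses a different step format, would need a more tolerant parser.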

Key insights

LLMs can produce correct reasoning steps but still output wrong final answers, revealing a reasoning-output dissociation.

Method

The Novel Operator Test evaluates LLMs on Boolean operators with unfamiliar names across varying chain depths. Each operator is defined by its truth table, and explicit truth-table tracing (ETT) serves as scaffolding.
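
A minimal sketch of what such a problem might look like follows. Everything specific here is an assumption for illustration: the operator names `zorp` and `blick`, the left-nested chain shape, the prompt wording, and the helper names are invented, and the benchmark's actual construction may differ. The step-by-step lookup in `trace_chain` is meant to mirror the idea behind truth-table tracing, not to reproduce the paper's ETT prompt.

```python
import random
from itertools import product

# Hypothetical novel operator: the name and table are made up for this sketch.
ZORP = {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 0}

# Trojan-style operator: XOR's truth table under an invented name.
BLICK = {(a, b): a ^ b for a, b in product((0, 1), repeat=2)}


def make_chain(depth, rng):
    """Sample a start bit and `depth` further input bits (assumed chain shape)."""
    start = rng.randint(0, 1)
    inputs = [rng.randint(0, 1) for _ in range(depth)]
    return start, inputs


def trace_chain(table, start, inputs):
    """Evaluate the chain one table lookup at a time, recording every step."""
    acc = start
    steps = []
    for b in inputs:
        out = table[(acc, b)]
        steps.append((acc, b, out))
        acc = out
    return steps, acc


def render_prompt(name, table, start, inputs):
    """Render the operator definition and a left-nested expression to evaluate."""
    rows = "\n".join(f"{name}({a}, {b}) = {v}" for (a, b), v in sorted(table.items()))
    expr = str(start)
    for b in inputs:
        expr = f"{name}({expr}, {b})"
    return f"Operator {name} is defined by:\n{rows}\n\nEvaluate: {expr}"


rng = random.Random(0)  # fixed seed so the sampled problem is reproducible
start, inputs = make_chain(7, rng)
steps, answer = trace_chain(ZORP, start, inputs)
print(render_prompt("zorp", ZORP, start, inputs))
print(steps, answer)
```

Keeping the recorded steps alongside the final value gives a reference trace that a model's written reasoning can later be compared against, separately from its declared answer.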

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, NLP Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.