Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts
Summary
Recent large language models (LLMs) show improved latent n-hop reasoning, that is, the ability to answer multi-step questions without Chain-of-Thought (CoT). On a new dataset designed to evaluate n-hop latent reasoning on natural facts, Gemini 3 Pro achieves 60% accuracy on 2-hop questions and 34% on 3-hop questions. Opus 4 reaches 31% on 2-hop and 7% on 3-hop questions, outperforming Opus 4.5. All evaluated models show near-chance accuracy on 4-hop questions. Older models such as GPT-4 perform much worse, at 9.7% on 2-hop and 3.9% on 3-hop questions. Filler tokens (e.g., counting from 1 to 300) and problem repeats substantially boost performance for capable models: Gemini 3 Pro's 3-hop accuracy nearly doubles, from 18% to 34%, with filler tokens.
Key takeaway
AI engineers evaluating LLMs on complex, multi-step tasks without explicit reasoning chains should consider adding filler tokens or problem repeats to their prompts. These techniques can substantially improve latent n-hop reasoning performance, as demonstrated by Gemini 3 Pro and Opus 4, potentially enabling more accurate direct answers in some applications. The dataset and code provided with the original article are also worth exploring for robust evaluation.
Key insights
Recent LLMs show moderate latent multi-hop reasoning ability, which filler tokens and problem repeats improve substantially.
Principles
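- Latent reasoning performance generally improves with model capability, though not strictly: Opus 4 outperforms Opus 4.5.
- Prompt-level techniques such as filler tokens and problem repeats can significantly enhance latent reasoning.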
Method
A new dataset for n-hop latent reasoning on natural facts was constructed, and models were evaluated using no-CoT prompting, often with filler tokens or problem repeats to enhance performance.
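To make the setup concrete, here is a minimal Python sketch of how an n-hop question could be composed from single-hop facts and how answers could be scored per hop count. The function names, the nesting template, the exact-match grading, and the hexagon/carbon example are all illustrative assumptions, not taken from the original dataset or code.

```python
from collections import defaultdict

def compose_question(hops):
    """Nest hop descriptions inside-out (hypothetical format).

    hops[0] is the innermost fact; every later entry has an '{x}' slot
    that is filled with the parenthesized question built so far.
    """
    inner = hops[0]
    for template in hops[1:]:
        inner = template.format(x=f"({inner})")
    return f"What is {inner}?"

def normalize(text):
    """Loose normalization for exact-match grading (an assumption)."""
    return text.strip().lower().rstrip(".")

def accuracy_by_hops(records):
    """records: iterable of (n_hops, model_answer, gold_answer) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for n_hops, model_answer, gold_answer in records:
        total[n_hops] += 1
        if normalize(model_answer) == normalize(gold_answer):
            correct[n_hops] += 1
    return {n: correct[n] / total[n] for n in sorted(total)}

# Illustrative 2-hop question: a hexagon has 6 sides, and element 6 is carbon.
question = compose_question([
    "the number of sides of a hexagon",
    "the chemical element with atomic number {x}",
])
```

Here `question` is a single string that the model would have to answer directly, resolving the inner fact without writing it out.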
In practice
- Add filler tokens (e.g., counting from 1 to 300) to the prompt to boost latent reasoning.
- Repeat the problem statement to improve multi-hop accuracy.
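The two techniques above can be sketched as a small prompt builder. This is a hedged illustration only: the function name, the "Filler:" label, the ordering of the parts, and the instruction wording are assumptions, and the original study's exact prompt format may differ.

```python
def build_no_cot_prompt(question, repeats=1, filler_upto=0):
    """Build a no-CoT prompt with optional problem repeats and filler tokens.

    question:     the multi-hop question text
    repeats:      how many times to state the question (1 = no repetition)
    filler_upto:  if > 0, append a counting sequence 1..filler_upto as filler
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    # Problem repeats: state the same question several times.
    parts = [question] * repeats

    # Filler tokens: content-free text, here a simple count.
    if filler_upto > 0:
        counting = " ".join(str(i) for i in range(1, filler_upto + 1))
        parts.append("Filler: " + counting)

    # Ask for a direct answer so no reasoning is written out.
    parts.append("Answer immediately with only the final answer, without any reasoning.")
    return "\n\n".join(parts)

prompt = build_no_cot_prompt("Q?", repeats=3, filler_upto=300)
```

Comparing accuracy with `repeats=1, filler_upto=0` against the augmented variants on the same questions gives a simple way to check whether either technique helps a given model.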
Topics
- Latent Reasoning
- N-hop Reasoning
- LLM Evaluation
- Prompt Engineering
- Fact Composition
Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.