Recent LLMs can do 2-hop and 3-hop latent (no CoT) reasoning on natural facts

2024-06-17 · Source: Redwood Research blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

Recent large language models (LLMs) show improved latent n-hop reasoning, meaning they can answer multi-step questions without writing out a Chain-of-Thought (CoT). On a new dataset designed to evaluate n-hop latent reasoning on natural facts, Gemini 3 Pro achieves 60% accuracy on 2-hop questions and 34% on 3-hop questions. Opus 4 reaches 31% on 2-hop and 7% on 3-hop questions, outperforming Opus 4.5. All evaluated models show near-chance accuracy on 4-hop questions. Older models such as GPT-4 perform much worse, at 9.7% on 2-hop and 3.9% on 3-hop questions. The study also found that filler tokens (e.g., counting from 1 to 300) and problem repeats substantially boost performance for capable models: with filler tokens, Gemini 3 Pro's 3-hop accuracy nearly doubles, from 18% to 34%.

Key takeaway

AI engineers evaluating how well LLMs handle complex, multi-step tasks without explicit reasoning chains should consider adding filler tokens or problem repeats to their prompts. Both techniques can substantially improve latent n-hop reasoning performance, as demonstrated by Gemini 3 Pro and Opus 4, potentially enabling more accurate direct answers in certain applications. The provided dataset and code are also worth exploring for more robust evaluation.

Key insights

Recent LLMs show moderate latent multi-hop reasoning, significantly improved by filler tokens and problem repeats.

Principles

Latent reasoning performance scales with model capability.
Prompt engineering can significantly enhance latent reasoning.

Method

A new dataset for n-hop latent reasoning on natural facts was constructed, and models were evaluated using no-CoT prompting, often with filler tokens or problem repeats to enhance performance.
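
To make the idea of an n-hop question concrete, here is a minimal sketch of how a chain of natural facts could be composed into a single question. It is an illustration only, not the dataset's actual construction: the `FACTS` table, the `compose_question` helper, and the no-CoT instruction wording are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy knowledge base: (relation, subject) -> object.
# These entries are illustrative and are not taken from the dataset.
FACTS: Dict[Tuple[str, str], str] = {
    ("author", "Hamlet"): "William Shakespeare",
    ("birthplace", "William Shakespeare"): "Stratford-upon-Avon",
    ("country", "Stratford-upon-Avon"): "England",
}


def compose_question(start: str, relations: List[str]) -> Tuple[str, str]:
    """Build an n-hop question (n = len(relations)) and its gold answer.

    Each relation is one hop: it is looked up on the entity produced by
    the previous hop, and wrapped around the question phrase.
    """
    phrase = start
    entity = start
    for rel in relations:
        entity = FACTS[(rel, entity)]      # resolve this hop
        phrase = f"the {rel} of {phrase}"  # nest the description
    # Assumed no-CoT instruction; the original prompt wording may differ.
    question = f"What is {phrase}? Answer immediately with only the final answer."
    return question, entity


if __name__ == "__main__":
    q2, a2 = compose_question("Hamlet", ["author", "birthplace"])
    print(q2)
    print(a2)
```

The model only sees the composed question, so answering correctly requires resolving the intermediate entity (here, the author) internally rather than in written reasoning.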

In practice

Use filler tokens (e.g., counting 1 to 300) to boost LLM latent reasoning.
Repeat problem statements to improve multi-hop accuracy.
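
The two tips above can be sketched as a small prompt builder. This is a hedged illustration: the `build_prompt` function, its parameter names, the "Filler:" label, and the position of the filler relative to the question are assumptions, since this summary does not specify the exact prompt layout used in the study.

```python
def build_prompt(question: str, repeats: int = 1, filler_upto: int = 0) -> str:
    """Assemble a no-CoT prompt with optional problem repeats and filler tokens.

    repeats:     how many times the problem statement appears (>= 1).
    filler_upto: if > 0, append the numbers 1..filler_upto as filler tokens.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")

    # Repeat the problem statement verbatim.
    parts = [question] * repeats

    # Filler tokens: a plain count that carries no information about the
    # answer. The "Filler:" label is an assumed convention.
    if filler_upto > 0:
        counting = " ".join(str(i) for i in range(1, filler_upto + 1))
        parts.append(f"Filler: {counting}")

    # Ask for the answer directly, with no room for written reasoning.
    parts.append("Answer:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("What is the birthplace of the author of Hamlet?",
                       repeats=2, filler_upto=300))
```

Comparing accuracy with `repeats=1, filler_upto=0` against the boosted variants on the same questions gives a simple way to check whether either technique helps a given model.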

Topics

Latent Reasoning
N-hop Reasoning
LLM Evaluation
Prompt Engineering
Fact Composition

Best for: AI Engineer, NLP Engineer, Research Scientist, AI Researcher, AI Scientist, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Redwood Research blog.