A Visual Guide to Reasoning LLMs
Summary
The article describes a paradigm shift in Large Language Model (LLM) development from scaling train-time compute to scaling test-time compute, exemplified by models like DeepSeek-R1, OpenAI o3-mini, and Google Gemini 2.0 Flash Thinking. Reasoning LLMs break problems down into smaller steps, often called "thought processes" or "Chain-of-Thought," to improve accuracy during inference. Train-time compute covers model parameters, dataset size, and FLOPs spent during pre-training and fine-tuning, whereas test-time compute lets a model "think longer" by generating more tokens for systematic reasoning. Test-time techniques fall into two groups: search against verifiers (e.g., Majority Voting, Best-of-N samples with Outcome/Process Reward Models, Beam Search, Monte Carlo Tree Search) and modifying the proposal distribution (e.g., prompting, STaR). DeepSeek-R1, a 671B-parameter open-source model, achieved its reasoning capabilities primarily through reinforcement learning and synthetic data generation, without relying on verifiers, and its reasoning can be distilled into smaller models like Qwen-32B.
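Majority Voting, the simplest of the search techniques named above, can be sketched in a few lines. This is a minimal illustration, assuming the final answers have already been extracted from several sampled completions; the function name `majority_vote` and the normalization step are hypothetical choices, not taken from the article.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the share of samples that agree.

    `answers` is a list of final answers, one per sampled completion.
    Normalization (strip + lowercase) is an illustrative assumption.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    normalized = [a.strip().lower() for a in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)


# Five sampled answers to the same question; four agree after normalization.
samples = ["42", "42 ", "41", "42", "42"]
answer, agreement = majority_vote(samples)
```

No verifier model is involved here: the only signal is agreement between samples, which is why this technique is cheap but cannot recover when most samples share the same mistake.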
Key takeaway
For AI Scientists and Research Scientists optimizing LLM performance, focusing solely on increasing train-time compute is yielding diminishing returns. Explore test-time compute strategies, such as Chain-of-Thought prompting, verifier-based sampling (e.g., Best-of-N), or search algorithms like Monte Carlo Tree Search, to improve model reasoning and accuracy during inference. Consider the DeepSeek-R1 methodology of combining reinforcement learning with synthetic data generation to instill robust reasoning capabilities, which can then be distilled into smaller, more deployable models.
Key insights
Scaling LLM performance is shifting from training compute to inference-time reasoning, enabling models to "think longer."
Principles
- Reasoning LLMs break problems into smaller, structured inference steps.
- Test-time compute allows models to improve answers by generating more tokens for internal "thinking."
- Train-time and test-time compute are closely related; the best performance comes from balancing both rather than scaling one alone.
Method
DeepSeek-R1's reasoning capability was developed through a multi-step process: cold-start fine-tuning, reasoning-oriented reinforcement learning with accuracy and format rewards, rejection sampling for synthetic data generation, supervised fine-tuning, and a final RL stage for alignment with human preferences.
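The accuracy and format rewards mentioned above can be illustrated with a small rule-based sketch. Everything specific here is an assumption made for illustration: the `<think>...</think>` tag layout, exact-match answer checking, and the equal weighting of the two rewards. It is not the actual DeepSeek-R1 reward code.

```python
import re

# Assumed layout: reasoning inside <think> tags, then the final answer.
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the reasoning block equals the reference."""
    match = THINK_PATTERN.match(completion.strip())
    if match is None:
        return 0.0
    final_answer = match.group(2).strip()
    return 1.0 if final_answer == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Unweighted sum of both rewards (the weighting is an assumption)."""
    return accuracy_reward(completion, reference) + format_reward(completion)


well_formed = "<think>2 + 2 is 4</think> 4"
wrong_answer = "<think>a quick guess</think> 5"
no_reasoning = "4"
rewards = [total_reward(c, "4") for c in (well_formed, wrong_answer, no_reasoning)]
```

The same scoring function could double as a filter for the rejection-sampling step: keep only completions whose total reward reaches the maximum, and use them as synthetic training data.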
In practice
- Use "Let's think step-by-step" prompting for basic Chain-of-Thought behavior.
- Employ Best-of-N sampling with Reward Models to select optimal LLM outputs.
- Distill reasoning capabilities from large models into smaller ones for efficient deployment.
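The first two practices above can be sketched together: a Chain-of-Thought prompt wrapper plus Best-of-N selection. This is a minimal sketch under stated assumptions. The `generate` and `score` callables are hypothetical stand-ins for an LLM sampler and an outcome reward model, and the toy versions below exist only to make the example runnable.

```python
from typing import Callable, Tuple


def with_cot(question: str) -> str:
    """Append the step-by-step cue to elicit Chain-of-Thought behavior."""
    return f"{question}\nLet's think step-by-step."


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],      # hypothetical LLM sampler
    score: Callable[[str, str], float],  # hypothetical reward model
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidates and return the highest-scoring one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    scored = [(score(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Toy stand-ins so the sketch runs without a model.
canned = iter(["So the answer is 5", "So the answer is 4", "So the answer is 3"])


def toy_generate(prompt: str) -> str:
    return next(canned)


def toy_score(prompt: str, completion: str) -> float:
    # Pretend reward model: prefers completions ending in the right answer.
    return 1.0 if completion.endswith("4") else 0.0


prompt = with_cot("What is 2 + 2?")
best, best_score = best_of_n(prompt, toy_generate, toy_score, n=3)
```

Scoring only the final completion corresponds to an Outcome Reward Model; a Process Reward Model would instead score each intermediate reasoning step, which this sketch does not attempt.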
Topics
- Reasoning LLMs
- Test-Time Compute
- Train-Time Compute
- DeepSeek-R1
- Reinforcement Learning
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Student
Editorial summary, takeaway, and curation by AIssential. Original article published by Exploring Language Models.