A Visual Guide to Reasoning LLMs

· Source: Exploring Language Models · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Data Science & Analytics · Depth: Advanced, long

Summary

The article describes a shift in Large Language Model (LLM) development from scaling train-time compute to scaling test-time compute, exemplified by models such as DeepSeek-R1, OpenAI o3-mini, and Google Gemini 2.0 Flash Thinking. Reasoning LLMs break a problem into smaller steps, often called "thought processes" or "Chain-of-Thought," before answering, which improves accuracy at inference. Train-time compute means increasing model parameters, dataset size, and FLOPs during pre-training and fine-tuning; test-time compute instead lets a model "think longer" by generating more tokens for systematic reasoning. Test-time techniques fall into two groups: search against verifiers (e.g., Majority Voting, Best-of-N samples with Outcome or Process Reward Models, Beam Search, Monte Carlo Tree Search) and modifying the proposal distribution (e.g., prompting, STaR). DeepSeek-R1, a 671B-parameter open-source model, acquired its reasoning capabilities primarily through reinforcement learning and synthetic data generation, without relying on verifiers, and its reasoning can be distilled into smaller models such as Qwen-32B.
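
Two of the verifier-based search techniques named above, Majority Voting and Best-of-N, can be sketched in a few lines. This is a minimal illustration rather than the article's code: it assumes the sampled answers are already available as strings, and `score` is a hypothetical stand-in for an Outcome Reward Model that maps an answer to a number.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most frequent final answer among the sampled generations."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    # most_common orders equal counts by first occurrence, so ties go to
    # the answer that was sampled first.
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scoring function rates highest.

    `score` is a hypothetical placeholder for a reward model.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)
```

Majority Voting needs no verifier at all, only repeated sampling, whereas Best-of-N is only as good as the scoring function it is given.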

Key takeaway

For AI Scientists and Research Scientists optimizing LLM performance, increasing train-time compute alone is yielding diminishing returns. Explore test-time compute strategies, such as Chain-of-Thought prompting, verifier-based sampling (e.g., Best-of-N), or search algorithms like Monte Carlo Tree Search, to improve reasoning and accuracy during inference. Consider the DeepSeek-R1 approach of combining reinforcement learning with synthetic data generation to instill reasoning capabilities, which can then be distilled into smaller, more deployable models.

Key insights

Scaling LLM performance is shifting from training compute to inference-time reasoning, enabling models to "think longer."

Principles

Method

DeepSeek-R1's reasoning capability was developed in five steps: cold-start fine-tuning, reasoning-oriented reinforcement learning with accuracy and format rewards, rejection sampling to generate synthetic data, supervised fine-tuning on that data, and a final RL stage that aligns the model with human preferences.
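
The accuracy and format rewards used in the reinforcement learning step can be illustrated with simple rule-based checks. The sketch below is an assumption-laden illustration, not DeepSeek's implementation: it assumes the reasoning is wrapped in `<think>...</think>` tags followed by the final answer, that correctness is an exact string match against a reference, and that the two rewards are simply summed. All function names are hypothetical.

```python
import re

# Assumed output layout: <think>reasoning</think> followed by a final answer.
FORMAT_PATTERN = re.compile(r"<think>.+?</think>\s*\S.*", re.DOTALL)


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the assumed tag layout, else 0.0."""
    return 1.0 if FORMAT_PATTERN.fullmatch(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """1.0 if the text after the reasoning block equals the reference answer."""
    # Everything after the last closing tag is treated as the final answer;
    # without a tag, the whole completion is the answer.
    answer = completion.split("</think>")[-1].strip()
    return 1.0 if answer == reference.strip() else 0.0


def total_reward(completion: str, reference: str) -> float:
    """Sum of the two rule-based rewards (an assumed, unweighted combination)."""
    return accuracy_reward(completion, reference) + format_reward(completion)
```

Because both checks are plain rules, this kind of reward needs no learned verifier, which matches the summary's point that the model was trained without relying on one.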

Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Exploring Language Models.