Making LLMs Think Longer: Context, State, and Post-Training Tricks

2024-03-06 · Source: The Salt - Curated AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning · Depth: Expert, medium

Summary

This week's review covers four advances in large language models (LLMs). QwenLong-L1.5 is a post-training recipe for Qwen3-30B-A3B-Thinking that lets the model handle 256K-token contexts and scale to 1M–4M tokens through an external memory agent, with reported average gains of about 10 points on long-context benchmarks and results on par with proprietary models such as GPT-5. "State over Tokens" (SoT) reframes LLM reasoning tokens as externalized computational state rather than a transparent account of internal computation. Nemotron-Cascade applies cascaded, domain-wise reinforcement learning (RL) to mid-sized models (Qwen3 8B/14B) and surpasses a 671B teacher model on code benchmarks. Finally, VersatileFFN redesigns the Transformer feed-forward network (FFN) block for parameter efficiency: a single FFN is reused across width and depth under a difficulty-aware controller, and OLMo2-style 354M and 720M models built this way outperform parameter-matched baselines on zero-shot benchmarks.

Key takeaway

AI engineers building advanced LLMs should consider cascaded reinforcement learning for domain-specific training, which Nemotron-Cascade shows can deliver strong results in areas such as code and math. Teams should also explore post-training recipes like QwenLong-L1.5 to extend context windows and integrate memory agents, or FFN redesigns such as VersatileFFN to improve parameter efficiency, especially when memory is the binding constraint.
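To make the memory-agent idea concrete, here is a minimal sketch of reading a document that exceeds the context window by folding over fixed-size chunks while carrying a compact memory forward. It is not QwenLong-L1.5's implementation: `chunk_tokens`, `read_with_memory`, and `toy_update` are hypothetical names, and `toy_update` is a stand-in for what would be a model call in a real system.

```python
def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def read_with_memory(tokens, chunk_size, update_memory, memory=""):
    """Process chunks in order, carrying a compact memory between steps.

    update_memory(memory, chunk) -> new memory. In a real agent this would
    be an LLM call; here it is any callable (an assumption of this sketch).
    """
    for chunk in chunk_tokens(tokens, chunk_size):
        memory = update_memory(memory, chunk)
    return memory


def toy_update(memory, chunk):
    """Toy stand-in for the model: keep only tokens marked as facts."""
    facts = [t for t in chunk if t.startswith("fact:")]
    if not facts:
        return memory
    return (memory + " " + " ".join(facts)).strip()


# A 12-token "document" read through a 4-token window.
doc = ["filler"] * 5 + ["fact:a"] + ["filler"] * 5 + ["fact:b"]
final_memory = read_with_memory(doc, chunk_size=4, update_memory=toy_update)
print(final_memory)
```

The point of the design is that the window only ever sees one chunk plus the memory, so total document length is bounded by how well the memory update preserves what matters, not by the context size.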

Key insights

LLM reasoning tokens are externalized computational state, not transparent explanations of internal processes.

Principles

Post-training recipes enhance long-context reasoning.
Cascaded RL improves domain-specific model performance.
FFN redesigns can boost parameter efficiency.

Method

QwenLong-L1.5 uses synthetic long-context QA data, an RL setup for long sequences, and a memory-augmented agent. Nemotron-Cascade employs multi-stage SFT followed by five domain-specific RL stages using GRPO-style objectives.
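As a rough illustration of what a "GRPO-style" objective relies on, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation, so no separate value network is needed. This shows only the generic normalization step, not Nemotron-Cascade's exact objective; the function name, the `eps` constant, and the choice of population standard deviation are assumptions of this sketch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Responses that beat the group average get a positive advantage,
    the rest a negative one. eps guards against a zero-variance group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, scored 1.0 if correct and 0.0 otherwise.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason reward design and sampling balance matter in multi-stage RL.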

In practice

Use GRPO for stable long-context multi-task RL.
Run RL one domain at a time to limit catastrophic forgetting.
Redesign FFNs to trade extra compute for a smaller memory footprint.
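The compute-versus-memory trade in the last item can be shown with simple arithmetic. The sketch below compares a stack of independent FFN blocks with a single FFN applied repeatedly: per-token compute is the same, but the weight count is not. This covers depth reuse only, under assumed dimensions; it is not VersatileFFN's design, which per the summary also reuses the FFN across width and adds a difficulty-aware controller. All function names are hypothetical.

```python
def ffn_params(d_model, d_hidden):
    """Weights plus biases of a two-layer FFN: d_model -> d_hidden -> d_model."""
    return d_model * d_hidden + d_hidden + d_hidden * d_model + d_model


def ffn_macs(d_model, d_hidden):
    """Approximate multiply-accumulates per token for one FFN application."""
    return 2 * d_model * d_hidden


def stacked(d_model, d_hidden, n):
    """n independent FFN blocks: both parameters and compute grow with n."""
    return n * ffn_params(d_model, d_hidden), n * ffn_macs(d_model, d_hidden)


def shared(d_model, d_hidden, n):
    """One FFN applied n times: compute grows with n, parameters do not."""
    return ffn_params(d_model, d_hidden), n * ffn_macs(d_model, d_hidden)


# Assumed sizes for illustration only.
stacked_params, stacked_macs = stacked(1024, 4096, 4)
shared_params, shared_macs = shared(1024, 4096, 4)
print(stacked_params, shared_params, stacked_macs == shared_macs)
```

With four applications, the shared variant holds a quarter of the weights while spending the same multiply-accumulates, which is the sense in which the design favors compute over memory.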

Topics

Long-Context LLMs
Reinforcement Learning
LLM Reasoning
Parameter Efficiency
Feed-Forward Networks

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.