Making LLMs Think Longer: Context, State, and Post-Training Tricks
Summary
This week's review covers four recent papers on large language models (LLMs). QwenLong-L1.5 introduces a post-training recipe for Qwen3-30B-A3B-Thinking that lets it handle 256K-token contexts and scale to 1M–4M tokens via an external memory agent, with reported average gains of about 10 points on long-context benchmarks and parity with proprietary models like GPT-5. "State over Tokens" (SoT) recharacterizes LLM reasoning tokens as externalized computational state rather than transparent accounts of internal computation. Nemotron-Cascade proposes a cascaded, domain-wise reinforcement learning (RL) approach for mid-sized models (Qwen3 8B/14B) that surpasses a 671B teacher model on code benchmarks. Finally, VersatileFFN redesigns Transformer feed-forward network (FFN) blocks for parameter efficiency by reusing a single FFN across width and depth under a difficulty-aware controller. In experiments with OLMo2-style 354M and 720M models, it outperforms parameter-matched baselines on zero-shot benchmarks.
Key takeaway
AI engineers training LLMs for specific domains should consider cascaded, domain-wise reinforcement learning, which Nemotron-Cascade shows can improve performance in areas like code and math. Teams that need longer context windows can explore post-training recipes like QwenLong-L1.5 and pair them with a memory agent. When memory is the main constraint, FFN redesigns such as VersatileFFN are worth investigating for parameter efficiency.
Key insights
LLM reasoning tokens are externalized computational state, not transparent explanations of internal processes.
Principles
- Post-training recipes enhance long-context reasoning.
- Cascaded RL improves domain-specific model performance.
- FFN redesigns can boost parameter efficiency.
Method
QwenLong-L1.5 combines three components: synthetic long-context QA data, an RL setup for long sequences, and a memory-augmented agent. Nemotron-Cascade employs multi-stage supervised fine-tuning (SFT) followed by five domain-specific RL stages that use GRPO-style (Group Relative Policy Optimization) objectives.
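The summary does not say how the QwenLong-L1.5 memory agent works internally. As a rough illustration of the general idea of reading past the context window, the sketch below assumes a chunk-by-chunk loop that carries a running memory forward. The names `chunk_tokens`, `read_with_memory`, and `keep_marked` are hypothetical, and `keep_marked` is a toy stand-in for what would be a model call in practice.

```python
from typing import Callable, List


def chunk_tokens(tokens: List[str], window: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `window` tokens."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]


def read_with_memory(
    tokens: List[str],
    window: int,
    update_memory: Callable[[str, List[str]], str],
    memory: str = "",
) -> str:
    """Process an over-long input one chunk at a time.

    Assumption: the agent only ever sees (memory, current chunk), so the
    effective input length is bounded by the window, not the document.
    """
    for chunk in chunk_tokens(tokens, window):
        memory = update_memory(memory, chunk)
    return memory


def keep_marked(memory: str, chunk: List[str]) -> str:
    """Toy stand-in for the model: remember only tokens that start with '#'."""
    found = [t for t in chunk if t.startswith("#")]
    return " ".join(filter(None, [memory, *found]))


# Usage: a 7-token "document" read through a 3-token window.
doc = ["a", "#x", "b", "c", "#y", "d", "e"]
summary = read_with_memory(doc, window=3, update_memory=keep_marked)
# summary == "#x #y"
```

In a real system the update step would be an LLM prompt that rewrites the memory given the new chunk, and the final answer would be generated from the memory alone. Whether QwenLong-L1.5 follows this exact pattern is not stated in the summary.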
In practice
- Use GRPO for stable long-context multi-task RL.
- Implement domain-wise RL to prevent catastrophic forgetting.
- Redesign FFNs for compute-over-memory efficiency.
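For readers unfamiliar with GRPO (Group Relative Policy Optimization), its core step is to score each sampled response relative to the other responses for the same prompt, instead of training a separate value model. The sketch below shows only that advantage computation. It is not the full objective used by either paper, and the function name, the use of population standard deviation, and the `eps` value are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)

    Assumption: population standard deviation is used; some implementations
    use the sample standard deviation instead.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: four sampled answers to one prompt, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Here the correct answers receive advantages close to +1 and the incorrect ones close to -1. If every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.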
Topics
- Long-Context LLMs
- Reinforcement Learning
- LLM Reasoning
- Parameter Efficiency
- Feed-Forward Networks
Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by The Salt - Curated AI.