Open models in perpetual catch-up
Summary
The presentation details the development of Olmo 3, a new "thinking model" built to address problems such as data contamination and inconsistent evaluation in open-source reasoning models. It outlines a six-stage training process that includes pre-training, long-context extension, and a post-training recipe that relies on distillation to suit smaller models. Architectural choices such as Grouped-Query Attention (GQA), used for memory efficiency, are also discussed. The post-training recipe has three parts: Supervised Fine-Tuning (SFT) on filtered, high-quality data; Direct Preference Optimization (DPO), which draws on a "delta learning hypothesis" for strong performance gains; and Reinforcement Learning with Verifiable Rewards (RLVR). The talk also stresses the role of robust evaluation suites, covering the difficulty of balancing diverse metrics, the high variance of reasoning tasks, and the significant compute cost of comprehensive evaluation, and it argues for more granular and efficient assessment methods.
Key takeaway
For AI Scientists and Research Scientists developing or evaluating open-source reasoning models: prioritize clean base models and robust, computationally efficient evaluation suites. Look at per-task performance across diverse, high-variance reasoning benchmarks rather than relying on average scores, which can mislead. Invest in infrastructure for Reinforcement Learning with Verifiable Rewards (RLVR), and consider DPO as a high-impact, low-effort source of performance gains, especially for smaller models.
Key insights
Clean base models and rigorous evaluation are crucial for advancing reinforcement learning research in open-source reasoning models.
Principles
- Data quality and architecture are paramount for model performance.
- DPO offers significant, efficient performance gains for small models.
- Evaluation suites require continuous evolution and careful interpretation.
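To make "careful interpretation" concrete, here is a minimal sketch of one way to check whether a score difference on a high-variance reasoning task is larger than run-to-run noise. The helper names, the two-standard-error threshold, and the scores are all invented for illustration; none of them come from the talk.

```python
import math
import statistics


def summarize(scores):
    """Return (mean, standard error of the mean) for repeated runs of one task."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, sem


def clearly_different(scores_a, scores_b, z=2.0):
    """True if the gap between two models exceeds z combined standard errors.

    This is a rough rule of thumb, not a formal significance test.
    """
    mean_a, sem_a = summarize(scores_a)
    mean_b, sem_b = summarize(scores_b)
    return abs(mean_a - mean_b) > z * math.sqrt(sem_a**2 + sem_b**2)


# Hypothetical accuracies from four seeds per model on one small benchmark.
noisy_a = [0.50, 0.56, 0.44, 0.52]
noisy_b = [0.53, 0.47, 0.55, 0.49]
print(clearly_different(noisy_a, noisy_b))  # False: the gap is within noise
```

Reporting a per-task mean together with its standard error, rather than a single averaged number, keeps a noisy small benchmark from masquerading as a real gain.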
Method
Olmo 3's post-training uses a three-stage recipe: Supervised Fine-Tuning (SFT) on filtered, high-quality data; Direct Preference Optimization (DPO) on preference pairs with a large quality gap (delta) between the chosen and rejected responses; and Reinforcement Learning with Verifiable Rewards (RLVR) with active sampling and inflight weight updates.
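For the DPO stage, the standard DPO objective for a single preference pair can be sketched as follows. This is the generic published formulation, not the talk's training code; the function name, the `beta` default, and the example log-probabilities are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy favors chosen over rejected than the reference does.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)), computed stably as softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Hypothetical log-probs: the policy already prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
print(dpo_loss(-5.0, -20.0, -10.0, -10.0) < math.log(2))  # True
```

The loss only sees the pair through this relative margin, which is one way to read the emphasis on the gap between chosen and rejected responses.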
In practice
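- Use Grouped-Query Attention (GQA) to cut attention memory, chiefly the key/value cache, in large models.
- Build DPO preference pairs with a large quality gap (delta) between the chosen and rejected answers.
- During RL training, update the generating model's weights while long generations are still in flight instead of waiting for them to finish.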
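The memory argument for GQA in the list above can be checked with simple arithmetic: the key/value cache scales with the number of KV heads, so sharing each KV head across several query heads shrinks it proportionally. The sketch below uses a hypothetical configuration (32 layers, head dimension 128, a 65,536-token context, 2-byte values); these numbers are assumptions for illustration, not the model's actual settings.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the key/value cache for one forward context.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2**30

# Hypothetical model: 32 query heads. Multi-head attention keeps 32 KV heads;
# GQA with groups of 4 query heads keeps only 8.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=65536)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=65536)

print(f"MHA: {mha / GIB:.1f} GiB")  # MHA: 32.0 GiB
print(f"GQA: {gqa / GIB:.1f} GiB")  # GQA: 8.0 GiB
```

The saving grows with sequence length and batch size, which is why it matters most for long reasoning traces and large-batch generation.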
Topics
- Olmo 3 Think Model
- Reinforcement Learning
- Model Evaluation
- Grouped-Query Attention
- Direct Preference Optimization
Best for: AI Scientist, Research Scientist, AI Researcher, Machine Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Interconnects AI.