Olmo 3 and the Open LLM Renaissance

2024-03-04 · Source: Deep (Learning) Focus · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Advanced, extended

Summary

AI2 has released Olmo 3, a family of fully-open large language models (LLMs) at 7B and 32B parameter scales, intended to improve transparency and accessibility in AI research. Unlike many "open-weight" models, Olmo 3 ships all of its training artifacts, including model checkpoints, data, and code, so the full training run can be reproduced. Olmo 3 models currently lag behind top frontier models in raw performance, but they outperform other fully-open models and approach leading open-weight models in many domains. The base-model pipeline has three stages: general pretraining on the 6T-token Dolma 3 Mix, midtraining on 100B targeted tokens, and a context extension phase to ~65K tokens using YaRN. Post-training produces two specialized variants: "Think" models for complex reasoning, trained with SFT, DPO, and an enhanced GRPO-based Reinforcement Learning with Verifiable Rewards (RLVR), and "Instruct" models optimized for multi-turn chat and tool usage.
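
To make the GRPO-based RLVR step more concrete, here is a minimal sketch of the group-relative advantage used in plain GRPO: several completions are sampled for one prompt, each is scored by a programmatic verifier, and each reward is normalized against the group's mean and standard deviation. This is the textbook formulation, not Olmo 3's enhanced variant, whose modifications are not described in this summary. The function names and the exact-match verifier are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def verifiable_reward(completion: str, reference: str) -> float:
    """Hypothetical verifier: 1.0 on an exact answer match, else 0.0."""
    return 1.0 if completion.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Plain GRPO advantage: A_i = (r_i - mean(r)) / (std(r) + eps).

    `rewards` holds the scores of all completions sampled for ONE prompt.
    `eps` (an assumed small constant) avoids division by zero when every
    completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    completions = ["42", "41", "7", " 42 "]
    rewards = [verifiable_reward(c, "42") for c in completions]
    print(group_relative_advantages(rewards))
```

Correct completions get a positive advantage and incorrect ones a negative advantage, so no separate value network is needed; a group in which every completion scores the same contributes no learning signal.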

Key takeaway

For AI Engineers and Research Scientists focused on LLM development, Olmo 3's release of models, data, and code together provides a complete foundation for reproducible research. Explore its detailed training recipes and infrastructure, particularly the enhanced GRPO and the OlmoRL framework, to understand and iterate on state-of-the-art LLM training. This transparency significantly lowers the barrier to entry for contributing to open LLM research.

Key insights

Olmo 3 offers fully-open LLMs with complete training transparency, fostering reproducible AI research despite a performance gap with closed models.

Principles

Full transparency accelerates open LLM research.
Iterative data mixing and quality-aware upsampling improve pretraining.
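Hybrid-Sharded Data Parallelism (HSDP) shards model state within a group of devices and replicates it across groups, reducing cross-group communication in distributed training.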
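
As an illustration of the HSDP principle, the sketch below computes how ranks could be arranged on a two-dimensional (replicate, shard) mesh. Within a shard group, parameters, gradients, and optimizer state are sharded as in fully sharded data parallelism; across replica groups, gradients are all-reduced. The function name and the contiguous rank assignment are assumptions for illustration and are not taken from Olmo 3's code.

```python
def hsdp_layout(world_size: int, shard_size: int):
    """Return (shard_groups, replica_groups) for a hypothetical HSDP mesh.

    shard_groups:   ranks that jointly hold ONE sharded copy of the model.
    replica_groups: ranks that hold the SAME shard in different copies and
                    therefore all-reduce gradients with each other.
    """
    if world_size % shard_size != 0:
        raise ValueError("world_size must be a multiple of shard_size")
    num_replicas = world_size // shard_size
    shard_groups = [
        list(range(r * shard_size, (r + 1) * shard_size))
        for r in range(num_replicas)
    ]
    replica_groups = [
        list(range(s, world_size, shard_size)) for s in range(shard_size)
    ]
    return shard_groups, replica_groups


if __name__ == "__main__":
    shards, replicas = hsdp_layout(world_size=8, shard_size=4)
    print("shard groups:  ", shards)
    print("replica groups:", replicas)
```

Choosing `shard_size` to match the devices that share a fast interconnect (for example, one node) keeps the heavy all-gather and reduce-scatter traffic local, leaving only the gradient all-reduce to cross the slower links.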

Method

Olmo 3's training has three base-model stages (general pretraining, midtraining, and long-context extension), followed by sequential post-training (SFT, then DPO, then RLVR) to create the specialized Instruct and Think models. Both phases rely on careful data curation and optimized distributed training.
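
The pipeline can be written down as a small configuration sketch. The token counts are the ones quoted in this summary; the token budget for the context extension stage is not stated here, so it is left as `None` rather than guessed. The class and field names are hypothetical and do not correspond to Olmo-Core's actual configuration schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens: Optional[float]  # None where this summary gives no figure
    note: str


# Base-model stages, in order, as described in the summary.
BASE_STAGES = (
    Stage("pretraining", 6e12, "Dolma 3 Mix"),
    Stage("midtraining", 100e9, "targeted tokens"),
    Stage("context_extension", None, "extend to ~65K tokens with YaRN"),
)

# Post-training steps are applied sequentially to the base model.
POST_TRAINING = ("SFT", "DPO", "RLVR")


def known_token_budget(stages) -> float:
    """Sum the token counts that are actually stated."""
    return sum(s.tokens for s in stages if s.tokens is not None)


if __name__ == "__main__":
    for s in BASE_STAGES:
        budget = "unspecified" if s.tokens is None else f"{s.tokens:.1e} tokens"
        print(f"{s.name:18s} {budget:22s} {s.note}")
    print("post-training:", " -> ".join(POST_TRAINING))
```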

In practice

Use Olmo-Core for optimized supervised training code.
Employ model merging to combine checkpoints for improved performance.
Apply document packing to optimize long context training efficiency.
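
Two of the practices above, model merging and document packing, can be sketched in a few lines. The merge shown is uniform weight averaging over checkpoints, and the packer is a greedy first-fit-decreasing bin packer; both are generic techniques chosen for illustration, and the summary does not say which specific merging or packing algorithm Olmo 3 uses. Checkpoints are represented as plain dictionaries of float lists so the example has no dependencies.

```python
def average_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints with identical keys."""
    n = len(checkpoints)
    merged = {}
    for key in checkpoints[0]:
        columns = zip(*(ckpt[key] for ckpt in checkpoints))
        merged[key] = [sum(values) / n for values in columns]
    return merged


def pack_documents(doc_lengths, max_len):
    """Pack documents into sequences of at most `max_len` tokens.

    Greedy first-fit-decreasing: visit documents from longest to shortest
    and place each into the first sequence with enough room left.
    Returns a list of sequences, each a list of document indices.
    """
    order = sorted(range(len(doc_lengths)),
                   key=lambda i: doc_lengths[i], reverse=True)
    bins = []  # each entry: [remaining_capacity, [doc indices]]
    for i in order:
        n = doc_lengths[i]
        if n > max_len:
            raise ValueError(f"document {i} exceeds max_len; split it first")
        for b in bins:
            if b[0] >= n:
                b[0] -= n
                b[1].append(i)
                break
        else:
            bins.append([max_len - n, [i]])
    return [indices for _, indices in bins]


if __name__ == "__main__":
    print(average_checkpoints([{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]))
    print(pack_documents([6, 3, 5, 2, 4], max_len=10))
```

Packing reduces the padding wasted when short documents are trained at a long sequence length. In a real training loop the attention mask would typically also need to stop tokens from attending across document boundaries within a packed sequence.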

Topics

Olmo 3
Fully-Open LLMs
Reinforcement Learning
Distributed Training
Data Curation

Best for: Research Scientist, AI Engineer, Machine Learning Engineer, AI Researcher, AI Scientist, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Deep (Learning) Focus.