Open model, open metrics: How Lambda and the Olmo team trained Olmo Hybrid

Source: The Lambda Deep Learning Blog · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure) · Depth: Advanced, long

Summary

Lambda and the Allen Institute for Artificial Intelligence (Ai2) collaborated to train Olmo Hybrid 7B, an open-source, 7-billion-parameter language model, on 3 trillion tokens in 7 days. The run used 512 NVIDIA Blackwell GPUs (64 NVIDIA HGX B200 systems) on Lambda's Superintelligence Cloud. Olmo Hybrid, a hybrid linear RNN–transformer model, outperforms its predecessor, Olmo 3 7B, on benchmarks including MedQA MC (+7.1), MBPP (+6.7), and MMLU STEM (+4.5). The project emphasized infrastructure reliability, reaching 97% active training time with a median recovery time under 4 minutes. The training stack, code, logs, and model weights were all released openly for reproducibility.
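
The figures above imply a rough sustained throughput, which can be worked out directly. The sketch below is back-of-the-envelope arithmetic only. It assumes that the 7 days is total wall-clock time and that the 97% active-training figure applies to that same window; the summary does not state either. The function name `training_throughput` is made up for this illustration.

```python
SECONDS_PER_DAY = 86_400


def training_throughput(tokens, gpus, wall_days, active_fraction=1.0):
    """Return (cluster tokens/s, per-GPU tokens/s) over the active training time."""
    active_seconds = wall_days * SECONDS_PER_DAY * active_fraction
    cluster_rate = tokens / active_seconds
    return cluster_rate, cluster_rate / gpus


# Figures taken from the summary: 64 HGX B200 systems with 8 GPUs each.
gpus = 64 * 8
tokens = 3e12
wall_days = 7

# Average over the whole 7-day window.
wall_cluster, wall_per_gpu = training_throughput(tokens, gpus, wall_days)

# Average over the active 97% only (assumption: 97% refers to the same window).
active_cluster, active_per_gpu = training_throughput(tokens, gpus, wall_days, 0.97)

# Time not spent training under the same assumption.
downtime_hours = wall_days * 24 * (1 - 0.97)

print(f"GPUs: {gpus}")
print(f"wall-clock: {wall_cluster:,.0f} tokens/s total, {wall_per_gpu:,.0f} per GPU")
print(f"active-only: {active_cluster:,.0f} tokens/s total, {active_per_gpu:,.0f} per GPU")
print(f"implied non-training time: {downtime_hours:.2f} hours")
```

Under these assumptions the run averages roughly 5 million tokens per second across the cluster, or a little under 10,000 tokens per second per GPU, with about five hours of the week spent outside active training.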

Key takeaway

For MLOps engineers running large-scale foundation model training, this case study shows that high uptime depends on reliable infrastructure and automated recovery. Integrating pre-flight GPU health checks and asynchronous checkpointing into SLURM-based workflows keeps active training time high and limits the work lost when hardware fails, so long-running jobs can finish reliably on a tight schedule.
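
A pre-flight GPU health check of the kind mentioned above might look like the following minimal sketch. It is not the check Lambda or Ai2 used; the summary does not describe theirs. The sketch queries `nvidia-smi` for GPU index, temperature, and uncorrected ECC error count, then flags anything suspicious. The threshold `MAX_TEMP_C`, the expected GPU count, and the function names are assumptions chosen for illustration.

```python
import subprocess
import sys

# Fields requested from nvidia-smi's CSV query interface.
QUERY_FIELDS = "index,temperature.gpu,ecc.errors.uncorrected.volatile.total"
MAX_TEMP_C = 85  # hypothetical threshold, tune per hardware


def query_gpus():
    """Run nvidia-smi on this node and return its CSV output as text."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return result.stdout


def find_unhealthy(csv_text, expected_gpus=8):
    """Return a list of human-readable problems found in the CSV output."""
    problems = []
    rows = [line for line in csv_text.strip().splitlines() if line.strip()]
    if len(rows) != expected_gpus:
        problems.append(f"expected {expected_gpus} GPUs, found {len(rows)}")
    for row in rows:
        fields = [field.strip() for field in row.split(",")]
        if len(fields) != 3:
            problems.append(f"unparseable row: {row!r}")
            continue
        index, temp, ecc = fields
        # Non-numeric values such as "[N/A]" are skipped rather than flagged.
        if temp.isdigit() and int(temp) > MAX_TEMP_C:
            problems.append(f"GPU {index}: temperature {temp} C exceeds {MAX_TEMP_C} C")
        if ecc.isdigit() and int(ecc) > 0:
            problems.append(f"GPU {index}: {ecc} uncorrected ECC errors")
    return problems


def main():
    """Exit code 0 if the node looks healthy, 1 otherwise."""
    issues = find_unhealthy(query_gpus())
    for issue in issues:
        print(issue, file=sys.stderr)
    return 1 if issues else 0
```

One way to use this would be to call `sys.exit(main())` at the start of a SLURM batch script on each node, so that a non-zero exit stops the job before training begins. How the original project wired its checks into SLURM is not stated in this summary.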

Key insights

Hybrid RNN-transformer architectures can enhance model expressivity and performance, particularly in structured reasoning tasks.

Method

Training used Hybrid Sharded Data Parallelism (HSDP) with bfloat16 parameters, FP32 reductions, FlashAttention v2, cosine learning rate scheduling, and asynchronous checkpointing, launched via SLURM.
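
Of the components listed, cosine learning rate scheduling is the easiest to show in isolation. The sketch below implements a generic cosine decay with an optional linear warmup. The warmup phase and all numeric values (peak rate, step counts, floor) are assumptions for illustration and are not the Olmo Hybrid hyperparameters, which this summary does not give.

```python
import math


def cosine_lr(step, total_steps, peak_lr, min_lr=0.0, warmup_steps=0):
    """Learning rate at `step` for cosine decay with optional linear warmup."""
    if step < warmup_steps:
        # Linear ramp from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr (progress = 0) down to min_lr (progress = 1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Hypothetical settings, chosen only to make the shape visible.
TOTAL_STEPS = 1_000
WARMUP_STEPS = 100
PEAK_LR = 3e-4
MIN_LR = 3e-5

for step in (0, 50, 100, 550, 1_000):
    lr = cosine_lr(step, TOTAL_STEPS, PEAK_LR, MIN_LR, WARMUP_STEPS)
    print(f"step {step:>5}: lr = {lr:.2e}")
```

The rate rises linearly during warmup, peaks at the end of warmup, passes the midpoint between peak and floor halfway through the decay phase, and ends at the floor.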

Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by The Lambda Deep Learning Blog.