Boost Training Goodput: How Continuous Checkpointing Optimizes Reliability in Orbax and MaxText
Summary
The continuous checkpointing feature in Orbax and MaxText, released on March 31, 2026, balances training job reliability against performance. Conventional fixed-frequency checkpointing forces a trade-off: infrequent saves waste resources when a failure discards the work done since the last checkpoint, while overly frequent saves become a performance bottleneck. Continuous checkpointing instead starts a new asynchronous save only after the previous one completes. This keeps host machines and I/O bandwidth well utilized and limits the work exposed to hardware failures, with minimal performance degradation. In benchmarks with a llama-3.1-70B model on a v5p-128 cluster, continuous checkpointing significantly reduced the P50 checkpoint interval, which translates into substantial resource preservation. The efficiency gains are amplified in large-scale training, where model files are more fragmented and Mean Time Between Failures (MTBF) shrinks as the cluster grows.
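The link between checkpoint interval, MTBF, and wasted work can be made concrete with a first-order estimate. The sketch below is not from the original article. It assumes independent node failures (so cluster MTBF is roughly node MTBF divided by node count), failures that land uniformly within a checkpoint interval (so about half an interval is lost per failure), and it ignores restart time. The function names and all numbers are hypothetical.

```python
def cluster_mtbf_hours(node_mtbf_hours: float, num_nodes: int) -> float:
    """Approximate cluster MTBF, assuming independent node failures."""
    return node_mtbf_hours / num_nodes


def expected_lost_hours(checkpoint_interval_hours: float,
                        run_hours: float,
                        mtbf_hours: float) -> float:
    """Estimate training time lost to failures over a whole run.

    Assumes each failure discards, on average, half a checkpoint interval.
    """
    expected_failures = run_hours / mtbf_hours
    return expected_failures * checkpoint_interval_hours / 2


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    mtbf = cluster_mtbf_hours(node_mtbf_hours=1000.0, num_nodes=100)
    for interval in (1.0, 0.25):
        lost = expected_lost_hours(interval, run_hours=100.0, mtbf_hours=mtbf)
        print(f"interval={interval}h -> about {lost}h of lost work")
```

Under these assumptions, lost work scales linearly with the checkpoint interval and with the node count, which is why shorter intervals matter more on larger clusters.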
Key takeaway
For MLOps engineers managing large-scale model training, continuous checkpointing in Orbax and MaxText improves both job reliability and resource efficiency. The feature adjusts checkpoint frequency dynamically, reducing the training progress lost to failures without blocking training. Configure your training jobs to enable continuous checkpointing, and co-locate storage buckets with your training clusters to maximize network bandwidth and overall efficacy.
Key insights
Continuous checkpointing in Orbax and MaxText dynamically optimizes training reliability and performance by adapting save frequency.
Principles
- Asynchronous saves avoid blocking training.
- Dynamic frequency adapts to system state.
- Co-locate storage with training clusters.
Method
Orbax initiates an asynchronous checkpoint save only upon the successful completion of the preceding save operation, dynamically adjusting frequency to maximize I/O and host utilization while minimizing performance impact.
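The method described above can be illustrated with a toy scheduler. This is a minimal sketch using only the Python standard library, not the Orbax API: `ContinuousCheckpointer`, `maybe_save`, and `slow_save` are hypothetical names, the save is a `time.sleep`, and the sketch only checks that the previous save has finished, not that it succeeded.

```python
import threading
import time


class ContinuousCheckpointer:
    """Toy stand-in for an async checkpointer (hypothetical, not Orbax)."""

    def __init__(self, save_fn):
        self._save_fn = save_fn   # blocking function that writes a checkpoint
        self._thread = None       # background thread of the in-flight save
        self.saved_steps = []     # steps at which a save was started

    def maybe_save(self, step, state):
        """Start a background save unless the previous one is still running."""
        if self._thread is not None and self._thread.is_alive():
            return False          # previous save in flight: skip this step
        self._thread = threading.Thread(target=self._save_fn, args=(step, state))
        self._thread.start()
        self.saved_steps.append(step)
        return True

    def wait(self):
        """Block until the in-flight save, if any, has finished."""
        if self._thread is not None:
            self._thread.join()


def slow_save(step, state):
    time.sleep(0.05)              # pretend to write to remote storage


if __name__ == "__main__":
    ckpt = ContinuousCheckpointer(slow_save)
    for step in range(20):
        time.sleep(0.01)          # pretend to run one training step
        ckpt.maybe_save(step, {"step": step})
    ckpt.wait()
    print("saves started at steps:", ckpt.saved_steps)
```

The training loop never waits on a save. The effective checkpoint interval falls out of how long each save takes, rather than being fixed in advance.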
In practice
- Set `enable_continuous_checkpointing: True` in the MaxText configuration.
- Set `minimum_interval_secs` to cap save frequency for lightweight models.
- Implement custom policies for complex save logic.
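To show how a minimum interval interacts with the "previous save finished" rule, here is a small, self-contained sketch of a save decision. It is an illustration only: `MinIntervalContinuousPolicy` and `should_save` are hypothetical names, not the Orbax classes. Measuring the interval from the start of the last save is also an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MinIntervalContinuousPolicy:
    """Hypothetical policy: save continuously, but not more often than a floor."""

    minimum_interval_secs: Optional[float] = None

    def should_save(self,
                    now: float,
                    last_save_started_at: Optional[float],
                    previous_save_in_progress: bool) -> bool:
        if previous_save_in_progress:
            return False   # never overlap saves
        if last_save_started_at is None:
            return True    # nothing saved yet
        if self.minimum_interval_secs is None:
            return True    # no floor: save as soon as the last one is done
        # Assumption: the interval is measured from the start of the last save.
        return (now - last_save_started_at) >= self.minimum_interval_secs


if __name__ == "__main__":
    policy = MinIntervalContinuousPolicy(minimum_interval_secs=60)
    print(policy.should_save(now=30.0, last_save_started_at=0.0,
                             previous_save_in_progress=False))
```

For a small model whose saves finish in a few seconds, a floor like this keeps the job from writing checkpoints far more often than is useful.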
Topics
- Continuous Checkpointing
- Orbax Framework
- MaxText
- Model Training Reliability
- Resource Optimization
Best for: Machine Learning Engineer, MLOps Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.