Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows
Summary
TGS, a geoscience data provider, partnered with the AWS Generative AI Innovation Center to optimize the training infrastructure for its Vision Transformer-based Seismic Foundation Models (SFMs). The collaboration targeted three challenges: handling large-scale 3D seismic data, improving training efficiency, and expanding analytical capabilities. The solution used Amazon SageMaker HyperPod for resilient, scalable training, streamed terabytes of data directly from Amazon S3, and employed DeepSpeed ZeRO-2 for distributed training. These optimizations reduced SFM training time from 6 months to 5 days, a 36-fold speedup, with near-linear scaling across 16 Amazon EC2 P5 instances, each with 8 NVIDIA H200 GPUs. In addition, context parallelism increased the maximum input size from 640 × 640 × 1,024 voxels to 1,536 × 1,536 × 2,048 voxels, reported as a 4.5x increase in volume, enabling analysis of broader geological patterns.
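The headline figures can be checked with a few lines of arithmetic. The sketch below assumes 30-day months and uses only numbers quoted in the summary; the variable names are illustrative. Note that the raw voxel-count ratio between the two input sizes works out to roughly 11.5, so the 4.5x figure presumably measures something other than total voxel count; the summary does not say what.

```python
# Figures quoted in the summary; the 30-day month is an assumption.
DAYS_BEFORE = 6 * 30          # "6 months"
DAYS_AFTER = 5                # "5 days"
INSTANCES = 16
GPUS_PER_INSTANCE = 8

OLD_SHAPE = (640, 640, 1024)      # voxels per axis before context parallelism
NEW_SHAPE = (1536, 1536, 2048)    # voxels per axis after


def voxel_count(shape):
    """Total number of voxels in a 3D volume of the given shape."""
    x, y, z = shape
    return x * y * z


speedup = DAYS_BEFORE / DAYS_AFTER
total_gpus = INSTANCES * GPUS_PER_INSTANCE
voxel_ratio = voxel_count(NEW_SHAPE) / voxel_count(OLD_SHAPE)

if __name__ == "__main__":
    print(f"speedup: {speedup:.0f}x on {total_gpus} GPUs")
    print(f"raw voxel-count ratio: {voxel_ratio:.2f}x")
```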
Key takeaway
For MLOps engineers managing large-scale Vision Transformer training on AWS: consider adopting SageMaker HyperPod with direct Amazon S3 streaming and DeepSpeed ZeRO-2. For TGS, this combination cut training time from 6 months to 5 days and made it possible to process significantly larger 3D inputs. Evaluate context parallelism as well if your models need to capture features that span a wider volume than a single GPU can hold.
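As a starting point for evaluating ZeRO-2, a DeepSpeed configuration might look like the sketch below. The keys are standard DeepSpeed configuration options, but every value here (batch size, accumulation steps, bf16) is an illustrative assumption, not the setting TGS used, and `build_zero2_config` is a hypothetical helper name.

```python
import json


def build_zero2_config(micro_batch_per_gpu=1, grad_accum_steps=1):
    """Return a minimal DeepSpeed config dict that enables ZeRO stage 2.

    All values are placeholders to tune for your own model and cluster.
    """
    return {
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        "gradient_accumulation_steps": grad_accum_steps,
        "bf16": {"enabled": True},          # assumed mixed-precision choice
        "zero_optimization": {
            "stage": 2,                     # partition optimizer states and gradients
            "overlap_comm": True,           # overlap gradient reduction with backward pass
            "contiguous_gradients": True,   # reduce memory fragmentation
            "reduce_scatter": True,
        },
    }


if __name__ == "__main__":
    # Write this JSON to a file and pass it to the DeepSpeed launcher.
    print(json.dumps(build_zero2_config(), indent=2))
```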
Key insights
Optimizing the distributed training infrastructure with SageMaker HyperPod and direct S3 streaming cut training of a large 3D Vision Transformer model from 6 months to 5 days.
Principles
- Data pipeline optimization is critical for data-intensive workloads.
- Scaling systematically from a single node to a multi-node cluster keeps costs under control.
- The choice of distributed training framework trades off memory efficiency against communication overhead.
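The memory side of that trade-off can be estimated with the per-GPU model-state formulas from the ZeRO paper, assuming mixed-precision Adam (2 bytes per parameter for weights, 2 for gradients, 12 for optimizer states). The sketch below ignores activations and buffers, and the model size in the demo is a hypothetical value, not the size of the TGS model.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate per-GPU bytes for weights, gradients, and optimizer states.

    Assumes mixed-precision Adam; activations and temporary buffers are excluded.
    """
    params = 2 * num_params    # fp16/bf16 weights
    grads = 2 * num_params     # fp16/bf16 gradients
    optim = 12 * num_params    # fp32 weights, momentum, variance
    if stage == 0:             # plain data parallelism: everything replicated
        return params + grads + optim
    if stage == 1:             # optimizer states partitioned
        return params + grads + optim / num_gpus
    if stage == 2:             # optimizer states and gradients partitioned
        return params + (grads + optim) / num_gpus
    if stage == 3:             # weights partitioned too
        return (params + grads + optim) / num_gpus
    raise ValueError(f"unknown ZeRO stage: {stage}")


if __name__ == "__main__":
    hypothetical_params = 1_000_000_000   # illustrative model size
    for stage in range(4):
        gb = model_state_bytes_per_gpu(hypothetical_params, 128, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.2f} GB per GPU")
```

According to the same paper, stages 1 and 2 keep the communication volume of plain data parallelism, while stage 3 raises it by about 1.5x. That is one plausible reason to prefer ZeRO-2 when the model states fit in memory.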
Method
The method has three parts: an efficient data pipeline that streams from Amazon S3, distributed training with DeepSpeed ZeRO-2 on SageMaker HyperPod, and larger context windows achieved through ring attention and dynamic mask ratio adjustment.
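To illustrate the idea behind ring attention, the sketch below simulates it serially in one process: each "device" keeps its own block of queries while key/value blocks rotate around the ring, and an online softmax merges the partial results. This is a generic, non-causal, single-head illustration in pure Python, not the TGS implementation; all function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q_rows, k_rows, v_rows):
    """Reference: ordinary softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q_rows[0]))
    out = []
    for q in q_rows:
        scores = [dot(q, k) * scale for k in k_rows]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, v_rows)) / total
                    for c in range(len(v_rows[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulated ring attention: q_blocks[i] stays on device i, K/V blocks rotate."""
    n = len(q_blocks)
    dim_v = len(v_blocks[0][0])
    scale = 1.0 / math.sqrt(len(q_blocks[0][0]))
    out = []
    for i, q_block in enumerate(q_blocks):
        # Per-query running state for the online softmax.
        run_max = [float("-inf")] * len(q_block)
        run_sum = [0.0] * len(q_block)
        acc = [[0.0] * dim_v for _ in q_block]
        for step in range(n):
            src = (i - step) % n   # K/V block that has reached device i at this step
            for r, q in enumerate(q_block):
                scores = [dot(q, k) * scale for k in k_blocks[src]]
                new_max = max(run_max[r], max(scores))
                rescale = math.exp(run_max[r] - new_max)   # shrink earlier partial sums
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[r] = run_sum[r] * rescale + sum(weights)
                acc[r] = [a * rescale
                          + sum(w * v[c] for w, v in zip(weights, v_blocks[src]))
                          for c, a in enumerate(acc[r])]
                run_max[r] = new_max
        out.extend([a / run_sum[r] for a in acc[r]] for r in range(len(q_block)))
    return out
```

Because no device ever holds more than one K/V block at a time, the sequence length that fits in memory grows with the number of devices, which is what makes larger input volumes feasible.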
In practice
- Stream large datasets directly from Amazon S3 for cost-effective scaling.
- Evaluate ZeRO-2 for efficient distributed training of large models.
- Implement context parallelism for expanded 3D volumetric analysis.
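For the first item, one common way to stream a large object without downloading it in full is to request it in byte ranges. The sketch below only plans the ranges and leaves the actual fetch to a caller-supplied function; `plan_byte_ranges`, `stream_object`, and `fetch_range` are hypothetical names, and the summary does not describe how TGS implemented its streaming.

```python
def plan_byte_ranges(object_size, chunk_size):
    """Split an object of object_size bytes into HTTP Range header values."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size must be > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1   # Range ends are inclusive
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


def stream_object(fetch_range, object_size, chunk_size):
    """Yield the object chunk by chunk.

    fetch_range is a caller-supplied callable that takes a Range header
    value and returns the corresponding bytes.
    """
    for header in plan_byte_ranges(object_size, chunk_size):
        yield fetch_range(header)
```

With boto3, `fetch_range` could plausibly wrap `s3.get_object(Bucket=..., Key=..., Range=header)["Body"].read()`, so that each training worker reads only the slices it needs.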
Topics
- Seismic Foundation Models
- Amazon SageMaker HyperPod
- Distributed Training
- Context Parallelism
- Amazon S3 Streaming
Best for: Machine Learning Engineer, MLOps Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.