Scaling seismic foundation models on AWS: Distributed training with Amazon SageMaker HyperPod and expanding context windows

2026-04-02 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure · Depth: Advanced, long

Summary

TGS, a geoscience data provider, partnered with the AWS Generative AI Innovation Center to optimize the training infrastructure for its Vision Transformer-based Seismic Foundation Models (SFMs). The collaboration addressed three challenges: handling large-scale 3D seismic data, improving training efficiency, and expanding the models' analytical capabilities. The solution used Amazon SageMaker HyperPod for resilient, scalable training, streamed terabytes of data directly from Amazon S3, and employed DeepSpeed ZeRO-2 for distributed training. These optimizations reduced SFM training time from 6 months to 5 days, roughly a 36-fold speedup, with near-linear scaling across 16 Amazon EC2 P5 instances, each with 8 NVIDIA H200 GPUs. Context parallelism also raised the maximum input size from 640 × 640 × 1,024 voxels to 1,536 × 1,536 × 2,048 voxels, reported as a 4.5x volume increase (the stated dimensions work out to about 11.5x in raw voxel count), enabling broader geological pattern analysis.

Key takeaway

MLOps engineers managing large-scale Vision Transformer training on AWS should consider SageMaker HyperPod with direct Amazon S3 streaming and DeepSpeed ZeRO-2. In the TGS case, this combination cut training time from 6 months to 5 days. Context parallelism is also worth evaluating: it raised the maximum 3D input the model could process, so the model can capture geological features that span larger volumes.

Key insights

Optimizing distributed training infrastructure with SageMaker HyperPod and direct S3 streaming cut training time for a large 3D Vision Transformer model from 6 months to 5 days.

Principles

Data pipeline optimization is critical for data-intensive workloads.
Scaling systematically from a single node to a multi-node cluster keeps costs under control.
The choice of distributed training framework trades memory efficiency against communication overhead.
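
To make the data pipeline principle concrete, here is a minimal sketch of reading one slab of a 3D volume from Amazon S3 with a ranged GET instead of downloading the whole object. It assumes the volume is stored as a raw, C-ordered array of float32 samples, which the source does not state; the function names, bucket, and key are hypothetical. Only `get_object` with its `Range` parameter is a real boto3 call.

```python
from typing import Tuple

BYTES_PER_SAMPLE = 4  # assumption: float32 samples, not confirmed by the source


def slab_byte_range(shape: Tuple[int, int, int], first: int, count: int) -> Tuple[int, int]:
    """Return inclusive (start, end) byte offsets for `count` consecutive
    slices along axis 0 of a C-ordered volume, starting at slice `first`."""
    n0, n1, n2 = shape
    if first < 0 or count <= 0 or first + count > n0:
        raise ValueError("slab lies outside the volume")
    slice_bytes = n1 * n2 * BYTES_PER_SAMPLE
    start = first * slice_bytes
    end = start + count * slice_bytes - 1  # HTTP byte ranges are inclusive
    return start, end


def range_header(start: int, end: int) -> str:
    """Format the offsets as an HTTP Range header value."""
    return f"bytes={start}-{end}"


def fetch_slab(bucket: str, key: str, shape: Tuple[int, int, int],
               first: int, count: int) -> bytes:
    """Fetch one slab from S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    start, end = slab_byte_range(shape, first, count)
    s3 = boto3.client("s3")
    response = s3.get_object(Bucket=bucket, Key=key, Range=range_header(start, end))
    return response["Body"].read()
```

Under these assumptions, a slab along the slowest-varying axis is contiguous in the object, so each training sample needs one request. A production loader would typically issue many such requests in parallel to keep the GPUs fed.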

Method

The method has three parts: an efficient data pipeline that streams from Amazon S3, distributed training with DeepSpeed ZeRO-2 on SageMaker HyperPod, and larger context windows through ring attention and dynamic mask ratio adjustment.
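
As an illustration of the ZeRO-2 part, the sketch below builds a DeepSpeed configuration as a Python dictionary. ZeRO stage 2 partitions optimizer state and gradients across data-parallel ranks. The node and GPU counts come from the summary above; the micro-batch size, gradient accumulation steps, bf16 setting, and bucket sizes are illustrative assumptions, not the values TGS used.

```python
import json

NODES = 16          # from the summary: 16 instances
GPUS_PER_NODE = 8   # from the summary: 8 GPUs per instance
MICRO_BATCH = 1     # assumption: per-GPU micro-batch size
GRAD_ACCUM = 4      # assumption: gradient accumulation steps

world_size = NODES * GPUS_PER_NODE

ds_config = {
    "train_micro_batch_size_per_gpu": MICRO_BATCH,
    "gradient_accumulation_steps": GRAD_ACCUM,
    # DeepSpeed expects this to equal micro batch * accumulation * world size.
    "train_batch_size": MICRO_BATCH * GRAD_ACCUM * world_size,
    "bf16": {"enabled": True},  # assumption: bf16 mixed precision
    "zero_optimization": {
        "stage": 2,                            # shard optimizer state and gradients
        "overlap_comm": True,                  # overlap gradient reduction with backward
        "contiguous_gradients": True,          # reduce memory fragmentation
        "reduce_scatter": True,
        "reduce_bucket_size": 500_000_000,     # assumption: illustrative bucket size
        "allgather_bucket_size": 500_000_000,  # assumption: illustrative bucket size
    },
}

# Serialized form, suitable for saving as a JSON config file.
ds_config_json = json.dumps(ds_config, indent=2)
```

The dictionary can be written to a JSON file or passed through the `config` argument of `deepspeed.initialize`. Because stage 2 leaves model parameters replicated, it adds less communication than stage 3, which is the memory-versus-communication trade-off noted under Principles.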

In practice

Stream large datasets directly from Amazon S3 for cost-effective scaling.
Evaluate ZeRO-2 for efficient distributed training of large models.
Implement context parallelism for expanded 3D volumetric analysis.
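
The context parallelism item relies on ring attention, which can be sketched in a few lines. The simulation below runs in a single process: each simulated device keeps its query block fixed while key/value blocks rotate around the ring, and a running softmax combines the partial results. It is a pure-Python illustration of the general technique, not the implementation used in the source, and every function name here is hypothetical.

```python
import math
import random


def full_attention(q, k, v):
    """Reference softmax attention over the whole sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v)) / total
                    for d in range(len(v[0]))])
    return out


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate ring attention; returns one output block per device."""
    n = len(q_blocks)
    dim_v = len(v_blocks[0][0])
    scale = 1.0 / math.sqrt(len(q_blocks[0][0]))
    # Running max, normalizer, and unnormalized output for every query.
    run_max = [[-math.inf] * len(qb) for qb in q_blocks]
    run_sum = [[0.0] * len(qb) for qb in q_blocks]
    run_out = [[[0.0] * dim_v for _ in qb] for qb in q_blocks]

    for step in range(n):
        for dev in range(n):
            src = (dev + step) % n  # key/value block held by `dev` at this step
            for i, qi in enumerate(q_blocks[dev]):
                scores = [scale * sum(a * b for a, b in zip(qi, kj))
                          for kj in k_blocks[src]]
                new_max = max(run_max[dev][i], max(scores))
                rescale = math.exp(run_max[dev][i] - new_max)  # 0.0 on first step
                weights = [math.exp(s - new_max) for s in scores]
                run_sum[dev][i] = run_sum[dev][i] * rescale + sum(weights)
                for d in range(dim_v):
                    partial = sum(w * vj[d] for w, vj in zip(weights, v_blocks[src]))
                    run_out[dev][i][d] = run_out[dev][i][d] * rescale + partial
                run_max[dev][i] = new_max

    return [[[x / run_sum[dev][i] for x in run_out[dev][i]]
             for i in range(len(q_blocks[dev]))] for dev in range(n)]


def split(seq, n):
    """Split a sequence into n equal contiguous blocks."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]


def max_abs_difference(seq_len=8, dim=4, n_devices=4, seed=0):
    """Largest gap between ring attention and full attention on random data."""
    rng = random.Random(seed)
    q, k, v = ([[rng.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
               for _ in range(3))
    ring = ring_attention(split(q, n_devices), split(k, n_devices), split(v, n_devices))
    flat = [row for block in ring for row in block]
    ref = full_attention(q, k, v)
    return max(abs(a - b) for ra, rb in zip(flat, ref) for a, b in zip(ra, rb))
```

Calling `max_abs_difference()` compares the two paths on random data; they should agree to floating-point precision. No device ever holds the full key/value sequence, which is what lets the input volume grow beyond a single GPU's memory. Real implementations also overlap the block transfer with the attention computation.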

Topics

Seismic Foundation Models
Amazon SageMaker HyperPod
Distributed Training
Context Parallelism
Amazon S3 Streaming

Best for: Machine Learning Engineer, MLOps Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.