How Outpost VFX Uses AWS to Accelerate AI Model Training for Visual Effects
Summary
Outpost VFX, a visual effects studio, significantly accelerated AI model training for its face replacement workflows by migrating from single-GPU workstations to multi-GPU AWS P5 instances. Previously, training on RTX 3090 GPUs took 1–2 weeks, which created production bottlenecks for client deliverables. Working with the AWS Generative AI Innovation Center over a 6-week advisory period, Outpost VFX adapted its existing face swap model codebase to use PyTorch Distributed Data Parallel (DDP) training on the NVIDIA H100 GPUs in P5 instances. On this hardware, cited as offering 14,592 CUDA cores and 80 GB of HBM3 memory, training ran 8x faster. As a result, the initial client review delivery (v001) now takes 2 days instead of 1–2 weeks, which shortens iteration cycles and improves output quality.
Key takeaway
For Machine Learning Engineers or AI Architects struggling with slow AI model training in VFX or similar compute-intensive fields, migrating to a multi-GPU cloud architecture such as AWS P5 instances can remove the bottleneck. Outpost VFX reported an 8x training speedup after adopting distributed training with PyTorch DDP. Faster training shortens iteration cycles, improves output quality, and cuts client delivery timelines from weeks to days, which makes AI tools practical as an integral part of production pipelines.
Key insights
Distributed multi-GPU training on cloud infrastructure dramatically accelerates AI model development for VFX production.
Principles
- Parallelize training across multiple GPUs.
- Use high-bandwidth GPU interconnects.
- Prioritize security for sensitive data.
Method
Adapt existing AI model code to PyTorch Distributed Data Parallel (DDP) for multi-GPU training on cloud instances like AWS P5.
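As a rough illustration of what such an adaptation involves, here is a minimal DDP training sketch. It is not Outpost VFX's code: the linear model, the random tensors, and the train function are hypothetical placeholders standing in for the face swap model and its data. It assumes the script is launched with torchrun, which sets RANK, WORLD_SIZE, and LOCAL_RANK; the defaults let it also run as a single CPU process for a smoke test.

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train(num_epochs: int = 2) -> float:
    """Train a placeholder model with DDP and return the last batch loss."""
    # torchrun exports these variables; the defaults allow a 1-process run.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))

    # NCCL is the usual backend for GPUs; gloo is the CPU fallback.
    use_cuda = torch.cuda.is_available()
    if use_cuda:
        torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}" if use_cuda else "cpu")
    dist.init_process_group(
        "nccl" if use_cuda else "gloo", rank=rank, world_size=world_size
    )

    torch.manual_seed(0)
    # Hypothetical stand-ins for the real dataset and model.
    dataset = TensorDataset(torch.randn(256, 16), torch.randn(256, 1))
    # DistributedSampler gives each process a disjoint shard of the data.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=32, sampler=sampler)

    model = nn.Linear(16, 1).to(device)
    ddp_model = DDP(model, device_ids=[local_rank] if use_cuda else None)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()

    loss = torch.tensor(0.0)
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP averages gradients across processes here
            optimizer.step()

    dist.destroy_process_group()
    return loss.item()


if __name__ == "__main__":
    print(f"final loss: {train():.4f}")
```

Saved under a hypothetical name such as train_ddp.py, the script could be started with `torchrun --nproc_per_node=8 train_ddp.py` to run one process per GPU on a p5.48xlarge, which has eight H100 GPUs. The training loop itself barely changes from the single-GPU version; the main additions are process group setup, the sampler, and the DDP wrapper.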
In practice
- Evaluate single-GPU bottlenecks in AI workflows.
- Explore AWS EC2 P5 instances for distributed training.
- Engage the AWS Generative AI Innovation Center for guidance.
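When evaluating a single-GPU bottleneck, a back-of-the-envelope projection can help decide whether a migration is worthwhile. The sketch below plugs in the figures from the summary (1–2 weeks on one GPU, an 8x speedup). It assumes the speedup applies uniformly to the whole training run, which the source does not state, and projected_days is a hypothetical helper.

```python
def projected_days(single_gpu_days: float, speedup: float) -> float:
    """Estimate training time after applying a uniform speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return single_gpu_days / speedup


SPEEDUP = 8.0  # figure reported in the summary; assumed uniform here

# 1-2 weeks of single-GPU training, expressed in days.
for days in (7.0, 14.0):
    estimate = projected_days(days, SPEEDUP)
    print(f"{days:.0f} days on one GPU -> {estimate:.3f} days at {SPEEDUP:.0f}x")
```

Under that assumption, training lands between 0.875 and 1.75 days, which is consistent with the reported 2-day turnaround for the v001 delivery.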
Topics
- AI Model Training
- Visual Effects
- Distributed Training
- AWS EC2 P5 Instances
- PyTorch DDP
- NVIDIA H100 GPUs
Best for: Machine Learning Engineer, Computer Vision Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.