TorchTPU: Running PyTorch Natively on TPUs at Google Scale
Summary
Google has launched TorchTPU, an engineering initiative to run PyTorch workloads natively and efficiently on its Tensor Processing Units (TPUs). The integration responds to growing demand for distributed systems that can scale AI models across thousands of accelerators, and it targets usability, portability, and high performance. TorchTPU lets developers migrate existing PyTorch scripts with minimal code changes and provides APIs and tools to maximize TPU compute utilization. The architecture supports three eager execution modes: Debug Eager for troubleshooting, Strict Eager for asynchronous single-op dispatch, and Fused Eager, which automatically fuses operations for a reported 50% to 100%+ performance increase. It also integrates with `torch.compile`, using XLA and StableHLO for peak performance, and supports custom kernels via Pallas and JAX, with Helion support planned. For distributed training, TorchTPU supports DDP, FSDPv2, and DTensor, and it specifically addresses the Multiple Program, Multiple Data (MPMD) challenge so that divergent executions are supported.
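To make "fusing operations" concrete, here is a toy, pure-Python sketch of the general idea: a chain of elementwise operations applied either as separate passes (one intermediate result per operation) or as a single fused pass. It is an illustration only and says nothing about how TorchTPU's Fused Eager mode is implemented; the names `run_unfused`, `fuse`, and `run_fused` are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

ElementwiseOp = Callable[[float], float]


def run_unfused(values: Sequence[float], ops: Sequence[ElementwiseOp]) -> Tuple[List[float], int]:
    """Apply each op as its own pass, building one intermediate list per op."""
    current = list(values)
    passes = 0
    for op in ops:
        current = [op(v) for v in current]  # a full pass over the data per op
        passes += 1
    return current, passes


def fuse(ops: Sequence[ElementwiseOp]) -> ElementwiseOp:
    """Compose a chain of elementwise ops into a single function (hypothetical helper)."""
    def fused(v: float) -> float:
        for op in ops:
            v = op(v)
        return v
    return fused


def run_fused(values: Sequence[float], ops: Sequence[ElementwiseOp]) -> Tuple[List[float], int]:
    """Apply the composed op in a single pass, with no intermediate lists."""
    fused = fuse(ops)
    return [fused(v) for v in values], 1


# Scale, shift, then clamp at zero (a ReLU-like step).
ops = [lambda x: x * 2.0, lambda x: x + 1.0, lambda x: max(x, 0.0)]
values = [-1.0, 0.5, 2.0]

unfused_result, unfused_passes = run_unfused(values, ops)
fused_result, fused_passes = run_fused(values, ops)
print(unfused_result, unfused_passes)  # [0.0, 2.0, 5.0] 3
print(fused_result, fused_passes)      # [0.0, 2.0, 5.0] 1
```

Both paths compute the same values; the fused path simply touches the data once. On an accelerator, the analogous saving comes from fewer kernel launches and fewer round trips to memory.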
Key takeaway
For AI/ML Directors evaluating infrastructure for large-scale PyTorch deployments, TorchTPU offers native, high-performance execution on Google TPUs. Investigate its Fused Eager mode for immediate performance gains, and consider refactoring models to match TPU hardware efficiencies (for example, attention head dimensions of 128 or 256) to maximize compute utilization. The integration simplifies migration and provides distributed training support, reducing friction when scaling AI workloads.
Key insights
TorchTPU enables native PyTorch execution on Google TPUs, prioritizing usability, performance, and hardware portability.
Principles
- "Eager First" philosophy for flexible execution.
- PyTorch-like experience with minimal code changes.
- Design with awareness of TPU hardware characteristics.
Method
TorchTPU uses PyTorch's "PrivateUse1" interface for native tensor integration and offers Debug, Strict, and Fused Eager modes. It leverages TorchDynamo, XLA, and StableHLO for static compilation and supports custom kernels via Pallas and JAX.
In practice
- Migrate PyTorch scripts by changing the device initialization to "tpu".
- Utilize Fused Eager mode for automatic performance gains.
- Refactor models to use attention head dimensions of 128 or 256 on TPUs.
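The head-dimension recommendation above comes down to simple arithmetic: the per-head dimension is the model's hidden size divided by its number of attention heads. The sketch below checks a configuration against the 128/256 guidance and suggests head counts that would satisfy it. The function names and the example hidden size of 4096 are illustrative assumptions, not part of TorchTPU.

```python
# Head dimensions named in the guidance above.
TPU_FRIENDLY_HEAD_DIMS = (128, 256)


def head_dim(hidden_size: int, num_heads: int) -> int:
    """Per-head dimension for a standard multi-head attention layout."""
    if num_heads <= 0 or hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be a positive multiple of num_heads")
    return hidden_size // num_heads


def is_tpu_friendly(hidden_size: int, num_heads: int) -> bool:
    """True if the per-head dimension is one of the recommended values."""
    return head_dim(hidden_size, num_heads) in TPU_FRIENDLY_HEAD_DIMS


def suggest_num_heads(hidden_size: int) -> dict:
    """Map each recommended head dimension to the head count that yields it."""
    return {d: hidden_size // d for d in TPU_FRIENDLY_HEAD_DIMS if hidden_size % d == 0}


# Hypothetical model: hidden size 4096 with 64 heads gives a head dimension of 64.
print(head_dim(4096, 64))         # 64
print(is_tpu_friendly(4096, 64))  # False
print(suggest_num_heads(4096))    # {128: 32, 256: 16}
```

Changing the head count changes the model architecture, so treat a suggestion like this as a starting point for retraining or for a new design, not as a drop-in change to an existing checkpoint.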
Topics
- TorchTPU
- Google TPUs
- PyTorch Integration
- Distributed Training
- XLA Compiler
Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, AI Engineer, AI Architect
Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.