TorchTPU: Running PyTorch Natively on TPUs at Google Scale

2026-04-07 · Source: Google Developers Blog - AI · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cloud Computing & IT Infrastructure · Depth: Advanced, medium

Summary

Google has launched TorchTPU, an engineering initiative that enables native, efficient execution of PyTorch workloads on its Tensor Processing Units (TPUs). The integration responds to growing demand for distributed systems that can scale AI models across thousands of accelerators, and it targets usability, portability, and high performance. TorchTPU lets developers migrate existing PyTorch scripts with minimal code changes and provides APIs and tools to maximize TPU compute utilization. The architecture supports three eager execution modes: Debug Eager for troubleshooting, Strict Eager for asynchronous single-op dispatch, and Fused Eager, which automatically fuses operations and delivers a performance increase of 50% to more than 100%. It also integrates with `torch.compile`, using XLA and StableHLO for peak performance, and supports custom kernels via Pallas and JAX, with Helion support planned. For distributed training, TorchTPU supports DDP, FSDPv2, and DTensor, and it specifically addresses the Multiple Program, Multiple Data (MPMD) challenge, in which different devices run different programs, so that divergent executions are supported.

Key takeaway

For AI/ML directors evaluating infrastructure for large-scale PyTorch deployments, TorchTPU enables native, high-performance execution on Google TPUs. Investigate its Fused Eager mode for immediate performance gains, and consider refactoring models to match TPU hardware efficiencies (e.g., attention head dimensions of 128 or 256) to maximize compute utilization. The integration simplifies migration and supports distributed training, reducing friction when scaling AI workloads.

Key insights

TorchTPU enables native PyTorch execution on Google TPUs, prioritizing usability, performance, and hardware portability.

Principles

"Eager First" philosophy for flexible execution.
PyTorch-like experience with minimal code changes.
Optimize with awareness of TPU hardware.

Method

TorchTPU uses PyTorch's "PrivateUse1" interface for native tensor integration and offers Debug, Strict, and Fused Eager modes. For static compilation it relies on TorchDynamo, XLA, and StableHLO, and it supports custom kernels via Pallas and JAX.
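
Because the backend plugs in through "PrivateUse1", TPU tensors should behave like ordinary PyTorch tensors on a named device, which makes device-agnostic code the main prerequisite for migration. The following is a minimal sketch under that assumption, not TorchTPU's documented usage: the `train_step` helper and the `DEVICE` constant are illustrative names, the "tpu" device string is taken from the summary, and the example runs on "cpu" so that it works without TorchTPU installed.

```python
import torch
from torch import nn


def train_step(model, batch, targets, optimizer, device):
    """Run one optimization step on `device` and return the loss as a float."""
    batch, targets = batch.to(device), targets.to(device)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(batch), targets)
    loss.backward()
    optimizer.step()
    return loss.item()


# Assumption: with TorchTPU installed, this would be "tpu" (per the summary).
# "cpu" keeps the sketch runnable on any machine with PyTorch.
DEVICE = "cpu"

torch.manual_seed(0)
model = nn.Linear(8, 1).to(DEVICE)  # move the model before building the optimizer
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 8), torch.randn(32, 1)

losses = [train_step(model, x, y, optimizer, DEVICE) for _ in range(5)]
```

Keeping the device in a single variable means the rest of the script does not change when the target changes, which is the kind of minimal-change migration the summary describes.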

In practice

Migrate PyTorch scripts by changing the device initialization to "tpu".
Utilize Fused Eager mode for automatic performance gains.
Refactor models for 128 or 256 attention head dimensions on TPUs.
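
The head-dimension advice can be checked with simple arithmetic before touching a model. The sketch below is illustrative only: `head_dim` and `suggest_head_counts` are hypothetical helpers, not part of TorchTPU, and the only values taken from the summary are the 128 and 256 head dimensions.

```python
TPU_FRIENDLY_HEAD_DIMS = (128, 256)  # head dimensions named in the summary


def head_dim(hidden_size, num_heads):
    """Per-head dimension of a standard multi-head attention layer."""
    if hidden_size % num_heads != 0:
        raise ValueError("hidden_size must be divisible by num_heads")
    return hidden_size // num_heads


def suggest_head_counts(hidden_size):
    """Map each friendly head dimension to the head count that yields it.

    Head dimensions that do not divide hidden_size evenly are skipped.
    """
    return {
        dim: hidden_size // dim
        for dim in TPU_FRIENDLY_HEAD_DIMS
        if hidden_size % dim == 0
    }


current = head_dim(4096, 64)          # 64, not one of the friendly sizes
options = suggest_head_counts(4096)   # {128: 32, 256: 16}
```

Changing the head count alters the model architecture, so this applies when designing or retraining a model rather than when loading existing weights.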

Topics

TorchTPU
Google TPUs
PyTorch Integration
Distributed Training
XLA Compiler

Code references

pytorch/helion

Best for: CTO, VP of Engineering/Data, Director of AI/ML, Machine Learning Engineer, AI Engineer, AI Architect

Editorial summary, takeaway, and curation by AIssential. Original article published by Google Developers Blog - AI.