Google doesn't pay the Nvidia tax. Its new TPUs explain why.
Summary
Google has unveiled its eighth-generation Tensor Processing Units (TPU v8) as two specialized custom silicon designs: TPU 8t for frontier model training and TPU 8i for low-latency agentic inference and real-time sampling. The chips were previewed at a private gathering in Las Vegas and are set to ship later in 2026. Google SVP Amin Vahdat highlighted the company's end-to-end vertical integration of its AI stack, claiming it delivers unmatched cost-per-token economics. The decision to split the roadmap into two specialized chips was made in 2024, anticipating the industry's shift to reasoning models and agents. TPU 8t offers 2.8x the FP4 EFlops per pod of its predecessor, Ironwood (121 vs 42.5), and can scale to over 1 million chips with Virgo networking. TPU 8i delivers 9.8x the FP8 EFlops per pod (11.6 vs 1.2) and 6.8x the HBM capacity per pod (331.8 TB vs 49.2 TB), and uses a new Boardfly topology for a 5x latency improvement in real-time LLM sampling.
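As a quick sanity check on the generational multipliers, the per-pod figures quoted above can be divided directly. This is a minimal sketch, not anything from Google or the original article: the names `QUOTED`, `ratio`, and `ratios` are hypothetical, and the inputs are the rounded values from this summary.

```python
# Per-pod figures as quoted in the summary: (new generation, previous generation).
# These are rounded published values, so the ratios are approximate.
QUOTED = {
    "TPU 8t FP4 EFlops per pod": (121.0, 42.5),
    "TPU 8i FP8 EFlops per pod": (11.6, 1.2),
    "TPU 8i HBM TB per pod": (331.8, 49.2),
}


def ratio(new: float, old: float) -> float:
    """Return the generational multiplier new / old."""
    return new / old


def ratios(figures: dict) -> dict:
    """Compute the multiplier for every quoted metric, rounded to 2 decimals."""
    return {name: round(ratio(new, old), 2) for name, (new, old) in figures.items()}


if __name__ == "__main__":
    for name, value in ratios(QUOTED).items():
        print(f"{name}: {value}x")
```

With these rounded inputs the multipliers come out to roughly 2.85x, 9.67x, and 6.74x. The first matches the quoted 2.8x; the other two land slightly below the quoted 9.8x and 6.8x, presumably because the published multipliers were computed from unrounded figures.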
Key takeaway
For CTOs and VPs of Engineering evaluating cloud infrastructure for AI, Google's TPU v8 release signals a significant shift in compute economics. Assess how TPU 8t for training and TPU 8i for inference align with your workload profiles, particularly large-scale model development or latency-sensitive agent deployments. When planning your 2026-2027 compute strategy, weigh Google's claimed cost-per-token advantage and its vertical integration against the "Nvidia tax", and account for the friction of porting from existing CUDA/PyTorch ecosystems.
Key insights
Google's new specialized TPUs and vertical integration aim to reduce AI compute costs and enhance performance.
Principles
- Specialized silicon outperforms general-purpose chips for distinct AI workloads.
- Vertical integration across the AI stack drives cost efficiency.
- Network topology is critical for optimizing AI inference latency.
Method
Google designs two specialized chips (TPU 8t for training, TPU 8i for inference) and pairs them with custom networking (Virgo, Boardfly) and storage (TPU Direct Storage) within a vertically integrated AI stack.
In practice
- Evaluate 8t for large-scale model training and fine-tuning.
- Consider 8i for production agents and real-time LLM sampling.
- Assess latency benchmarks for agentic workloads on Vertex AI.
Topics
- Tensor Processing Units
- TPU 8t
- TPU 8i
- AI Infrastructure
- Vertical Integration
Best for: CTO, VP of Engineering/Data, Investor, AI Architect, Director of AI/ML, IT Professional
Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.