Google doesn't pay the Nvidia tax. Its new TPUs explain why.

2026-04-22 · Source: VentureBeat · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cloud Computing & IT Infrastructure, Emerging Technologies & Innovation · Depth: Intermediate, short

Summary

Google has unveiled its eighth-generation Tensor Processing Units (TPU v8) as two specialized custom silicon designs: TPU 8t for frontier model training and TPU 8i for low-latency agentic inference and real-time sampling. Previewed at a private gathering in Las Vegas, the chips are set to ship later in 2026. Google SVP Amin Vahdat highlighted the company's end-to-end vertical integration of its AI stack, claiming it delivers unmatched cost-per-token economics. Google decided to split the roadmap into two specialized chips in 2024, anticipating the industry's shift to reasoning models and agents. TPU 8t offers 2.8x the FP4 EFlops per pod (121 vs 42.5) of its predecessor, Ironwood, and can scale to over 1 million chips with Virgo networking. TPU 8i delivers 9.8x the FP8 EFlops per pod (11.6 vs 1.2) and 6.8x the HBM capacity per pod (331.8 TB vs 49.2 TB), and uses a new Boardfly topology for a 5x latency improvement in real-time LLM sampling.
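
As a quick sanity check on the per-pod multipliers quoted above, the ratios can be recomputed from the paired figures. This is a minimal sketch: the dictionary labels and the `ratio` helper are illustrative names, and the inputs are only the rounded numbers given in this summary.

```python
# Per-pod figures as quoted in the summary: (new value, baseline value, quoted multiplier).
# The labels are illustrative; the numbers are the rounded ones from the text.
quoted = {
    "TPU 8t FP4 EFlops per pod": (121.0, 42.5, 2.8),
    "TPU 8i FP8 EFlops per pod": (11.6, 1.2, 9.8),
    "TPU 8i HBM TB per pod": (331.8, 49.2, 6.8),
}


def ratio(new: float, baseline: float) -> float:
    """Return how many times larger `new` is than `baseline`."""
    return new / baseline


if __name__ == "__main__":
    for label, (new, baseline, claimed) in quoted.items():
        computed = ratio(new, baseline)
        print(f"{label}: computed {computed:.2f}x, quoted {claimed}x")
```

From the rounded inputs, the ratios come out to roughly 2.85x, 9.67x, and 6.74x. The first matches the quoted 2.8x; the other two are slightly below the quoted 9.8x and 6.8x, which may simply reflect multipliers computed from unrounded figures (an assumption, since the summary does not say).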

Key takeaway

For CTOs and VPs of Engineering evaluating cloud infrastructure for AI, Google's TPU v8 release signals a significant shift in compute economics. Assess how the training-focused TPU 8t and the inference-focused TPU 8i align with your workload profiles, particularly for large-scale model development or latency-sensitive agent deployments. When planning your 2026-2027 compute strategy, weigh Google's claimed cost-per-token advantage and its vertical integration against the "Nvidia tax", and account for the friction of porting workloads from existing CUDA/PyTorch ecosystems.

Key insights

Google's new specialized TPUs and vertical integration aim to reduce AI compute costs and enhance performance.

Principles

Specialized silicon outperforms general-purpose chips for distinct AI workloads.
Vertical integration across the AI stack drives cost efficiency.
Network topology is critical for optimizing AI inference latency.

Method

Google designs two specialized chips (8t for training, 8i for inference) and pairs them with custom networking (Virgo, Boardfly) and storage (TPU Direct Storage) within a vertically integrated AI stack.

In practice

Evaluate 8t for large-scale model training and fine-tuning.
Consider 8i for production agents and real-time LLM sampling.
Assess latency benchmarks for agentic workloads on Vertex AI.
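
For the latency assessment in the last item, a small timing harness is often enough for a first pass. The sketch below is not tied to Vertex AI or any TPU API: `fake_endpoint` is a hypothetical stand-in for whatever client call you would actually measure, and `measure_latency` and `summarize` are illustrative helper names.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(call: Callable[[], object], n_requests: int = 50) -> List[float]:
    """Run `call` sequentially n_requests times; return per-call latency in milliseconds."""
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        call()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def summarize(latencies_ms: List[float]) -> Dict[str, float]:
    """Report median, tail percentiles, and worst case for a list of latencies."""
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50": statistics.median(latencies_ms),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(latencies_ms),
    }


def fake_endpoint() -> None:
    """Hypothetical placeholder: replace with a real inference request."""
    time.sleep(0.002)


if __name__ == "__main__":
    print(summarize(measure_latency(fake_endpoint, n_requests=20)))
```

For agentic workloads, tail percentiles usually matter more than the median, because a single agent task chains many model calls and the slow ones compound.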

Topics

Tensor Processing Units
TPU 8t
TPU 8i
AI Infrastructure
Vertical Integration

Best for: CTO, VP of Engineering/Data, Investor, AI Architect, Director of AI/ML, IT Professional

Editorial summary, takeaway, and curation by AIssential. Original article published by VentureBeat.