Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU
Summary
Rigel is an empirical characterization that reverse-engineers the Metal 4.1 tensor compute path on a single Apple M4 Max GPU and reports eleven previously hidden or contradicted facts. The Metal 4.1 fp8 (E4M3) matmul2d operation is emulated rather than hardware-accelerated: it sustains only 0.94x the throughput of fp16 despite reading half the operand bytes, which makes fp8 on the M4 Max a memory-footprint feature, not a performance one. Three signals (throughput, comparison against simdgroup_matrix, and power attribution) jointly indicate that matmul2d executes entirely on the GPU shader cores, with no dedicated matrix datapath and no routing to the Apple Neural Engine. matmul2d accumulates at fp32 precision or higher, and Rigel reconstructed the opaque 8x8 cooperative_tensor fragment layout. Acting on these findings, a hand-fused GEMM + bias + GELU kernel outperformed the decomposed path by 6.5-12.9% in cache-resident scenarios.
Key takeaway
For AI scientists and ML engineers optimizing GPU kernels on the Apple M4 Max: Metal 4.1's fp8 matmul2d is emulated, so it reduces memory footprint but does not outperform fp16 (0.94x the fp16 throughput). Consider hand-fusing operations such as GEMM + bias + GELU rather than relying on the decomposed path; the study measured gains of 6.5-12.9% for the fused kernel in cache-resident scenarios.
Key insights
Metal 4.1's fp8 matmul2d on the Apple M4 Max is emulated: it halves operand bytes but delivers no throughput gain over fp16.
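To make "emulated" concrete, here is a minimal Python sketch of the generic pattern: decode each fp8 operand to a wider type, then multiply and accumulate at higher precision. It assumes the widely used E4M3 convention (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities, one NaN pattern). The source does not say which E4M3 variant Metal uses or how its emulation is implemented, so the function names `decode_e4m3`, `emulated_dot`, and `operand_bytes` are hypothetical illustrations, not Metal APIs.

```python
import math

def decode_e4m3(byte):
    """Decode one E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7) to a float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return math.nan                    # the only NaN pattern; no infinities
    if exp == 0:
        return sign * man * 2.0 ** -9      # subnormal: (man / 8) * 2**(1 - 7)
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

def emulated_dot(a_bytes, b_bytes):
    """Dot product over fp8 operands: upconvert first, then multiply-accumulate."""
    acc = 0.0  # a Python float (fp64) stands in for the >=fp32 accumulator
    for a, b in zip(a_bytes, b_bytes):
        acc += decode_e4m3(a) * decode_e4m3(b)
    return acc

def operand_bytes(m, k, n, bytes_per_element):
    """Bytes occupied by the two inputs of an (m x k) @ (k x n) product."""
    return (m * k + k * n) * bytes_per_element

# fp8 halves the operand footprint relative to fp16, independent of throughput.
fp8_bytes = operand_bytes(1024, 1024, 1024, 1)
fp16_bytes = operand_bytes(1024, 1024, 1024, 2)
```

In this sketch the arithmetic itself never runs in fp8; only storage does. That matches the summary's reading of fp8 as a footprint saving rather than a speedup.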
Principles
- Hardware behavior can be reverse-engineered empirically.
- Performance claims require multi-signal validation.
- Emulated low-precision types offer memory savings, not necessarily speedups.
Method
Rigel used a checksum-gated, provenance-tracked microbenchmark harness to characterize hardware behavior and reconstruct opaque fragment layouts.
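A checksum-gated, provenance-tracked measurement can be sketched as follows. This is a hypothetical Python illustration of the idea, not Rigel's actual harness: `checksum`, `gated_benchmark`, and the recorded provenance fields are all assumed names. A timing is reported only when the kernel's output hashes to the expected value, and each result carries a record of the environment that produced it.

```python
import hashlib
import platform
import struct
import sys
import time

def checksum(values):
    """Hash the output packed as little-endian float32, so equality is bit-exact at fp32."""
    packed = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(packed).hexdigest()

def gated_benchmark(kernel, args, expected_checksum, repeats=5):
    """Time kernel(*args), but report a timing only if its output checksum matches."""
    provenance = {
        "platform": platform.platform(),
        "python": sys.version.split()[0],
        "repeats": repeats,
    }
    got = checksum(kernel(*args))
    if got != expected_checksum:
        # Wrong output: refuse to report a number that measures incorrect work.
        return {"valid": False, "checksum": got,
                "best_seconds": None, "provenance": provenance}
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel(*args)
        best = min(best, time.perf_counter() - start)
    return {"valid": True, "checksum": got,
            "best_seconds": best, "provenance": provenance}

# Usage with stand-in "kernels": one correct, one that produces wrong output.
def reference_kernel(n):
    return [i * 0.5 for i in range(n)]

def broken_kernel(n):
    return [i * 0.5 + 1.0 for i in range(n)]

expected = checksum(reference_kernel(8))
good = gated_benchmark(reference_kernel, (8,), expected)
bad = gated_benchmark(broken_kernel, (8,), expected)
```

Gating on correctness first prevents a fast but wrong kernel variant from entering the results.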
In practice
- Benchmark fp8 against fp16 on the M4 Max for memory-bound tasks before adopting it; do not assume a speedup.
- Consider hand-fusing kernels for Metal 4.1 performance gains.
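The difference between a decomposed and a hand-fused GEMM + bias + GELU can be sketched in plain Python. This is a CPU-side illustration of the data flow only, not a Metal kernel and not the study's implementation. It assumes the exact erf form of GELU (the source does not say which variant was used), and the names `decomposed` and `fused` are hypothetical.

```python
import math

def gelu(x):
    """Exact (erf-based) GELU; an assumption, since the source names no variant."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def dot(a_row, b, j):
    """Inner product of a_row with column j of b, accumulated left to right."""
    acc = 0.0
    for p, a_val in enumerate(a_row):
        acc += a_val * b[p][j]
    return acc

def decomposed(a, b, bias):
    """Three passes, each materializing a full intermediate matrix."""
    n = len(b[0])
    c = [[dot(row, b, j) for j in range(n)] for row in a]     # pass 1: GEMM
    c = [[c_row[j] + bias[j] for j in range(n)] for c_row in c]  # pass 2: bias
    return [[gelu(v) for v in c_row] for c_row in c]          # pass 3: GELU

def fused(a, b, bias):
    """One pass: bias and GELU are applied while the accumulator is still live."""
    n = len(b[0])
    return [[gelu(dot(row, b, j) + bias[j]) for j in range(n)] for row in a]

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [0.25, 2.0]]
bias = [0.1, -0.2]
out_decomposed = decomposed(a, b, bias)
out_fused = fused(a, b, bias)
```

Both functions compute the same values. The fused form avoids writing and re-reading two intermediate matrices, which is the kind of saving a hand-fused GPU kernel aims for.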
Topics
- Apple M4 Max
- Metal 4.1
- Tensor Compute
- GPU Shader Cores
- Microbenchmarking
- Kernel Fusion
- fp8 (E4M3)
Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.