Rigel: Reverse-Engineering the Metal 4.1 Tensor Compute Path on the Apple M4 Max GPU

2026-06-11 · Source: Computation and Language · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

Rigel is an empirical characterization of the Metal 4.1 tensor compute path on a single Apple M4 Max GPU. It surfaces eleven facts that were previously hidden or contradicted. The headline finding is that the Metal 4.1 fp8 (E4M3) matmul2d operation is emulated rather than hardware-accelerated: it sustains only 0.94x the throughput of fp16 despite reading half the operand bytes, which makes fp8 on the M4 a memory-footprint feature, not a performance one. Three independent signals (throughput measurements, comparisons against simdgroup_matrix, and power attribution) indicate that matmul2d executes entirely on the GPU shader cores, with no dedicated matrix datapath and no routing to the Apple Neural Engine. The study also found that matmul2d accumulates in fp32 or wider, and it reconstructed the opaque 8x8 cooperative_tensor fragment layout. Acting on these findings, a hand-fused GEMM + bias + GELU kernel outperformed the decomposed path by 6.5-12.9% in cache-resident scenarios.
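One plausible reading of "emulated" is that each E4M3 byte must be widened to a type the shader cores multiply natively before it enters the multiply-accumulate. The sketch below decodes a single E4M3 byte in Python, assuming the common OCP encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities, one NaN pattern per sign). The function name `decode_e4m3` is hypothetical, and the source does not describe how Metal actually performs the conversion.

```python
import math

def decode_e4m3(byte: int) -> float:
    """Decode one fp8 E4M3 byte to a Python float (assumed OCP layout)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF   # 4 exponent bits, bias 7
    mantissa = byte & 0x7          # 3 mantissa bits
    if exponent == 0xF and mantissa == 0x7:
        return math.nan            # E4M3 has NaN but no infinities
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 2**-6.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal: implicit leading 1.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)
```

Under this assumed layout, a conversion of this kind has to run per operand element on the shader cores, which may help explain why halving the operand bytes does not translate into higher throughput.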

Key takeaway

If you optimize GPU kernels on the Apple M4 Max, treat Metal 4.1's fp8 matmul2d as a way to shrink memory footprint, not as a way to gain throughput over fp16: the operation is emulated. Consider hand-fusing operations such as GEMM + bias + GELU rather than relying on the decomposed path; the study measured gains of 6.5-12.9% in cache-resident scenarios.

Key insights

Metal 4.1's fp8 matmul2d on Apple M4 Max is emulated, prioritizing memory footprint over raw performance.
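The trade-off can be put in numbers. The sketch below computes operand bytes for an M x K by K x N multiply at 2 bytes (fp16) and 1 byte (fp8) per element, then converts the 0.94x throughput ratio from the summary into a relative runtime. The matrix shape and helper names are illustrative assumptions, not values from the study.

```python
FP8_RELATIVE_THROUGHPUT = 0.94  # from the summary: fp8 sustains 0.94x of fp16

def operand_bytes(m: int, k: int, n: int, bytes_per_element: int) -> int:
    """Bytes occupied by the two input operands A (m x k) and B (k x n)."""
    return (m * k + k * n) * bytes_per_element

def relative_runtime(relative_throughput: float) -> float:
    """Runtime relative to the baseline for the same amount of work."""
    return 1.0 / relative_throughput

m = k = n = 4096  # hypothetical shape, chosen only for illustration
fp16_bytes = operand_bytes(m, k, n, 2)
fp8_bytes = operand_bytes(m, k, n, 1)
slowdown = relative_runtime(FP8_RELATIVE_THROUGHPUT)

print(f"fp16 operands: {fp16_bytes / 2**20:.0f} MiB")
print(f"fp8 operands:  {fp8_bytes / 2**20:.0f} MiB")
print(f"fp8 runtime relative to fp16: {slowdown:.3f}x")
```

For any shape, the operand footprint halves while the same multiply takes slightly longer, which is the sense in which fp8 here is a footprint feature.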

Principles

Hardware behavior can be reverse-engineered empirically.
Performance claims require multi-signal validation.
Emulated low-precision types save memory but do not necessarily improve throughput.
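As an illustration of the first principle, accumulator width can be inferred from outputs alone by choosing inputs whose exact sum differs from what a narrow accumulator would produce. The sketch below simulates this in Python by rounding a running sum to fp16 or fp32 after every add using `struct`. It is a hypothetical probe, not the one Rigel used; a real test would feed equivalent inputs through matmul2d and compare the result against the two predictions.

```python
import struct

def round_to(value: float, fmt: str) -> float:
    """Round a Python float to IEEE half ('<e') or single ('<f') precision."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

def accumulate(terms, fmt: str) -> float:
    """Sum terms left to right, rounding the accumulator after each add."""
    total = 0.0
    for term in terms:
        total = round_to(total + term, fmt)
    return total

# One large term followed by many terms below half an fp16 ulp at 1.0.
terms = [1.0] + [2.0 ** -12] * 4096

fp16_result = accumulate(terms, "<e")  # small terms are rounded away
fp32_result = accumulate(terms, "<f")  # small terms survive

print("fp16 accumulator:", fp16_result)
print("fp32 accumulator:", fp32_result)
```

An observed result near the fp32 prediction rules out an fp16 accumulator, but it cannot distinguish fp32 from anything wider, which matches the "fp32 or wider" wording of the finding.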

Method

Rigel used a checksum-gated, provenance-tracked microbenchmark harness to characterize hardware behavior and reconstruct opaque fragment layouts.
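A minimal sketch of what "checksum-gated, provenance-tracked" could mean in practice: a timing is recorded only if the kernel's output hashes to the same digest as a trusted reference, and every accepted record carries metadata about how it was produced. All names here (`run_gated`, the record fields) are hypothetical, and the stand-in kernels are plain Python functions rather than Metal dispatches.

```python
import hashlib
import platform
import struct
import time

def checksum(values) -> str:
    """SHA-256 over the float32 bit patterns of a flat output list."""
    payload = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(payload).hexdigest()

def run_gated(name, kernel, reference, args, repeats=5):
    """Time `kernel`, discarding all timings if any output fails the checksum."""
    expected = checksum(reference(*args))
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        out = kernel(*args)
        elapsed = time.perf_counter() - start
        if checksum(out) != expected:
            # Gate: a wrong answer invalidates the measurement.
            return {"kernel": name, "status": "rejected", "timings_s": []}
        timings.append(elapsed)
    return {
        "kernel": name,
        "status": "accepted",
        "timings_s": timings,
        "best_s": min(timings),
        "provenance": {
            "machine": platform.machine(),
            "python": platform.python_version(),
            "args_digest": hashlib.sha256(repr(args).encode()).hexdigest(),
            "expected_checksum": expected,
        },
    }

def reference_scale(xs, factor):
    return [x * factor for x in xs]

def buggy_scale(xs, factor):
    return [x * factor + 1.0 for x in xs]  # deliberately wrong

data = ([1.0, 2.0, 3.0], 2.0)
good = run_gated("scale", reference_scale, reference_scale, data)
bad = run_gated("buggy_scale", buggy_scale, reference_scale, data)
print(good["status"], bad["status"])
```

A bit-exact digest is the strictest possible gate; kernels whose reduction order legitimately varies would need a tolerance-based comparison instead.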

In practice

Benchmark fp8 against fp16 on M4 Max before adopting it; expect the benefit in memory footprint, not throughput.
Consider hand-fusing kernels for Metal 4.1 performance gains.
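The fusion idea can be sketched without a GPU. The decomposed path stores the GEMM result, re-reads it for the bias add, and re-reads it again for GELU, while the fused path applies bias and GELU to each accumulator before it is stored once. The Python below is a reference-level illustration with hypothetical function names, using the exact erf form of GELU; it shows that the two paths agree numerically, not how fast either runs on Metal.

```python
import math

def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gemm(a, b):
    """Plain matrix multiply over lists of lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            out[i][j] = acc
    return out

def decomposed(a, b, bias):
    """Three passes, each materializing a full intermediate matrix."""
    c = gemm(a, b)
    c = [[v + bias[j] for j, v in enumerate(row)] for row in c]
    return [[gelu(v) for v in row] for row in c]

def fused(a, b, bias):
    """One pass: bias and GELU run in the epilogue, before the single store."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for p in range(inner):
                acc += a[i][p] * b[p][j]
            out[i][j] = gelu(acc + bias[j])
    return out

a = [[1.0, 2.0], [3.0, 4.0]]
b = [[0.5, -1.0], [0.25, 0.0]]
bias = [0.1, -0.2]
print(fused(a, b, bias))
```

On a GPU the saved work is the intermediate reads and writes rather than arithmetic, so the size of any gain depends on where the data resides; the summary reports its figures for cache-resident scenarios.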

Topics

Apple M4 Max
Metal 4.1
Tensor Compute
GPU Shader Cores
Microbenchmarking
Kernel Fusion
fp8 (E4M3)

Best for: AI Engineer, Research Scientist, AI Scientist, Machine Learning Engineer, AI Hardware Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Computation and Language.