TinyTPU: SystemVerilog systolic array compiled to WASM, running live in browser - RTL golden-verified against numpy [P]
Summary
TinyTPU is a 4x4 weight-stationary systolic array implemented in SystemVerilog, compiled to WebAssembly, and presented with a step-by-step browser visualization. This interactive tool allows users to input matrices and observe the actual hardware execution, including weights loading into Processing Elements (PEs), matrix A streaming diagonally, partial sums accumulating, and results draining. The project features three distinct levels: L1 isolates a single Multiply-Accumulate (MAC) cell, L2 demonstrates the full 4x4 array executing a real matrix multiplication, and L3 illustrates tiling for matrices larger than the hardware. The visualization directly reads state from the compiled Register-Transfer Level (RTL), ensuring accuracy and providing a concrete understanding of how matrix multiplication maps to hardware and why TPUs are efficient.
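As a rough illustration of what the L1 level isolates, the behavior of one weight-stationary MAC cell can be sketched in Python. This is not the project's RTL: the class name `MacCell`, its fields, and the `column_dot` helper are hypothetical, and the sketch assumes integer operands with no overflow handling.

```python
from dataclasses import dataclass


@dataclass
class MacCell:
    """Hypothetical model of one weight-stationary processing element."""
    weight: int = 0    # loaded once, then held while activations stream by
    a_out: int = 0     # activation forwarded to the neighbor on the right
    psum_out: int = 0  # partial sum forwarded to the neighbor below

    def step(self, a_in: int, psum_in: int) -> tuple[int, int]:
        """One clock cycle: multiply, accumulate, and forward both values."""
        self.a_out = a_in
        self.psum_out = psum_in + a_in * self.weight
        return self.a_out, self.psum_out


def column_dot(activations, weights):
    """Chain cells top to bottom so the partial sum becomes a dot product.

    Timing is ignored here: each cell is stepped once, in order.
    """
    psum = 0
    for a, w in zip(activations, weights):
        _, psum = MacCell(weight=w).step(a_in=a, psum_in=psum)
    return psum


cell = MacCell(weight=3)
print(cell.step(a_in=2, psum_in=10))               # (2, 16)
print(column_dot([1, 2, 3, 4], [5, 6, 7, 8]))      # 70
```

The cell forwards the activation unchanged and adds its own product to the incoming partial sum, which is why a column of such cells yields one output element.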
Key takeaway
For AI Hardware Engineers or Machine Learning Engineers learning hardware acceleration, TinyTPU is a practical interactive tool. If you are struggling to see how matrix multiplication maps to hardware, or why TPUs are efficient, the visualization shows it directly. Work through the L1, L2, and L3 levels in order to follow MAC cell operation, full array execution, and matrix tiling, with concrete hardware state on screen rather than only diagrams from papers.
Key insights
TinyTPU provides an interactive, browser-based visualization of a 4x4 SystemVerilog systolic array, clarifying hardware matrix multiplication.
Principles
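- Systolic arrays stream inputs diagonally: each array row receives its operands one cycle later than the row above, so every operand reaches a PE together with the matching partial sum.
- A weight-stationary design loads each weight into a PE once and reuses it for every row of the input matrix, which avoids repeated weight fetches.
- Tiling splits matrices larger than the array into array-sized blocks that the fixed hardware processes one at a time, accumulating the partial results.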
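The first two principles can be sketched as a cycle-level Python model checked against numpy. This is an assumed textbook dataflow, not a transcription of TinyTPU's RTL: the function name `systolic_matmul`, the register names, and the exact skew (row `k` of the array delayed by `k` cycles) are illustrative choices.

```python
import numpy as np


def systolic_matmul(A, W):
    """Cycle-level sketch of a weight-stationary systolic array.

    PE (k, n) holds W[k, n]. Activations move right, partial sums move down.
    A[m, k] enters array row k at cycle m + k (the diagonal skew).
    Returns the product and the number of cycles simulated.
    """
    A = np.asarray(A, dtype=np.int64)
    W = np.asarray(W, dtype=np.int64)
    M, K = A.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    a_reg = np.zeros((K, N), dtype=np.int64)  # activation register per PE
    p_reg = np.zeros((K, N), dtype=np.int64)  # partial-sum register per PE
    C = np.zeros((M, N), dtype=np.int64)
    cycles = M + N + K - 2                    # last result drains at this point

    for t in range(cycles):
        # Activations: column 0 reads the skewed input, others read the left PE.
        a_in = np.zeros((K, N), dtype=np.int64)
        for k in range(K):
            m = t - k
            a_in[k, 0] = A[m, k] if 0 <= m < M else 0
        a_in[:, 1:] = a_reg[:, :-1]

        # Partial sums: row 0 starts from zero, others read the PE above.
        p_in = np.zeros((K, N), dtype=np.int64)
        p_in[1:, :] = p_reg[:-1, :]

        # Clock edge: every PE updates its registers at once.
        a_reg = a_in
        p_reg = p_in + a_in * W

        # Drain: the bottom row of column n holds C[m, n] at t = m + n + K - 1.
        for n in range(N):
            m = t - n - (K - 1)
            if 0 <= m < M:
                C[m, n] = p_reg[K - 1, n]
    return C, cycles


rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(4, 4))
W = rng.integers(-8, 8, size=(4, 4))
C, cycles = systolic_matmul(A, W)
assert np.array_equal(C, A @ W)  # numpy serves as the golden reference
print(cycles)                    # 10
```

In this model the weights never move after loading, while each activation is reused by every PE in its row, which is the data reuse the weight-stationary principle refers to.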
Method
TinyTPU compiles the SystemVerilog RTL to WebAssembly and visualizes its execution in the browser at three levels: a single MAC cell, a full 4x4 array matmul, and matrix tiling.
In practice
- Visualize MAC cell operations.
- Observe 4x4 matrix multiplication.
- Understand matrix tiling for larger inputs.
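Tiling for larger inputs can be sketched as follows. This is a minimal illustration under stated assumptions, not TinyTPU's scheduler: `hw_matmul`, `pad_to_tile`, and `tiled_matmul` are hypothetical names, the 4x4 "hardware" is stood in for by a numpy product, and edge tiles are zero-padded.

```python
import numpy as np

TILE = 4  # side length of the fixed hardware array


def hw_matmul(a_tile, w_tile):
    """Stand-in for the 4x4 array: accepts only TILE x TILE operands."""
    assert a_tile.shape == (TILE, TILE) and w_tile.shape == (TILE, TILE)
    return a_tile @ w_tile


def pad_to_tile(X):
    """Zero-pad both dimensions up to the next multiple of TILE."""
    rows = -(-X.shape[0] // TILE) * TILE  # ceiling division
    cols = -(-X.shape[1] // TILE) * TILE
    out = np.zeros((rows, cols), dtype=X.dtype)
    out[:X.shape[0], :X.shape[1]] = X
    return out


def tiled_matmul(A, W):
    """Compute A @ W using only TILE x TILE hardware calls."""
    A = np.asarray(A, dtype=np.int64)
    W = np.asarray(W, dtype=np.int64)
    M, K = A.shape
    K2, N = W.shape
    assert K == K2, "inner dimensions must match"

    Ap, Wp = pad_to_tile(A), pad_to_tile(W)
    C = np.zeros((Ap.shape[0], Wp.shape[1]), dtype=np.int64)

    for k0 in range(0, Ap.shape[1], TILE):
        for n0 in range(0, Wp.shape[1], TILE):
            # Load one weight tile, then reuse it for every block of A rows.
            w_tile = Wp[k0:k0 + TILE, n0:n0 + TILE]
            for m0 in range(0, Ap.shape[0], TILE):
                a_tile = Ap[m0:m0 + TILE, k0:k0 + TILE]
                # Accumulate partial products across the K dimension.
                C[m0:m0 + TILE, n0:n0 + TILE] += hw_matmul(a_tile, w_tile)
    return C[:M, :N]  # strip the padding


rng = np.random.default_rng(0)
A = rng.integers(-8, 8, size=(6, 10))
W = rng.integers(-8, 8, size=(10, 7))
assert np.array_equal(tiled_matmul(A, W), A @ W)
```

Keeping the weight tile in the outer loops mirrors the weight-stationary idea: each tile of weights is loaded once and reused across all row blocks of the input before the next tile is fetched.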
Topics
- Systolic Arrays
- WebAssembly
- SystemVerilog
- Hardware Acceleration
- Matrix Multiplication
- TPU Architecture
Best for: AI Hardware Engineer, AI Student, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.