NeuronFabric: A Software Reference Architecture for On-Chip Transformer Training with Local Adam
Summary
NeuronFabric is a software reference architecture for future FPGA and ASIC implementations of on-chip transformer training with local Adam updates. A complete C# prototype, written without external machine-learning frameworks, validates numerical correctness and memory requirements. The architecture was evaluated on a 334K-parameter autoregressive transformer (d=88, H=4, f=264, L=4, vocab=256) trained on the Shakespeare corpus. The BF16W configuration, which stores weights in BF16 while keeping the Adam optimizer moments in FP32, reached an evaluation loss of 1.5426 after 80K samples, within about 0.02 of the 1.5224 loss of an FP32 GPU reference. BF16W reduces memory from approximately 4.0 MB (FP32 with Adam) to 3.34 MB, which makes the model suitable for devices like the Xilinx ZCU102 and frees space for activation storage. The publication serves as an architectural disclosure and software reference for subsequent hardware exploration.
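The memory figures above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming that the totals count only the weights plus Adam's two moment tensors, that "334K" means roughly 334,000 parameters, and that MB is decimal (10^6 bytes). The function name and constant are illustrative and do not come from the original article.

```python
# Approximate parameter count quoted in the summary (assumption: 334K = 334,000).
PARAMS = 334_000


def training_state_bytes(n_params, weight_bytes, moment_bytes=4):
    """Bytes for the weights plus Adam's two moment tensors (m and v)."""
    return n_params * (weight_bytes + 2 * moment_bytes)


# FP32 weights (4 bytes) with FP32 moments: 12 bytes per parameter.
fp32_total = training_state_bytes(PARAMS, weight_bytes=4)
# BF16 weights (2 bytes) with FP32 moments: 10 bytes per parameter.
bf16w_total = training_state_bytes(PARAMS, weight_bytes=2)

print(f"FP32 + Adam : {fp32_total / 1e6:.2f} MB")
print(f"BF16W + Adam: {bf16w_total / 1e6:.2f} MB")
print(f"Saved       : {(fp32_total - bf16w_total) / 1e6:.2f} MB")
```

Under these assumptions the totals come to about 4.01 MB and 3.34 MB, in line with the figures in the summary. Activations and any gradient buffers are not included.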
Key takeaway
For AI hardware engineers designing custom accelerators for transformer training, NeuronFabric offers a validated software reference for performing local Adam updates directly on-chip. Its BF16W mixed-precision approach reduces memory requirements to 3.34 MB for a 334K-parameter model, which makes training feasible on devices like the Xilinx ZCU102. Because the C# prototype validates numerical correctness before any hardware is built, it can serve as a starting point for FPGA and ASIC exploration and may shorten development.
Key insights
NeuronFabric enables memory-efficient on-chip transformer training using local Adam updates and a BF16W mixed-precision approach.
Principles
- Integrate Adam updates directly on-chip.
- Use BF16 for weights, FP32 for Adam moments.
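The two principles above can be illustrated with a scalar Adam step in which the weight is rounded to BF16 after each update while the moments stay in FP32. This is a sketch under stated assumptions, not the NeuronFabric implementation: the helper names are hypothetical, the hyperparameters are the usual Adam defaults rather than values from the article, and round-to-nearest-even is assumed for the BF16 conversion.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to BF16 with ties to even (NaN/Inf handling omitted)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def adam_step_bf16w(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step: FP32 moments, BF16 weight storage (assumed defaults)."""
    m = to_fp32(beta1 * m + (1.0 - beta1) * g)      # first moment, kept in FP32
    v = to_fp32(beta2 * v + (1.0 - beta2) * g * g)  # second moment, kept in FP32
    m_hat = m / (1.0 - beta1 ** t)                  # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = to_bf16(w - lr * m_hat / (math.sqrt(v_hat) + eps))  # weight stored as BF16
    return w, m, v


w0 = to_bf16(0.01)
w1, m1, v1 = adam_step_bf16w(w0, g=0.5, m=0.0, v=0.0, t=1)
print(w0, w1)
```

BF16 keeps only 8 significant bits, so an update much smaller than the weight itself can round away. The example uses a small weight so that the step remains visible after rounding.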
Method
The method is a C# prototype that implements the forward pass, backpropagation, and local Adam optimization without external ML frameworks.
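To give a sense of what framework-free forward and backward passes look like, here is a minimal sketch for a single linear unit with a squared-error loss, written in plain Python rather than C#. The hand-derived gradients are compared against central finite differences, which is one common way to check hand-written backpropagation. The article does not say whether the prototype uses this check, and all names and values below are illustrative.

```python
def forward(w, b, x):
    """Linear unit: y = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def loss_and_grads(w, b, x, target):
    """Squared-error loss with hand-derived gradients."""
    err = forward(w, b, x) - target
    loss = 0.5 * err * err
    grad_w = [err * xi for xi in x]  # dL/dw_i = (y - target) * x_i
    grad_b = err                     # dL/db   = (y - target)
    return loss, grad_w, grad_b


def finite_difference(w, b, x, target, i, h=1e-6):
    """Central-difference estimate of dL/dw_i."""
    w_plus = list(w)
    w_plus[i] += h
    w_minus = list(w)
    w_minus[i] -= h
    loss_plus, _, _ = loss_and_grads(w_plus, b, x, target)
    loss_minus, _, _ = loss_and_grads(w_minus, b, x, target)
    return (loss_plus - loss_minus) / (2 * h)


# Arbitrary illustrative values.
w, b, x, target = [0.5, -0.25, 0.1], 0.05, [1.0, 2.0, -3.0], 0.7
loss, grad_w, grad_b = loss_and_grads(w, b, x, target)
max_err = max(
    abs(grad_w[i] - finite_difference(w, b, x, target, i)) for i in range(len(w))
)
print(f"loss={loss:.5f}, max gradient error={max_err:.2e}")
```

A transformer layer needs many more such hand-derived gradients (attention, layer normalization, softmax), but the checking pattern stays the same.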
In practice
- Fit 334K-parameter models on Xilinx ZCU102 BRAM.
- Validate numerical correctness via C# prototype.
Topics
- NeuronFabric
- Transformer Training
- On-Chip AI
- Adam Optimization
- Mixed Precision
- FPGA/ASIC Design
Best for: Research Scientist, AI Hardware Engineer, AI Scientist, Machine Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.