6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You
Summary
This analysis covers six architectural design choices in Large Language Models (LLMs) that affect speed, cost, and capability, drawn from an end-to-end implementation of GPT-2 in PyTorch. It contrasts LoRA with RsLoRA, explaining how RsLoRA stabilizes weight updates by replacing "r" with "√r" in the scaling factor so that the variance of the update does not shrink as the rank increases. The article also highlights the benefits of Rotary Positional Embeddings (RoPE) over sinusoidal and learned positional embeddings, noting that RoPE adds no parameters and encodes position non-invasively, without altering the token embeddings. It discusses the diminishing relevance of weight tying in large models, the stability-performance trade-off between Pre-LayerNorm and Post-LayerNorm, and the efficiency gains of the KV cache, including Google Research's 2026 TurboQuant technique for 3-bit KV cache compression, which achieves a 5x-6x memory reduction. Finally, it explains why LayerNorm is typically skipped during INT8 quantization: its parameter count is negligible and it is highly sensitive to precision loss.
Key takeaway
AI Engineers optimizing LLM deployments benefit from understanding these architectural details. Prefer RsLoRA for fine-tuning to keep weight updates stable, and use RoPE for efficient positional encoding. When quantizing a model, do not apply INT8 to LayerNorm layers: they hold so few parameters that the memory saved is minimal, while their sensitivity to precision loss can degrade quality significantly. For long-context workloads, consider KV cache compression techniques such as TurboQuant for substantial memory savings.
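As a rough sanity check on the KV cache figures, the arithmetic can be sketched as below. This is not TurboQuant itself. It assumes an FP16 (16-bit) baseline cache, a GPT-2 small shape (12 layers, hidden size 768), and it ignores any per-block metadata a real quantizer would store. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bits_per_value, batch=1):
    """Approximate KV cache size: one key and one value vector per token per layer."""
    values = 2 * n_layers * d_model * seq_len * batch  # factor 2 = keys + values
    return values * bits_per_value / 8

# Assumed GPT-2 small shape at its full 1024-token context
fp16_bytes = kv_cache_bytes(n_layers=12, d_model=768, seq_len=1024, bits_per_value=16)
q3_bytes = kv_cache_bytes(n_layers=12, d_model=768, seq_len=1024, bits_per_value=3)
ratio = fp16_bytes / q3_bytes

print(f"FP16 cache:  {fp16_bytes / 2**20:.1f} MiB")
print(f"3-bit cache: {q3_bytes / 2**20:.2f} MiB")
print(f"Reduction:   {ratio:.2f}x")
```

Under these assumptions the ideal ratio is 16/3, which sits inside the quoted 5x-6x range. Any quantization metadata would lower it somewhat.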
Key insights
LLM architecture involves non-obvious design choices that significantly affect performance, cost, and training stability.
Principles
- Variance stability improves LoRA fine-tuning.
- Positional embeddings should not alter token embeddings.
- Quantization sensitivity varies by layer.
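The second principle can be illustrated with a minimal RoPE sketch. Rather than adding a position vector to the token embedding, RoPE rotates query and key vectors inside attention, so the attention score depends only on the relative offset between positions. The function names, the adjacent-pair layout, and the base of 10000 are assumptions chosen for illustration, not taken from the original article.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by position * theta_i."""
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even dimension"
    out = []
    for i in range(0, dim, 2):
        theta = base ** (-i / dim)      # rotation frequency for this pair
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]   # toy query vector
k = [1.0, 0.4, -0.7, 0.2]   # toy key vector

# Same relative offset (3) at two different absolute positions
score_near = dot(rope_rotate(q, 5), rope_rotate(k, 2))
score_far = dot(rope_rotate(q, 105), rope_rotate(k, 102))

print(abs(score_near - score_far) < 1e-9)  # scores agree up to float error
```

Because each step is a pure rotation, vector norms are preserved and no trainable parameters are introduced.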
Method
RsLoRA stabilizes LoRA by changing the scaling factor from α/r to α/√r. This keeps the variance of the weight updates constant as the rank increases, so the updates do not shrink at higher ranks.
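The difference between the two scaling factors can be sketched numerically. The helper names and the choice of α = 16 are hypothetical. The code only compares the multipliers applied to the low-rank update, not a full fine-tuning setup.

```python
import math

def lora_scale(alpha, rank):
    """Standard LoRA multiplier: alpha / r."""
    return alpha / rank

def rslora_scale(alpha, rank):
    """RsLoRA multiplier: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

alpha = 16  # assumed value, for illustration only
for rank in (4, 16, 64, 256):
    print(
        f"r={rank:>3}  LoRA: {lora_scale(alpha, rank):.4f}  "
        f"RsLoRA: {rslora_scale(alpha, rank):.4f}"
    )
```

Going from r = 4 to r = 256, the LoRA multiplier falls by a factor of 64 while the RsLoRA multiplier falls only by a factor of 8, which is why higher ranks remain useful under the √r scaling.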
In practice
- Use RsLoRA for stable fine-tuning.
- Implement RoPE for efficient positional encoding.
- Skip LayerNorm during INT8 quantization.
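The case for skipping LayerNorm can be checked with back-of-the-envelope arithmetic. The sketch below assumes the GPT-2 small configuration (12 blocks, hidden size 768, roughly 124M parameters in total) and an FP32-to-INT8 conversion. These figures are assumptions for illustration rather than numbers from the original article.

```python
# Assumed GPT-2 small configuration
n_layers = 12
d_model = 768
total_params = 124_439_808  # commonly cited total for GPT-2 small

# Two LayerNorms per transformer block plus one final LayerNorm
layernorm_count = 2 * n_layers + 1
params_per_layernorm = 2 * d_model  # one scale and one shift per feature
layernorm_params = layernorm_count * params_per_layernorm

share = layernorm_params / total_params
bytes_saved = layernorm_params * (4 - 1)  # FP32 (4 bytes) -> INT8 (1 byte)

print(f"LayerNorm parameters: {layernorm_params:,}")
print(f"Share of model:       {share:.4%}")
print(f"Memory saved by INT8: {bytes_saved / 1024:.1f} KiB")
```

Under these assumptions LayerNorm accounts for well under 0.1% of the parameters, so quantizing it saves on the order of 100 KiB while risking precision in a numerically sensitive operation.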
Topics
- LoRA
- RsLoRA
- Rotary Positional Embeddings
- KV Cache
- Quantization
Best for: AI Engineer, Machine Learning Engineer, AI Scientist
Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.