6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You

Source: Towards Data Science · Field: Technology & Digital (Artificial Intelligence & Machine Learning, Software Development & Engineering) · Depth: Advanced, long

Summary

This analysis covers six architectural design choices in Large Language Models (LLMs) that affect speed, cost, and capability, drawing on an end-to-end implementation of GPT-2 in PyTorch. It contrasts LoRA with RsLoRA, explaining that RsLoRA stabilizes weight updates by replacing r with √r in the denominator of the scaling factor, so that the variance of the update does not shrink as the rank increases. It highlights the benefits of Rotary Positional Embeddings (RoPE) over sinusoidal positional encodings and learned positional embeddings, noting that RoPE adds no parameters and encodes position non-invasively. It discusses the diminishing relevance of weight tying in large models, the stability-performance trade-off between Pre-LayerNorm and Post-LayerNorm, and the efficiency gains of the KV-Cache, including Google Research's 2026 TurboQuant technique for 3-bit KV-cache compression, which achieves a 5x-6x memory reduction. Finally, it explains why LayerNorm is typically skipped during INT8 quantization: its parameter count is negligible and it is highly sensitive to precision loss.
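To make the RoPE claim concrete, here is a minimal sketch in plain Python rather than the article's PyTorch code. The function names, the toy vectors, and the base of 10000 are assumptions chosen for illustration. It rotates consecutive pairs of a query or key vector by position-dependent angles, which needs no learned parameters and leaves the token embeddings themselves untouched.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... by position-dependent angles.

    `base` is an assumed default; nothing here is learned.
    """
    dim = len(vec)
    assert dim % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (hypothetical values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (3) at very different absolute positions.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
print(abs(score_a - score_b) < 1e-9)
```

Because each pair is only rotated, the attention score between a query and a key depends on their relative offset rather than on their absolute positions, and vector norms are preserved.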

Key takeaway

For AI engineers optimizing LLM deployments, these architectural details have direct practical consequences. Prefer RsLoRA for fine-tuning to keep weight updates stable at higher ranks, and use RoPE for parameter-free positional encoding. When quantizing a model, leave LayerNorm layers out of INT8: the memory savings are minimal and do not justify the quality degradation caused by LayerNorm's sensitivity to precision loss. For long-context workloads, consider KV-Cache compression techniques such as TurboQuant for substantial memory savings.
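The LayerNorm point can be checked with simple arithmetic. The sketch below assumes the dimensions of the smallest GPT-2 configuration (12 layers, width 768, vocabulary 50257, context 1024, tied embeddings) and an FP16 baseline; the helper name is hypothetical and the numbers are not taken from the article.

```python
def gpt2_param_counts(n_layer=12, d_model=768, vocab=50257, n_ctx=1024):
    """Count parameters for a GPT-2 style decoder with tied input/output embeddings."""
    layernorm = 2 * d_model                      # gain + bias
    attn = d_model * 3 * d_model + 3 * d_model   # fused QKV projection
    attn += d_model * d_model + d_model          # attention output projection
    mlp = d_model * 4 * d_model + 4 * d_model    # MLP up projection
    mlp += 4 * d_model * d_model + d_model       # MLP down projection
    block = 2 * layernorm + attn + mlp           # two LayerNorms per block
    embeddings = vocab * d_model + n_ctx * d_model
    total = embeddings + n_layer * block + layernorm  # plus the final LayerNorm
    ln_total = (2 * n_layer + 1) * layernorm
    return total, ln_total

total, ln_total = gpt2_param_counts()
share = ln_total / total
saved_bytes = ln_total * (2 - 1)  # assumed FP16 (2 bytes) -> INT8 (1 byte)

print(f"total parameters:     {total:,}")
print(f"LayerNorm parameters: {ln_total:,}")
print(f"LayerNorm share:      {share:.4%}")
print(f"bytes saved by INT8:  {saved_bytes:,}")
```

Under these assumptions LayerNorm holds 38,400 of 124,439,808 parameters, about 0.03% of the model, so quantizing it would save under 40 KB while touching the layers most sensitive to rounding.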

Key insights

LLM architecture involves non-obvious design choices that significantly affect performance, cost, and training stability.

Method

RsLoRA stabilizes LoRA by changing the scaling factor from α/r to α/√r. This keeps the variance of the weight updates roughly constant as the rank increases, instead of letting the updates shrink.
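The difference between the two scaling factors is easiest to see numerically. This is a minimal sketch with hypothetical function names and an assumed α of 16; it only compares the scalar multipliers and does not train an adapter.

```python
import math

def lora_scale(alpha: float, rank: int) -> float:
    """Classic LoRA scaling factor: alpha / r."""
    return alpha / rank

def rslora_scale(alpha: float, rank: int) -> float:
    """Rank-stabilized LoRA scaling factor: alpha / sqrt(r)."""
    return alpha / math.sqrt(rank)

alpha = 16.0  # assumed value for illustration
for rank in (4, 16, 64, 256):
    print(f"r={rank:>3}  LoRA={lora_scale(alpha, rank):.4f}  "
          f"RsLoRA={rslora_scale(alpha, rank):.4f}")
```

With α fixed at 16, the LoRA factor falls to 0.0625 at rank 256 while the RsLoRA factor is still 1.0, which is the shrinking effect the α/√r scaling is meant to counter.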

Best for: AI Engineer, Machine Learning Engineer, AI Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Towards Data Science.