Training an LLM from Scratch, Locally — Angelos Perivolaropoulos, ElevenLabs

2026-05-04 · Source: AI Engineer · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

This workshop is a hands-on guide to training a large language model (LLM) from scratch in PyTorch, without pre-trained weights or high-level libraries. Led by Angelos Perivolaropoulos, a research engineer at ElevenLabs, the session builds a small, GPT-2-based causal decoder-only model. It covers character-level tokenization, the transformer architecture's four building blocks (multi-head self-attention, MLP, residual connections, layer normalization), and a detailed training loop. The project uses a Shakespeare dataset of approximately 1 million characters, and training can run locally on a laptop with 16GB RAM or on Google Colab. The workshop also covers inference techniques such as temperature and top-k sampling, and closes with a challenge for participants to train the best Shakespearean text generation model.

Key takeaway

For AI Scientists and Machine Learning Engineers looking to deepen their understanding of LLM internals, this workshop provides a practical blueprint. You can build a functional, small-scale LLM from foundational components, gaining insight into how models are designed and trained in research labs. Focus on implementing the core transformer blocks and a well-structured training loop, paying close attention to learning rate schedules and validation loss to optimize your model's performance and avoid overfitting.

Key insights

Training an LLM from scratch comes down to understanding three core components: tokenization, the transformer architecture, and the training loop.

Principles

Character-level tokenization simplifies training for small models.
Transformer architecture fundamentals remain consistent across scales.
Learning rate schedules are critical for stable model training.
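
As an illustration of the first principle, here is a minimal sketch of character-level tokenization in plain Python. The function names (build_vocab, encode, decode) and the sample string are assumptions made for this example, not code from the workshop. The vocabulary is simply the sorted set of characters in the text, so a Shakespeare-sized corpus yields a vocabulary of only a few dozen tokens.

```python
def build_vocab(text):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}  # character -> id
    itos = {i: ch for ch, i in stoi.items()}      # id -> character
    return stoi, itos


def encode(text, stoi):
    """Turn a string into a list of token ids (raises KeyError on unseen characters)."""
    return [stoi[ch] for ch in text]


def decode(ids, itos):
    """Turn a list of token ids back into a string."""
    return "".join(itos[i] for i in ids)


# Hypothetical sample text; the workshop uses a Shakespeare corpus instead.
sample = "To be, or not to be"
stoi, itos = build_vocab(sample)
ids = encode(sample, stoi)
print(len(stoi), "distinct characters")
print(decode(ids, itos))
```

The trade-off is sequence length: each character costs one token, so the model sees less text per context window than it would with a subword tokenizer.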

Method

The method has three parts: tokenize the text at the character level, implement a GPT-2-based causal decoder-only transformer, and train it with cross-entropy loss under a cosine-decay learning rate schedule, tracking validation loss to detect overfitting.
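
The cosine decay schedule mentioned above can be sketched as a plain function of the training step. This is a sketch under stated assumptions: the linear warmup phase, the function name cosine_lr, and every default value (max_lr, min_lr, warmup_steps) are illustrative choices, not settings taken from the workshop.

```python
import math


def cosine_lr(step, max_steps, max_lr=3e-4, min_lr=3e-5, warmup_steps=100):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr.

    All default values are placeholders chosen for illustration.
    """
    if step < warmup_steps:
        # Assumed warmup: ramp linearly up to max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        # After the schedule ends, hold the floor value.
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    # Cosine coefficient falls smoothly from 1 to 0.
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


for step in (0, 100, 550, 1000):
    print(step, cosine_lr(step, max_steps=1000))
```

In a training loop, the returned value would be written into the optimizer's learning rate before each update step.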

In practice

Use Google Colab for free GPU access when training.
Monitor validation loss to detect model overfitting.
Employ temperature and top-k sampling for creative text generation.
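
Temperature and top-k sampling, named in the last item above, can be sketched in plain Python over a list of raw logits. The function name sample_next and the example logits are hypothetical; a PyTorch implementation would operate on tensors, but the logic is the same. Temperature divides the logits before the softmax (values below 1 sharpen the distribution, values above 1 flatten it), and top-k discards every token outside the k highest-scoring ones.

```python
import math
import random


def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token id from raw logits using temperature and optional top-k."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Temperature scaling: lower values concentrate mass on the top tokens.
    scaled = [x / temperature for x in logits]

    if top_k is not None and top_k < len(scaled):
        # Keep the k largest logits; ties at the cutoff are all kept.
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]

    # Numerically stable softmax; masked entries get probability 0.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical logits over a four-token vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
print([sample_next(logits, temperature=0.8, top_k=2, rng=rng) for _ in range(10)])
```

During generation, the sampled id is appended to the context and the model is run again to produce the next set of logits.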

Topics

LLM Training from Scratch
GPT-2 Architecture
Character-Level Tokenization
Transformer Building Blocks
Training Loop Optimization

Best for: AI Scientist, Machine Learning Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by AI Engineer.