Building LLMs from the Ground Up: A 3-hour Coding Workshop

2024-08-31 · Source: Sebastian Raschka · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, extended

Summary

This coding workshop, led by Sebastian Raschka, a staff research engineer at Lightning AI, focuses on building large language models (LLMs) from the ground up. It covers six main topics, starting with an introduction to LLMs and then moving to practical coding examples: understanding LLM input data, coding the LLM architecture, pre-training an LLM on a small dataset, loading larger pre-trained LLMs, and fine-tuning LLMs to follow human instructions. The content emphasizes a "from scratch" approach to build a deeper understanding of LLM mechanics, and uses a public domain short story, "The Verdict" by Edith Wharton, as training data. It also introduces the `tiktoken` library for efficient tokenization and the LitGPT library for more advanced LLM tasks, including Low-Rank Adaptation (LoRA) for efficient fine-tuning.

Key takeaway

For AI engineers and students who want to understand LLM mechanics in depth, this workshop provides a practical, code-centric approach. Prioritize hands-on implementation of core components such as tokenization and the model architecture to grasp the underlying principles. When experimenting with larger models or datasets, consider tools like LitGPT and techniques like LoRA to keep compute requirements manageable.

Key insights

Building LLMs from scratch, including data preparation, architecture, pre-training, and fine-tuning, enhances fundamental understanding.

Principles

Tokenization is foundational to LLM data processing.
LLMs are trained via next-token prediction.
LoRA enables efficient fine-tuning of large models.
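
To make the next-token prediction principle concrete, here is a minimal sketch of how input/target pairs can be cut from a token sequence with a sliding window. It is an illustration under stated assumptions, not the workshop's code: the whitespace "tokenizer", the sample sentence, and the names `make_input_target_pairs`, `context_length`, and `stride` are all hypothetical stand-ins (the workshop itself uses BPE tokenization via `tiktoken`).

```python
def make_input_target_pairs(token_ids, context_length, stride):
    """Slide a window over token_ids; each target is its input shifted by one token."""
    pairs = []
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


# Toy word-level vocabulary (assumption: a stand-in for real BPE token IDs).
text = "the model reads text one token at a time"
words = text.split()
vocab = {word: idx for idx, word in enumerate(sorted(set(words)))}
token_ids = [vocab[word] for word in words]

pairs = make_input_target_pairs(token_ids, context_length=4, stride=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the model sees the tokens in `inputs` up to that point and is trained to predict the token at the same position in `targets`, which is why the two lists overlap in all but one element.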

Method

The workshop outlines a workflow: data preparation (tokenization, batching), architecture coding (Transformer blocks), pre-training (loss minimization), and instruction fine-tuning (using prompt templates and LoRA).
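
The prompt-template step of instruction fine-tuning can be sketched as follows. This assumes an Alpaca-style template and a record with `instruction`, `input`, and `output` fields; the exact template wording, the field names, and the helper names `format_prompt` and `format_training_text` are assumptions for illustration, not confirmed details of the workshop.

```python
def format_prompt(entry):
    """Render the instruction (and optional input) in an Alpaca-style layout (assumed)."""
    prompt = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
        f"\n\n### Instruction:\n{entry['instruction']}"
    )
    # The input section is only added when the record actually carries one.
    if entry.get("input"):
        prompt += f"\n\n### Input:\n{entry['input']}"
    return prompt


def format_training_text(entry):
    """Full training text: prompt followed by the expected response."""
    return format_prompt(entry) + f"\n\n### Response:\n{entry['output']}"


# Hypothetical record, made up for this example.
example = {
    "instruction": "Rewrite the sentence in passive voice.",
    "input": "The chef cooked the meal.",
    "output": "The meal was cooked by the chef.",
}
no_input_example = {"instruction": "Name a primary color.", "input": "", "output": "Red."}

print(format_training_text(example))
```

The formatted text is then tokenized and trained on with the same next-token objective as pre-training; only the data changes.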

In practice

Use `tiktoken` for fast, GPT-compatible tokenization.
Use LitGPT for streamlined LLM downloading and fine-tuning.
Apply LoRA to reduce trainable parameters during fine-tuning.
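
To show why LoRA reduces the number of trainable parameters, here is a small pure-Python sketch of the idea: the pre-trained weight matrix stays frozen and only two low-rank matrices are trained. The class `LoRALinear`, the helper names, and the 768-wide layer used in the arithmetic are illustrative assumptions, not LitGPT's API; a real implementation would use a tensor library.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in the two low-rank matrices A (rank x d_in) and B (d_out x rank)."""
    return rank * d_in + d_out * rank


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scaled by alpha / rank."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen: not updated during fine-tuning
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]  # zeros, so the update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]


layer = LoRALinear(weight=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], rank=1, alpha=2)
output = layer.forward([1.0, 0.0, -1.0])  # equals the frozen layer's output at initialization

# Illustrative arithmetic for a hypothetical 768 x 768 layer with rank 8.
full_params = 768 * 768
lora_params = lora_param_count(768, 768, rank=8)
print(full_params, lora_params, full_params // lora_params)
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the pre-trained one; fine-tuning then only moves `A` and `B`, which for the hypothetical layer above is 48 times fewer parameters than the full matrix.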

Topics

Large Language Models
Tokenization
LLM Architecture
Model Pre-training
Instruction Fine-tuning

Best for: Machine Learning Engineer, AI Engineer, AI Student

Editorial summary, takeaway, and curation by AIssential. Original article published by Sebastian Raschka.