unslothai / unsloth

2023-11-29 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Data Science & Analytics · Depth: Intermediate, long

Summary

Unsloth is a library for accelerating the fine-tuning of large language models (LLMs) such as gpt-oss, DeepSeek, Gemma, Qwen, and Llama. The project claims up to 2x faster training and up to 80% lower VRAM usage than a Hugging Face + FlashAttention-2 (FA2) baseline, which leaves room for longer context windows. For instance, it reports a context length of 89,389 tokens for Llama 3.3 (70B) and 342,733 tokens for Llama 3.1 (8B), each on an 80GB GPU. Unsloth supports full fine-tuning, pretraining, and 4-bit, 16-bit, and FP8 training, and runs on NVIDIA, AMD, and Intel GPUs. It also offers pre-built Docker images and free notebooks for models such as gpt-oss (20B), Qwen3, Gemma 3, and Llama 3.1.

Key takeaway

For NLP engineers fine-tuning LLMs, Unsloth is a way to cut training time and VRAM consumption, especially with large models and long contexts. Consider integrating it into your workflow if you want to train models like Llama 3.3 (70B) with a reported 13x longer context on an 80GB GPU, or to run FP8 reinforcement learning on consumer GPUs. The project's free notebooks are a quick way to check the gains on your own models.

Key insights

Unsloth significantly accelerates LLM fine-tuning and reduces VRAM, enabling longer context windows and broader model support.

Principles

Optimize kernels for speed and VRAM efficiency.
Support diverse quantization and training methods.
Enable long context windows for complex tasks.
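
To make the quantization principle concrete, here is a minimal sketch of uniform absmax 4-bit quantization over one block of weights. It is a generic textbook scheme, not Unsloth's dynamic 4-bit method, and the function names and sample weights are hypothetical.

```python
def quantize_block_4bit(values):
    """Map a block of floats to signed 4-bit codes in [-7, 7] plus one scale."""
    absmax = max(abs(v) for v in values)
    # One shared scale per block; guard against an all-zero block.
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize_block_4bit(codes, scale):
    """Recover approximate floats from the 4-bit codes and the block scale."""
    return [c * scale for c in codes]


# Hypothetical weights, chosen only for illustration.
weights = [0.7, -0.3, 0.12, 0.0]
codes, scale = quantize_block_4bit(weights)
restored = dequantize_block_4bit(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in 4 bits instead of 16, at the cost of a rounding error bounded by half the block scale. Production schemes typically use non-uniform levels and store the scales compactly as well.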

Method

Unsloth uses custom Triton kernels and manual backpropagation, together with padding-free batching, sequence packing, and dynamic 4-bit quantization, to improve training speed and reduce memory footprint.
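
The padding-free packing idea can be sketched in a few lines of plain Python. This is an illustration of the general technique under simple assumptions, not Unsloth's implementation: sequences are concatenated into one flat buffer, position IDs restart at each sequence, and cumulative lengths mark the boundaries so attention can be kept within each sequence. The helper names and token values are hypothetical.

```python
from itertools import accumulate


def pack_padding_free(sequences):
    """Concatenate variable-length token sequences without padding.

    Returns (tokens, position_ids, cu_seqlens), where cu_seqlens holds the
    cumulative boundaries of each sequence inside the flat buffer.
    """
    tokens = [tok for seq in sequences for tok in seq]
    # Positions restart at 0 for every sequence.
    position_ids = [i for seq in sequences for i in range(len(seq))]
    cu_seqlens = [0] + list(accumulate(len(seq) for seq in sequences))
    return tokens, position_ids, cu_seqlens


def padded_token_count(sequences):
    """Tokens processed if every sequence is padded to the longest one."""
    return len(sequences) * max(len(seq) for seq in sequences)


# Hypothetical token IDs, chosen only for illustration.
batch = [[11, 12, 13], [21], [31, 32, 33, 34, 35]]
tokens, position_ids, cu_seqlens = pack_padding_free(batch)
saved = padded_token_count(batch) - len(tokens)
```

In this toy batch, padding to the longest sequence would process 15 token slots, while the packed buffer holds only the 9 real tokens. The saving grows as sequence lengths within a batch become more uneven.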

In practice

Fine-tune gpt-oss (20B) with 70% less VRAM.
Achieve 500K context length on an 80GB GPU.
Deploy models to GGUF, vLLM, or SGLang.
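
A back-of-the-envelope calculation shows why weight precision matters at these sizes. The sketch below counts weight storage only and ignores activations, gradients, optimizer state, and quantization metadata, so it is a lower bound and not a measurement of Unsloth. The function name is hypothetical.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


params_70b = 70e9  # parameter count of a 70B model

weights_16bit_gb = weight_memory_gb(params_70b, 16)
weights_4bit_gb = weight_memory_gb(params_70b, 4)

fits_16bit = weights_16bit_gb <= 80  # 80GB GPU, weights only
fits_4bit = weights_4bit_gb <= 80
```

Under these assumptions, the 16-bit weights of a 70B model come to 140 GB and exceed an 80GB GPU on their own, while the 4-bit weights come to 35 GB and leave headroom for activations and context.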

Topics

LLM Fine-tuning
VRAM Optimization
Reinforcement Learning
Large Language Models
Quantization

Best for: NLP Engineer, Machine Learning Engineer, AI Engineer, Data Scientist

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.