lyogavin / airllm

2023-06-12 · Source: Github Trending: All languages · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Intermediate, medium

Summary

AirLLM is an open-source tool that reduces memory usage during large language model inference, allowing 70B models to run on a single 4GB GPU card without quantization, distillation, or pruning. It also supports running the 405B Llama3.1 model on 8GB of VRAM. The tool additionally offers optional block-wise quantization-based model compression, which can speed up inference by up to 3x with minimal accuracy loss, primarily by reducing the size of the model that has to be loaded. AirLLM supports a wide range of models, including Llama3, Qwen2.5, ChatGLM, Baichuan, Mistral, and InternLM, and offers CPU inference as well as macOS support for 70B models. Other features include an "AutoModel" class that detects the model type automatically and prefetching, which gives a 10% speed improvement.

Key takeaway

For MLOps engineers deploying large language models on resource-constrained hardware, AirLLM addresses the VRAM bottleneck directly. It can run 70B LLMs on a single 4GB GPU, or 405B Llama3.1 on 8GB of VRAM, without requiring quantization. Consider integrating AirLLM to lower infrastructure costs and widen deployment options for large models, especially for local or edge inference; its optional compression can additionally speed up inference by up to 3x.

Key insights

AirLLM enables running large language models on low-VRAM GPUs by optimizing inference memory.

Principles

Memory optimization can remove the need to quantize large LLMs before running them on low-VRAM hardware.
Block-wise weight quantization can speed up inference by up to 3x.
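
To make the second principle concrete, here is a minimal sketch of block-wise absmax quantization in plain Python. It is not AirLLM's implementation: the function names, the block size of 4, and the signed 4-bit range are assumptions chosen for illustration. Each block of weights gets its own scale, so a single outlier only reduces precision within its own block.

```python
from typing import List, Tuple


def quantize_blockwise(
    weights: List[float], block_size: int = 4, bits: int = 4
) -> Tuple[List[int], List[float]]:
    """Quantize weights block by block, with one absmax scale per block.

    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (bits - 1) - 1  # 7 for a signed 4-bit code
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # Avoid division by zero for an all-zero block.
        scale = absmax / qmax if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(
    codes: List[int], scales: List[float], block_size: int = 4
) -> List[float]:
    """Map integer codes back to approximate weights using each block's scale."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


# The second block contains a large value; it does not affect the first block's scale.
example_weights = [0.1, -0.2, 0.05, 0.7, 10.0, -5.0, 2.5, 0.0]
example_codes, example_scales = quantize_blockwise(example_weights)
example_restored = dequantize_blockwise(example_codes, example_scales)
```

In this scheme the stored model consists of the small integer codes plus one scale per block, which is why the data to be loaded shrinks.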

Method

Install "airllm" (for example with "pip install airllm"), create a model with "AutoModel.from_pretrained()" and a model ID, then call "model.generate()". To enable compression, pass "compression='4bit'" or "compression='8bit'" to "from_pretrained()".
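
The steps above can be sketched as follows. This is a hedged example, not a definitive recipe: "build_load_kwargs" and "generate_text" are hypothetical helper names, and the tokenizer and "generate()" arguments are modeled on the project's README, so treat them as assumptions and check them against the installed version. The ".cuda()" call assumes a CUDA-capable GPU.

```python
from typing import Optional

# Compression modes named in the summary; None means no compression.
SUPPORTED_COMPRESSION = (None, "4bit", "8bit")


def build_load_kwargs(compression: Optional[str] = None) -> dict:
    """Return keyword arguments for AutoModel.from_pretrained() (hypothetical helper)."""
    if compression not in SUPPORTED_COMPRESSION:
        raise ValueError(f"compression must be one of {SUPPORTED_COMPRESSION}")
    return {} if compression is None else {"compression": compression}


def generate_text(
    model_id: str,
    prompt: str,
    compression: Optional[str] = None,
    max_new_tokens: int = 20,
) -> str:
    """Load a model through AirLLM and generate a short continuation."""
    # Imported lazily so this module can be loaded without airllm installed.
    from airllm import AutoModel

    model = AutoModel.from_pretrained(model_id, **build_load_kwargs(compression))

    # Assumption: the model exposes its tokenizer as model.tokenizer.
    tokens = model.tokenizer(
        [prompt],
        return_tensors="pt",
        return_attention_mask=False,
        truncation=True,
        max_length=128,
        padding=False,
    )
    output = model.generate(
        tokens["input_ids"].cuda(),
        max_new_tokens=max_new_tokens,
        use_cache=True,
        return_dict_in_generate=True,
    )
    return model.tokenizer.decode(output.sequences[0])
```

A call would look like generate_text(model_id, "What is the capital of the United States?", compression="4bit"), where model_id is any model ID that AirLLM supports.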

In practice

Run 70B LLMs on a single 4GB GPU.
Deploy 405B Llama3.1 on 8GB VRAM.
Achieve up to 3x inference speed-up with compression.
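
A rough back-of-the-envelope estimate helps put these figures in context. The sketch below computes raw weight storage for a given parameter count and bit width. It assumes 16-bit weights as the uncompressed baseline (an assumption, not stated in the source) and ignores quantization scales, activations, and the KV cache.

```python
def weight_size_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate raw weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare the assumed 16-bit baseline with the 8-bit and 4-bit compression modes.
for bits in (16, 8, 4):
    print(f"70B parameters at {bits}-bit: {weight_size_gb(70e9, bits):.0f} GB")
```

Under these assumptions, 4-bit storage is a quarter of the 16-bit size, which fits the summary's point that compression speeds up inference mainly by reducing how much has to be loaded. Even at 4-bit the weights are far larger than 4GB, so the low VRAM figures cannot come from compression alone.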

Topics

LLM Inference Optimization
Low-VRAM Deployment
Block-wise Quantization
Llama3.1
GPU Memory Management
AirLLM

Best for: NLP Engineer, Entrepreneur, Machine Learning Engineer, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Github Trending: All languages.