I got an old server with lots of RAM, but no GPU, and ended up getting Grok 2 running anyway ;)
Summary
A user ran Grok 2, a large language model, on a Dell R640 1U server with dual Xeon Platinum 8268 processors and 1.5 TB of 2666 MHz RAM, with no dedicated GPU. The setup reached a prompt processing speed of 4.73 tokens/second and a generation speed of 1.35 tokens/second, with a 512K context and web search support. The configuration relied on the server's NUMA layout and 40 threads. The user is now asking how to fit Tesla GPUs into the 1U server's stock risers without physical modification, and for general recommendations on similar GPU-less AI builds.
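To put the reported throughput in perspective, the following is a rough sketch of the arithmetic for estimating response time from the two figures above. It assumes both rates stay constant regardless of prompt length, which is a simplification; the function name and the example token counts are illustrative and do not come from the original post.

```python
# Throughput figures reported in the summary (tokens per second).
PROMPT_TOKENS_PER_SEC = 4.73
GEN_TOKENS_PER_SEC = 1.35


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate wall-clock seconds for one request.

    Assumption: prompt processing and generation run back to back at the
    constant rates above. Real throughput usually varies with context size.
    """
    prompt_time = prompt_tokens / PROMPT_TOKENS_PER_SEC
    generation_time = output_tokens / GEN_TOKENS_PER_SEC
    return prompt_time + generation_time


if __name__ == "__main__":
    # Hypothetical request: 1,000 prompt tokens and a 200-token answer.
    seconds = estimate_seconds(1000, 200)
    print(f"{seconds:.0f} s (about {seconds / 60:.1f} minutes)")
```

Under these assumptions, even a modest request takes minutes rather than seconds, which is why the setup fits batch or low-urgency workloads better than interactive use.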
Key takeaway
AI engineers evaluating LLM deployment on existing server infrastructure without GPUs should note that high-RAM, multi-core CPU servers can run models like Grok 2. Performance will be well below GPU-accelerated setups (here, 1.35 tokens/second for generation), so this approach is best suited to interim use or less demanding inference tasks. Before buying GPUs for such a server, check the card dimensions against the server's riser compatibility.
Key insights
Large language models like Grok 2 can operate on CPU-only servers with substantial RAM, albeit with reduced performance.
Principles
- High RAM capacity lets a large model fit in memory without a GPU, at the cost of throughput.
- NUMA-aware configuration can improve CPU-based LLM performance.
In practice
- Use high-RAM servers for CPU-only LLM inference.
- Tune NUMA settings and thread counts for performance.
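The original post does not say how NUMA and thread counts were configured, so the following is only a sketch of one way to inspect the topology on Linux and spread a thread budget evenly across nodes. It reads the standard sysfs `cpulist` files; the helper names (`parse_cpulist`, `read_numa_nodes`, `split_threads`) are hypothetical, and the resulting plan would still need to be passed to whatever inference runtime is in use.

```python
from pathlib import Path


def parse_cpulist(text: str) -> list[int]:
    """Expand a Linux cpulist string such as '0-3,8' into [0, 1, 2, 3, 8]."""
    cpus: list[int] = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.extend(range(int(low), int(high) + 1))
        else:
            cpus.append(int(part))
    return cpus


def read_numa_nodes(sysfs_root: str = "/sys/devices/system/node") -> dict[int, list[int]]:
    """Map each NUMA node id to its CPU ids (empty dict if sysfs is absent)."""
    nodes: dict[int, list[int]] = {}
    for node_dir in sorted(Path(sysfs_root).glob("node[0-9]*")):
        cpulist = (node_dir / "cpulist").read_text()
        nodes[int(node_dir.name[len("node"):])] = parse_cpulist(cpulist)
    return nodes


def split_threads(total_threads: int, nodes: dict[int, list[int]]) -> dict[int, int]:
    """Divide a thread budget evenly across nodes, capped by each node's CPU count."""
    if not nodes:
        return {}
    base, extra = divmod(total_threads, len(nodes))
    plan: dict[int, int] = {}
    for index, (node_id, cpus) in enumerate(sorted(nodes.items())):
        wanted = base + (1 if index < extra else 0)
        plan[node_id] = min(wanted, len(cpus))
    return plan


if __name__ == "__main__":
    # 40 threads matches the count reported in the summary.
    print(split_threads(40, read_numa_nodes()))
```

Balancing threads per node is a common starting point on dual-socket machines, since memory attached to the other socket is slower to reach; the best split for a given model and runtime still has to be found by measurement.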
Topics
- Grok 2
- Dell R640
- Server Hardware
- Large Language Models
- GPU-less AI
Best for: Machine Learning Engineer, AI Engineer, AI Hardware Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.