Running Andrej Karpathy’s Autoresearch on a Local RTX GPU: ESG Classification Case Study
Summary
An adaptation of Andrej Karpathy's autoresearch framework trained a Transformer model for ESG text classification on a local Windows machine with an RTX GPU. The goal was to show that LLM training is accessible on consumer-grade hardware, not to surpass the original implementation's performance on H100 GPUs. Four modifications made this possible: replacing Flash Attention with PyTorch's scaled_dot_product_attention, substituting AdamW for the Triton-based Muon optimizer, disabling torch.compile because of Triton incompatibility, and scaling the model down from 26M to 14.5M parameters to fit 16GB of VRAM. The dataset was also converted to an instruction-style format. The optimized model reached a validation BPB (bits per byte) of 0.52 after processing approximately 2.3M tokens in about 5 minutes (roughly 7,700 tokens per second), which the article presents as a significant improvement despite the smaller model.
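The BPB figure quoted above is a cross-entropy loss expressed in bits and normalized by the number of UTF-8 bytes in the evaluated text, which makes it comparable across tokenizers. The following is a minimal sketch of that conversion. The function name and the sample numbers are illustrative assumptions, not the article's evaluation code.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte.

    total_nll_nats: sum of per-token negative log-likelihoods, natural log.
    total_bytes:    number of UTF-8 bytes covered by those tokens.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits; dividing by bytes normalizes.
    return total_nll_nats / (math.log(2) * total_bytes)


if __name__ == "__main__":
    # Hypothetical values for illustration only (not taken from the article).
    mean_loss_per_token = 1.5   # nats per token
    n_tokens = 1_000
    n_bytes = 4_200
    print(f"{bits_per_byte(mean_loss_per_token * n_tokens, n_bytes):.3f} bits/byte")
```

Because the denominator counts bytes rather than tokens, a model with a different vocabulary can be scored on the same scale.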
Key takeaway
For MLOps or deep learning engineers running LLM training experiments on local RTX GPUs, this case study shows that useful results are achievable with deliberate trade-offs. Prioritize adapting the attention mechanism and optimizer to consumer hardware, scale the model to fit available VRAM, and consider instruction-style data formatting, rather than relying solely on datacenter-grade optimizations.
Key insights
LLM experimentation is feasible on consumer-grade GPUs with appropriate architectural and software adaptations.
Principles
- Hardware constraints dictate ML design.
- Simpler implementations enhance portability.
- Dataset format impacts performance.
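To make the dataset-format principle concrete, here is a minimal sketch of turning a labeled ESG sentence into an instruction-style training string. The template wording, the function name, and the three-way label set are assumptions chosen for illustration. The summary does not state the exact format the author used.

```python
# Hypothetical instruction-style template; the original article's exact
# prompt wording and label set are not given in this summary.
TEMPLATE = (
    "### Instruction:\n"
    "Classify the following text as Environmental, Social, or Governance.\n"
    "### Input:\n"
    "{text}\n"
    "### Response:\n"
    "{label}"
)


def to_instruction_example(text: str, label: str) -> str:
    """Render one (text, label) pair as a single training string."""
    return TEMPLATE.format(text=text.strip(), label=label.strip())


if __name__ == "__main__":
    sample = to_instruction_example(
        "The company cut its carbon emissions by switching suppliers.",
        "Environmental",
    )
    print(sample)
```

With this layout the label becomes ordinary next-token text, so the same language-modeling loss that trains the model also teaches the classification.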
Method
The method adapts Karpathy's autoresearch to Windows and an RTX GPU in four steps: replace Flash Attention with PyTorch's scaled_dot_product_attention, swap the Triton-based Muon optimizer for AdamW, disable torch.compile, and shrink the model from 26M to 14.5M parameters to fit 16GB of VRAM.
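The attention and optimizer swaps described above can be sketched as follows, assuming PyTorch 2.0 or later. The module name, layer sizes, and hyperparameters are illustrative assumptions and are not taken from the autoresearch code base.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Self-attention using PyTorch's built-in SDPA instead of Flash Attention."""

    def __init__(self, d_model: int, n_head: int):
        super().__init__()
        assert d_model % n_head == 0, "d_model must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        head_dim = C // self.n_head
        # Reshape (B, T, C) -> (B, n_head, T, head_dim) as SDPA expects.
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # PyTorch picks an available backend; is_causal applies the causal mask.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


# Illustrative sizes and hyperparameters (assumptions, not the article's values).
model = CausalSelfAttention(d_model=64, n_head=4)
# The model is used eagerly, i.e. it is not wrapped in torch.compile.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

x = torch.randn(2, 16, 64)
loss = model(x).pow(2).mean()   # dummy loss, just to exercise one step
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Both calls are part of core PyTorch and need no Triton kernels at import time, which is why this combination is the more portable choice on a Windows setup.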
In practice
- Use PyTorch's scaled_dot_product_attention in place of Flash Attention on RTX GPUs.
- Opt for AdamW over Triton-based optimizers such as Muon where Triton is incompatible.
- Reduce model depth to fit within available VRAM.
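To judge how far depth must be reduced for a given VRAM budget, a rough estimate of parameter count and optimizer memory can help. The sketch below assumes a GPT-style block with a 4x MLP expansion (about 12 * d_model^2 weights per layer) and fp32 training with AdamW. The function names and the example configuration are hypothetical, and activation memory, which often dominates, is deliberately left out.

```python
def estimate_params(n_layer: int, d_model: int, vocab_size: int) -> int:
    """Approximate parameter count of a GPT-style Transformer.

    Per block: 4*d^2 for attention (q, k, v, output projections) plus
    8*d^2 for an MLP with 4x expansion. Biases, norms, and positional
    parameters are ignored, so this is an estimate only.
    """
    return 12 * n_layer * d_model**2 + vocab_size * d_model


def adamw_training_bytes(n_params: int, bytes_per_value: int = 4) -> int:
    """Memory for weights, gradients, and AdamW's two moment buffers.

    Excludes activations, which depend on batch size and sequence length.
    """
    return 4 * n_params * bytes_per_value


if __name__ == "__main__":
    # Hypothetical configuration, used only to show the effect of depth.
    for depth in (8, 6, 4):
        n = estimate_params(n_layer=depth, d_model=512, vocab_size=8192)
        mib = adamw_training_bytes(n) / 2**20
        print(f"depth={depth}: ~{n / 1e6:.1f}M params, ~{mib:.0f} MiB for weights+grads+AdamW")
```

The estimate scales linearly with depth, so removing layers is a predictable way to shrink both the model and its optimizer state.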
Topics
- ESG Classification
- Transformer Models
- Local GPU Training
- Model Optimization
- Autoresearch Framework
Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.