Running Andrej Karpathy’s Autoresearch on a Local RTX GPU: ESG Classification Case Study

2026-03-25 · Source: Naturallanguageprocessing on Medium · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Advanced, quick

Summary

An adaptation of Andrej Karpathy's autoresearch framework trained a Transformer model for ESG text classification on a local Windows machine with an RTX GPU. The goal was to show that LLM training is accessible on consumer-grade hardware, not to surpass the original implementation's performance on H100 GPUs. Four modifications made this possible: Flash Attention was replaced with PyTorch's scaled_dot_product_attention, the Triton-based Muon optimizer was replaced with AdamW, torch.compile was disabled because of Triton incompatibility, and the model was scaled down from 26M to 14.5M parameters to fit 16GB of VRAM. The dataset was also converted to an instruction-style format. The adapted model reached a validation BPB (bits per byte) of 0.52 after processing approximately 2.3M tokens in about 5 minutes, which the article reports as a significant improvement despite the smaller model size.
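To make the reported figures concrete, here is a minimal sketch of how a BPB value and a throughput number are typically derived. It assumes BPB means bits per byte, computed as the summed cross-entropy in nats, converted to bits, and divided by the byte count of the evaluated text. The function names are hypothetical and are not taken from the article's code.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    # Dividing by ln(2) converts nats to bits.
    return total_loss_nats / (math.log(2) * total_bytes)


def tokens_per_second(total_tokens: int, seconds: float) -> float:
    """Average training throughput over a run."""
    return total_tokens / seconds


# Figures from the summary: about 2.3M tokens in about 5 minutes.
throughput = tokens_per_second(2_300_000, 5 * 60)
print(f"{throughput:.0f} tokens/s")
```

Because BPB is normalized by bytes rather than tokens, it stays comparable across tokenizers, which matters when a model and its data pipeline are both being changed.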

Key takeaway

For MLOps and deep learning engineers running LLM training experiments on local RTX GPUs, this case study suggests that useful results are achievable with a few deliberate trade-offs. Swap datacenter-specific attention kernels and optimizers for portable PyTorch equivalents, scale the model to fit available VRAM, and consider instruction-style data formatting to improve performance, rather than relying solely on datacenter-grade optimizations.

Key insights

LLM experimentation is feasible on consumer-grade GPUs with appropriate architectural and software adaptations.

Principles

Hardware constraints dictate ML design.
Simpler implementations enhance portability.
Dataset format impacts performance.
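As an illustration of the dataset-format principle, the sketch below wraps a raw (text, label) pair in an instruction-style prompt. The article does not give its template, so the section markers, the instruction wording, the sample sentence, and the label name are all assumptions made for illustration.

```python
def to_instruction_example(text: str, label: str) -> str:
    """Wrap a raw (text, label) pair in an instruction-style prompt.

    The template is hypothetical; the original article's format is not shown.
    """
    return (
        "### Instruction:\n"
        "Classify the following passage by ESG category.\n\n"
        f"### Input:\n{text.strip()}\n\n"
        f"### Response:\n{label}"
    )


# Placeholder record, not drawn from the article's dataset.
sample = to_instruction_example(
    "  The company cut emissions by 12%.  ", "Environmental"
)
print(sample)
```

Framing classification as text completion lets the same language-modeling objective and validation metric cover the task without a separate classification head.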

Method

The method adapted Karpathy's autoresearch for Windows and an RTX GPU by replacing Flash Attention with PyTorch's scaled_dot_product_attention, substituting AdamW for the Triton-based Muon optimizer, disabling torch.compile because of Triton incompatibility, and reducing the model from 26M to 14.5M parameters to fit 16GB of VRAM.
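The attention and optimizer substitutions might look roughly like the following. This is a sketch under stated assumptions, not the article's code: the class name, layer sizes, learning rate, and weight decay are hypothetical, and the model runs in eager mode with no torch.compile call.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    """Causal self-attention built on PyTorch's fused SDPA kernel."""

    def __init__(self, n_embd: int, n_head: int):
        super().__init__()
        assert n_embd % n_head == 0, "n_embd must be divisible by n_head"
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.proj = nn.Linear(n_embd, n_embd, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        head_dim = C // self.n_head
        q, k, v = self.qkv(x).split(C, dim=2)
        # Reshape (B, T, C) -> (B, n_head, T, head_dim) for SDPA.
        q = q.view(B, T, self.n_head, head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, head_dim).transpose(1, 2)
        # Stands in for a Flash Attention call; needs no Triton install.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


model = CausalSelfAttention(n_embd=64, n_head=4)  # hypothetical sizes
# AdamW in place of a Triton-based optimizer; hyperparameters are placeholders.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

x = torch.randn(2, 16, 64)
out = model(x)
loss = out.pow(2).mean()  # dummy loss, only to exercise one optimizer step
loss.backward()
optimizer.step()
```

scaled_dot_product_attention picks an available backend at runtime, so the same code runs on consumer GPUs and on CPU, at the cost of giving up kernels tuned specifically for datacenter hardware.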

In practice

Use PyTorch's scaled_dot_product_attention for RTX.
Opt for AdamW over Triton-based optimizers.
Reduce model depth for VRAM constraints.
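For the depth-reduction point, a rough estimate shows how layer count drives parameter count. The sketch uses the common approximation of about 12 × d_model² weights per Transformer block (attention plus a 4× MLP) and counts a single embedding matrix. The configuration values are hypothetical and are not the article's actual settings.

```python
def approx_params(n_layer: int, d_model: int, vocab_size: int) -> int:
    """Rough parameter count: 12 * d_model^2 per block plus one embedding."""
    per_block = 12 * d_model ** 2
    return n_layer * per_block + vocab_size * d_model


def adamw_state_bytes(n_params: int, bytes_per_value: int = 4) -> int:
    """Memory for fp32 weights, gradients, and two AdamW moment buffers."""
    return n_params * bytes_per_value * 4


# Hypothetical configuration: halving depth shrinks the block parameters.
for n_layer in (8, 4):
    n = approx_params(n_layer, d_model=384, vocab_size=8192)
    print(f"{n_layer} layers: ~{n / 1e6:.1f}M parameters")

# Weights, gradients, and optimizer state for a 14.5M-parameter model.
print(f"{adamw_state_bytes(14_500_000) / 1e6:.0f} MB")
```

At this scale the weights and optimizer state are a small share of 16GB. Activation memory, which grows with batch size, sequence length, and depth, is usually the larger cost, so reducing depth helps on both counts.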

Topics

ESG Classification
Transformer Models
Local GPU Training
Model Optimization
Autoresearch Framework

Best for: Machine Learning Engineer, Deep Learning Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Naturallanguageprocessing on Medium.