How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form
Summary
An independent researcher found that duplicating a specific block of roughly seven middle layers in Qwen2-72B improved its scores across all Open LLM Leaderboard benchmarks and took the model to the #1 ranking. The technique was developed on two RTX 4090 GPUs. It suggests that pre-training carves out discrete, functional "circuits" within a Transformer's layer stack, and that these circuits keep working when rearranged or duplicated. If so, Transformer layers are more interchangeable than previously assumed and can process out-of-order hidden states without collapse. Because the duplicated layers reuse weights already in VRAM, the method amounts to a "free upgrade" that may reduce the need for extensive fine-tuning or RLHF to improve a model's capabilities.
Key takeaway
AI engineers and research scientists optimizing LLM performance on consumer hardware should explore layer duplication as a low-compute method. The technique improved Qwen2-72B's leaderboard standing, which suggests that giving a model "more layers to think with" through architectural rearrangement can yield significant gains without retraining. Consider replicating the approach on your existing models to see whether the same "free upgrade" carries over.
Key insights
Duplicating specific Transformer layer blocks can significantly boost LLM performance without weight modification.
Principles
- Transformers form robust, functional layer circuits.
- Internal representations are homogeneous and flexible.
- Layer duplication can enhance model reasoning.
Method
Identify a specific block of about seven middle layers in a pre-trained Transformer model and run it a second time in the forward pass. The repeated block reuses the weights already in VRAM, so no weights change, but each repeated layer needs its own KV cache.
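As a rough illustration of the bookkeeping involved, the sketch below builds the layer execution order for a stack in which one contiguous block is repeated. It is a minimal sketch in plain Python, not the researcher's code: the function name `duplicate_block`, the 12-layer toy stack, and the chosen `start` and `length` are all assumptions made for the example, and the summary does not say which layer indices were duplicated in Qwen2-72B.

```python
from typing import List


def duplicate_block(num_layers: int, start: int, length: int, repeats: int = 2) -> List[int]:
    """Return the layer execution order with one contiguous block repeated.

    Indices refer to the original layers, so a repeated index means the same
    weights are reused rather than copied.
    """
    if start < 0 or length <= 0 or start + length > num_layers:
        raise ValueError("block must lie inside the layer stack")
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    before = list(range(start))
    block = list(range(start, start + length))
    after = list(range(start + length, num_layers))
    return before + block * repeats + after


# Toy 12-layer stack. start=4 and length=3 are arbitrary illustrative values,
# not the block reported in the original article.
order = duplicate_block(num_layers=12, start=4, length=3)

# Weights are shared, so the number of distinct layers is unchanged...
unique_weights = len(set(order))
# ...but every position in the execution order needs its own KV cache entry.
kv_cache_slots = len(order)

print(order)
print(unique_weights, kv_cache_slots)
```

In a real model the same idea would mean building the forward pass from this index list while pointing repeated indices at the same layer objects, and allocating one KV cache per position in the list rather than per unique layer.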
In practice
- Experiment with duplicating 7-layer blocks in Qwen-like models.
- Investigate looping circuits instead of outright duplication.
- Test this method on other Transformer architectures.
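The second suggestion above, looping a circuit rather than duplicating it, can be sketched as running the same block for several passes inside the forward loop. The example below uses toy integer functions in place of Transformer layers purely to show the control flow; `forward_looped`, the toy layers, and the block position are hypothetical and say nothing about which blocks or pass counts would help a real model.

```python
from typing import Callable, List

Layer = Callable[[int], int]


def forward_looped(layers: List[Layer], h: int, start: int, length: int, passes: int) -> int:
    """Apply layers in order, running the block [start, start+length) `passes` times."""
    if start < 0 or length <= 0 or start + length > len(layers):
        raise ValueError("block must lie inside the layer stack")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    for layer in layers[:start]:
        h = layer(h)
    for _ in range(passes):
        # The same layer objects are revisited; nothing is copied.
        for layer in layers[start:start + length]:
            h = layer(h)
    for layer in layers[start + length:]:
        h = layer(h)
    return h


# Toy stand-ins for layers: each maps a "hidden state" (an int) to a new one.
toy_layers: List[Layer] = [
    lambda h: h + 1,
    lambda h: h * 2,
    lambda h: h - 3,
    lambda h: h + 10,
]

baseline = forward_looped(toy_layers, 1, start=1, length=2, passes=1)
looped = forward_looped(toy_layers, 1, start=1, length=2, passes=2)
print(baseline, looped)
```

With `passes=2` this is equivalent to duplicating the block once; treating the pass count as a parameter makes it straightforward to sweep block position, block length, and number of passes when testing other architectures.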
Topics
- LLM Architecture
- Layer Duplication
- Transformer Circuits
- Open LLM Leaderboard
- Low-Compute AI
Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.