How I topped the Open LLM Leaderboard using 2x 4090 GPUs - Research notes in Blog form

2026-03-10 · Source: Machine Learning · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Large Language Models · Depth: Advanced, medium

Summary

An independent researcher found that duplicating one specific block of roughly seven middle layers in Qwen2-72B improved its scores on every Open LLM Leaderboard benchmark and took the model to #1. The work was done on two RTX 4090 GPUs. The result suggests that pre-training carves out discrete, functional "circuits" within a Transformer's layer stack, and that these circuits keep working when rearranged or duplicated. It also implies that Transformer layers are more interchangeable than previously assumed: they can process out-of-order hidden states without collapsing. Because the duplicated layers reuse weights already in VRAM, the method is pitched as a "free upgrade" that could reduce the need for extensive fine-tuning or RLHF to improve model capability.

Key takeaway

AI engineers and research scientists optimizing LLM performance on consumer hardware should explore layer duplication as a low-compute method. It improved Qwen2-72B's leaderboard standing, which suggests that giving a model "more layers to think with" through architectural rearrangement alone can yield significant gains without retraining. Consider trying the approach on your existing models; the potential payoff is a "free upgrade" in benchmark performance, at the cost of the extra compute for the repeated layers.

Key insights

Duplicating specific Transformer layer blocks can significantly boost LLM performance without weight modification.

Principles

Transformers form robust, functional layer circuits.
Internal representations are homogeneous and flexible.
Layer duplication can enhance model reasoning.

Method

Identify a specific block of ~7 middle layers in a pre-trained Transformer model and run it a second time in the forward pass. The repeated block reuses the weights already in VRAM: each repeated layer needs its own KV cache, but no weights change.
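
To make the method concrete, here is a minimal sketch of the layer execution order it implies. This is an illustration under stated assumptions, not the author's code: the function names are hypothetical, the 80-layer depth is assumed for a Qwen2-72B-sized stack, and the start index 40 is a placeholder, not the block reported in the original article. The toy "layers" are plain Python callables standing in for Transformer blocks.

```python
from typing import Callable, List, Sequence


def duplicated_order(n_layers: int, start: int, length: int, repeats: int = 2) -> List[int]:
    """Return the order in which layers run when one block is repeated.

    Indices before the block run once, the block runs `repeats` times
    back to back, then the remaining layers run once.
    """
    if start < 0 or length < 1 or repeats < 1 or start + length > n_layers:
        raise ValueError("block must lie inside the layer stack")
    block = list(range(start, start + length))
    head = list(range(start))
    tail = list(range(start + length, n_layers))
    return head + block * repeats + tail


def run_stack(layers: Sequence[Callable], order: Sequence[int], hidden):
    """Apply layers in the given order; repeated indices reuse the same object."""
    for idx in order:
        hidden = layers[idx](hidden)
    return hidden


N_LAYERS = 80  # assumption: depth of a Qwen2-72B-sized stack
START = 40     # hypothetical start index, chosen only for illustration

order = duplicated_order(N_LAYERS, START, length=7)

# Toy layers that record their own index, so the execution trace is visible.
toy_layers = [lambda trace, i=i: trace + [i] for i in range(N_LAYERS)]
trace = run_stack(toy_layers, order, [])

print(len(order), len(set(order)))  # 87 executed positions, 80 distinct layers
```

The two counts reflect the trade-off described above: the number of distinct layers (and therefore weights) is unchanged, while the number of executed positions grows by the block length, and each executed position would need its own KV cache slot.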

In practice

Experiment with duplicating 7-layer blocks in Qwen-like models.
Investigate looping circuits instead of outright duplication.
Test this method on other Transformer architectures.
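
The experiments above can be organized as a sweep over candidate blocks. The sketch below is one hypothetical way to structure it: `looped_order`, `sweep`, and `toy_score` are illustrative names, and `toy_score` is a stand-in with no relation to real benchmark results. In practice the scoring function would patch a model to run the given layer order and evaluate it on a benchmark of your choice.

```python
from typing import Callable, List, Tuple


def looped_order(n_layers: int, start: int, length: int, passes: int) -> List[int]:
    """Layer order where the block [start, start + length) runs `passes` times."""
    if start < 0 or length < 1 or passes < 1 or start + length > n_layers:
        raise ValueError("block must lie inside the layer stack")
    block = list(range(start, start + length))
    return list(range(start)) + block * passes + list(range(start + length, n_layers))


def sweep(
    n_layers: int,
    length: int,
    margin: int,
    passes: int,
    score_fn: Callable[[List[int]], float],
) -> List[Tuple[float, int]]:
    """Score every block of `length` layers that stays `margin` layers away
    from both ends of the stack. Returns (score, start) pairs, best first."""
    results = []
    for start in range(margin, n_layers - margin - length + 1):
        order = looped_order(n_layers, start, length, passes)
        results.append((score_fn(order), start))
    return sorted(results, reverse=True)


def toy_score(order: List[int]) -> float:
    """Placeholder score: counts executions of layers 10..13.
    A real score_fn would run the modified model on an evaluation set."""
    return float(sum(1 for idx in order if 10 <= idx <= 13))


ranked = sweep(n_layers=20, length=4, margin=2, passes=2, score_fn=toy_score)
best_score, best_start = ranked[0]
print(best_start, best_score)
```

Setting `passes=2` corresponds to plain duplication, while larger values correspond to looping the same block several times. Because every candidate reuses the same weights, a sweep like this needs only forward passes and no training.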

Topics

LLM Architecture
Layer Duplication
Transformer Circuits
Open LLM Leaderboard
Low-Compute AI

Code references

dnhkng/GLaDOS

Best for: AI Scientist, Research Scientist, AI Engineer, AI Researcher, Machine Learning Engineer, Deep Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Machine Learning.