Bigger Isn't Always Better: A Comparative Evaluation of LLMs for Automated Code Review

2026-06-16 · Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, long

Summary

A systematic evaluation of five large language models for automated code review (Claude Sonnet 4.6, Claude Haiku 4.5, GPT-5.4 mini, Minimax M2.7, and GLM-5 Turbo) used 150 code review samples: 100 synthetic bugs and 50 real bug-fix pull requests. The study found that Claude Haiku 4.5, a smaller and cheaper model, consistently outperformed the larger Claude Sonnet 4.6, with an F1 score of 0.365 versus 0.343 and 18% higher recall at 3.2x lower cost per review. The finding was replicated across three experimental conditions and confirmed on the independent Martian Code Review Benchmark. The evaluation also exposed a critical "synthetic-to-real gap": F1 dropped by 92% (from 0.847 to 0.066) when models were assessed solely on real-world pull requests. Diff size emerged as the primary determinant of performance, with F1 falling from 0.657 on diffs under 10 lines to 0.043 on diffs exceeding 150 lines. All models also showed near-zero recall for performance-related bugs.
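
The relative changes quoted above can be checked directly from the reported F1 scores. The short Python sketch below is only a sanity check on the arithmetic; the helper name `relative_drop` is an illustrative assumption, not something taken from the paper.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease going from `before` to `after`."""
    return (before - after) / before


# F1 scores as reported in the summary above.
synthetic_to_real = relative_drop(0.847, 0.066)
small_to_large_diff = relative_drop(0.657, 0.043)
haiku_f1_gain = 0.365 / 0.343 - 1  # Haiku 4.5 relative to Sonnet 4.6

print(f"synthetic -> real F1 drop: {synthetic_to_real:.1%}")          # 92.2%
print(f"<10 -> >150 line diff F1 drop: {small_to_large_diff:.1%}")     # 93.5%
print(f"Haiku 4.5 F1 gain over Sonnet 4.6: {haiku_f1_gain:.1%}")       # 6.4%
```

The diff-size drop (about 93%) is of the same magnitude as the synthetic-to-real gap, while the Haiku-over-Sonnet F1 advantage is comparatively small (about 6% relative), so the cost difference carries much of the practical argument.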

Key takeaway

AI engineers selecting an LLM for automated code review should consider smaller, cost-efficient models such as Claude Haiku 4.5 first. In this evaluation it consistently outperformed the larger Claude Sonnet 4.6, with higher recall at 3.2x lower cost per review. Split large pull request diffs into smaller chunks before review to mitigate the severe performance degradation observed on large code changes. Supplement LLM-based review with deterministic static analysis for performance-related bugs, where the evaluated models showed near-zero recall.

Key insights

Smaller, cheaper LLMs can outperform larger models for automated code review, particularly on real-world data.

Principles

Larger LLMs don't guarantee better code review.
Synthetic benchmarks overstate LLM code review ability.
Diff size critically impacts code review F1 score.
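
To see how diff size affects F1 on your own review data, one option is to bucket reviews by changed-line count and compute F1 per bucket. The sketch below assumes each review has already been scored into true positives, false positives, and false negatives. Only the "<10" and ">150" edges come from the summary; the middle bucket, the pooled (micro-averaged) F1, and the toy counts are assumptions for illustration, not the study's data or method.

```python
from collections import defaultdict


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when all counts are zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def size_bucket(changed_lines: int) -> str:
    # Only the "<10" and ">150" edges appear in the summary; the middle
    # bucket is an assumption made for this sketch.
    if changed_lines < 10:
        return "<10"
    if changed_lines <= 150:
        return "10-150"
    return ">150"


def f1_by_bucket(reviews):
    """reviews: iterable of (changed_lines, tp, fp, fn), one tuple per diff."""
    totals = defaultdict(lambda: [0, 0, 0])
    for changed_lines, tp, fp, fn in reviews:
        counts = totals[size_bucket(changed_lines)]
        counts[0] += tp
        counts[1] += fp
        counts[2] += fn
    # Counts are pooled within each bucket before computing F1.
    return {name: f1_score(*counts) for name, counts in totals.items()}


# Made-up counts purely to exercise the code; not data from the study.
toy_reviews = [(4, 1, 0, 0), (8, 1, 1, 0), (60, 1, 2, 1), (400, 0, 3, 2)]
bucket_f1 = f1_by_bucket(toy_reviews)
print(bucket_f1)
```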

Method

The evaluation framework runs in two passes: deterministic matching resolves the clear cases (~70%), and Claude Opus 4.6 acts as an LLM judge only for the remaining ambiguous findings, which reduces cost and improves accuracy.
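
The two-pass idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Finding` type, the line-distance tolerance, the rule for what counts as "ambiguous," and the keyword-based `stub_judge` (standing in for a real LLM judge call) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    description: str


def deterministic_match(finding: Finding, truth: Finding,
                        tolerance: int = 3) -> Optional[bool]:
    """Pass 1: True/False for clear cases, None when the case is ambiguous."""
    if finding.file != truth.file:
        return False  # wrong file: clear miss
    if abs(finding.line - truth.line) <= tolerance:
        return True   # same file, nearby line: clear hit
    return None       # same file, distant line: defer to the judge


def evaluate(findings: List[Finding], truth: Finding,
             judge: Callable[[Finding, Finding], bool]) -> Tuple[bool, int]:
    """Return (bug_was_found, number_of_judge_calls) for one review sample."""
    judge_calls = 0
    for finding in findings:
        verdict = deterministic_match(finding, truth)
        if verdict is None:
            # Pass 2: only ambiguous findings reach the (expensive) judge.
            judge_calls += 1
            verdict = judge(finding, truth)
        if verdict:
            return True, judge_calls
    return False, judge_calls


def stub_judge(finding: Finding, truth: Finding) -> bool:
    # Hypothetical stand-in for an LLM judge: crude keyword overlap.
    return "eviction" in finding.description and "eviction" in truth.description


truth = Finding("cache.py", 42, "off-by-one in eviction loop")
findings = [
    Finding("api.py", 10, "unused import"),
    Finding("cache.py", 90, "eviction loop skips last entry"),
]
result = evaluate(findings, truth, stub_judge)
print(result)  # (True, 1): one finding resolved deterministically, one judged
```

Passing the judge in as a callable keeps the deterministic pass testable on its own and makes it easy to count how many findings actually need an LLM call.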

In practice

Use smaller, cost-effective LLMs for code review.
Preprocess large diffs into smaller chunks.
Supplement LLMs with static analysis for performance bugs.
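
One way to act on the chunking advice is to split a pull request diff at hunk boundaries and pack hunks into chunks under a line budget. The sketch below assumes git-style unified diffs (files introduced by `diff --git`, hunks by `@@`). The function names and the `max_lines` budget are illustrative assumptions, and the summary does not say which chunk size works best.

```python
from typing import List


def split_hunks(diff_text: str) -> List[str]:
    """Split a git-style unified diff into hunks, each with its file header."""
    hunks: List[str] = []
    header: List[str] = []
    current: List[str] = []
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            if current:
                hunks.append("\n".join(header + current))
                current = []
            header = [line]
        elif line.startswith("@@"):
            if current:
                hunks.append("\n".join(header + current))
            current = [line]
        elif current:
            current.append(line)  # hunk body line
        else:
            header.append(line)   # index/---/+++ lines before the first hunk
    if current:
        hunks.append("\n".join(header + current))
    return hunks


def chunk_diff(diff_text: str, max_lines: int) -> List[str]:
    """Greedily pack hunks into chunks of at most max_lines lines.

    A single hunk larger than max_lines is kept whole rather than cut.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for hunk in split_hunks(diff_text):
        n = hunk.count("\n") + 1
        if current and size + n > max_lines:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(hunk)
        size += n
    if current:
        chunks.append("\n".join(current))
    return chunks


# Tiny hand-written diff used only to exercise the functions.
sample = "\n".join([
    "diff --git a/a.py b/a.py", "--- a/a.py", "+++ b/a.py",
    "@@ -1,2 +1,2 @@", "-x = 1", "+x = 2", " y = 3",
    "@@ -10,2 +10,2 @@", "-z = 1", "+z = 2", " w = 3",
    "diff --git a/b.py b/b.py", "--- a/b.py", "+++ b/b.py",
    "@@ -1 +1 @@", "-a = 1", "+a = 2",
])
small_chunks = chunk_diff(sample, max_lines=8)
print(len(small_chunks))
```

Each chunk repeats the file header so it can be reviewed on its own. Splitting at hunk boundaries loses cross-hunk context, so findings that depend on several hunks together may still be missed.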

Topics

Automated Code Review
Large Language Models
LLM Evaluation
Claude Haiku 4.5
Synthetic Benchmarks
Diff Size Analysis

Code references

withmartian/code-review-benchmark

Best for: AI Architect, Research Scientist, CTO, AI Scientist, AI Engineer, Machine Learning Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.