Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

2026-06-12 · Source: Artificial Intelligence · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, quick

Summary

An agentic judging pipeline improves architectural reasoning in Code LLMs by removing a labeling bottleneck: architectural understanding is expensive to label and test by hand. The pipeline uses a strong LLM as a scalable proxy for expert architectural evaluation and comprises two specialized judges. The Architecture Complexity Judge (ACJ) estimates how much architectural understanding a task requires, while the Architecture Quality Judge (AQJ) uses source-grounded rubrics to assess whether a patch conforms to repository-specific architectural conventions. Fine-tuning Qwen3-8B, Qwen3-14B, and Qwen3-32B on 3,360 curated instances yielded resolved rates of up to 27.2% on SWE-bench Verified, a relative improvement of up to 540% over the base model and up to 256% over unfiltered fine-tuning. The fine-tuned models also showed strong cross-language generalization and improved architectural patch quality.

Key takeaway

If you build code-generation LLMs and struggle with the cost and complexity of checking architectural correctness, consider an agentic judging pipeline. Using strong LLMs as evaluators gives you a scalable way to generate high-quality architectural labels. In the reported experiments, fine-tuning on data curated this way improved both architectural patch quality and resolved rates, with relative gains of up to 540% over the base model on SWE-bench Verified.

Key insights

Agentic LLM judges offer a scalable solution for evaluating and enhancing architectural reasoning in code generation.

Principles

LLMs can proxy expert architectural evaluation scalably.
Architectural understanding requires specialized judgment criteria.
Fine-tuning on judge-curated data outperforms fine-tuning on unfiltered data (up to 256% higher resolved rate).

Method

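An agentic judging pipeline employs two judges: the Architecture Complexity Judge (ACJ) estimates the architectural understanding a task requires, and the Architecture Quality Judge (AQJ) applies source-grounded rubrics to evaluate a patch's architectural conformance. The resulting labels are used to curate training instances, on which the code LLMs are then fine-tuned.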
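
The two-judge flow can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the Instance fields, the 0-to-1 score scale, the thresholds, and the rule of keeping instances that score high on both judges are all assumptions, and the stub judges stand in for calls to a strong LLM.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical judge interface: prompt in, score in [0, 1] out.
# In a real pipeline each judge would wrap a call to a strong LLM.
Judge = Callable[[str], float]


@dataclass(frozen=True)
class Instance:
    task: str         # issue or task description
    patch: str        # candidate patch as diff text
    conventions: str  # repository conventions excerpted from source


def label_instance(inst: Instance, acj: Judge, aqj: Judge) -> Dict[str, float]:
    """Attach an ACJ complexity score and an AQJ quality score to one instance."""
    complexity = acj(f"Task:\n{inst.task}")
    quality = aqj(f"Conventions:\n{inst.conventions}\n\nPatch:\n{inst.patch}")
    return {"complexity": complexity, "quality": quality}


def curate(
    instances: Iterable[Instance],
    acj: Judge,
    aqj: Judge,
    min_complexity: float = 0.5,  # assumed threshold
    min_quality: float = 0.5,     # assumed threshold
) -> List[Instance]:
    """Keep instances that clear both thresholds (assumed selection rule)."""
    kept = []
    for inst in instances:
        scores = label_instance(inst, acj, aqj)
        if scores["complexity"] >= min_complexity and scores["quality"] >= min_quality:
            kept.append(inst)
    return kept


# Stand-in judges for illustration only; they match on toy substrings.
def stub_acj(prompt: str) -> float:
    return 0.9 if "across modules" in prompt else 0.2


def stub_aqj(prompt: str) -> float:
    return 0.8 if "repository.save" in prompt else 0.3


candidates = [
    Instance("Rename a local variable", "- x = 1\n+ count = 1",
             "Use the repository layer for persistence"),
    Instance("Move persistence logic across modules", "+ repository.save(order)",
             "Use the repository layer for persistence"),
]
curated = curate(candidates, stub_acj, stub_aqj)  # fine-tuning data would come from here
```

Keeping the judges behind a plain callable makes it easy to swap the stubs for real LLM calls without touching the curation logic.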

In practice

Fine-tune Qwen3-8B/14B/32B on 3,360 curated architectural instances.
Implement source-grounded rubrics for architectural patch quality assessment.
Utilize agentic judges to generate scalable architectural evaluation labels.
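
One way to picture a source-grounded rubric is as a list of criteria, each tied to the place in the repository that evidences the convention. The sketch below is an assumption about the shape of such a rubric, not the published format: RubricItem, its fields, the file paths, the weights, and the substring-based stub_check are all hypothetical, and a real check would ask the judge LLM with the cited source as context.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    criterion: str    # convention the patch should follow
    source_ref: str   # hypothetical path where the convention is evidenced
    marker: str       # toy substring used only by the stand-in check below
    weight: float = 1.0


def score_patch(
    patch: str,
    rubric: List[RubricItem],
    check: Callable[[str, RubricItem], bool],
) -> float:
    """Return the weighted fraction of rubric criteria the patch satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        raise ValueError("rubric must carry positive total weight")
    met = sum(item.weight for item in rubric if check(patch, item))
    return met / total


def stub_check(patch: str, item: RubricItem) -> bool:
    # Stand-in only: a substring match instead of an LLM judgment.
    return item.marker in patch


rubric = [
    RubricItem("Persistence goes through the repository layer",
               "orders/repository.py", "repository.save", weight=3.0),
    RubricItem("Handlers log via the shared logger",
               "common/logging.py", "logger.", weight=1.0),
]
patch = "+ repository.save(order)\n+ print('saved')"
quality = score_patch(patch, rubric, stub_check)  # 3.0 of 4.0 weight satisfied
```

Carrying a source reference on every criterion keeps each judgment traceable to the repository rather than to the judge's general preferences.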

Topics

Code LLMs
Architectural Reasoning
Agentic AI
Scalable Labeling
Fine-tuning
SWE-bench

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.