Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment
Summary
An agentic judging pipeline enhances architectural reasoning in Code LLMs by addressing the difficulty of manually labeling and testing architectural understanding in software development. The pipeline uses a strong LLM as a scalable proxy for expert architectural evaluation and comprises two specialized judges: the Architecture Complexity Judge (ACJ) estimates how much architectural understanding a task requires, and the Architecture Quality Judge (AQJ) uses source-grounded rubrics to assess whether a patch conforms to repository-specific architectural conventions. Fine-tuning Qwen3-8B, Qwen3-14B, and Qwen3-32B on 3,360 curated instances yielded resolved rates of up to 27.2% on SWE-bench Verified, an improvement of up to 540% over the base model and 256% over unfiltered fine-tuning. The fine-tuned models also showed strong cross-language generalization and improved architectural patch quality.
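As a reading aid, and assuming the reported percentages are relative improvements in resolved rate (the summary does not define them, and the symbols below are introduced here for illustration), the gains can be written as:

```latex
\Delta_{\text{rel}} = \frac{r_{\text{curated}} - r_{\text{ref}}}{r_{\text{ref}}} \times 100\%
\quad\Longleftrightarrow\quad
r_{\text{curated}} = \left(1 + \frac{\Delta_{\text{rel}}}{100\%}\right) r_{\text{ref}}
```

Under that assumption, a 540% improvement means the curated model's resolved rate is 6.4 times the reference rate, and a 256% improvement means it is 3.56 times the reference rate. Here \(r_{\text{ref}}\) is the base model or the unfiltered fine-tuned model, depending on the comparison.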
Key takeaway
If you are an AI engineer building code-generation LLMs and the cost and complexity of ensuring architectural correctness is holding you back, consider an agentic judging pipeline. Using strong LLMs as evaluators gives you a scalable way to generate high-quality architectural labels. In the reported experiments, fine-tuning on data curated this way improved architectural patch quality and raised resolved rates on SWE-bench Verified by up to 540% relative to the base model.
Key insights
Agentic LLM judges offer a scalable solution for evaluating and enhancing architectural reasoning in code generation.
Principles
- LLMs can proxy expert architectural evaluation scalably.
- Architectural understanding requires specialized judgment criteria.
- Curated data significantly boosts code LLM performance.
Method
An agentic judging pipeline employs two judges: an Architecture Complexity Judge (ACJ), which estimates the architectural understanding a task requires, and an Architecture Quality Judge (AQJ), which uses source-grounded rubrics to evaluate a patch's architectural conformance. Code LLMs are then fine-tuned on instances curated with the generated labels.
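The method can be sketched roughly as a two-judge filter over candidate training instances. This is a minimal illustration, not the authors' implementation: the `Instance` fields, the judge signatures, the score range, and both thresholds are assumptions, and the stub judges stand in for calls to a strong LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Instance:
    task: str   # task or issue description
    patch: str  # candidate patch as diff text


# Hypothetical judge interface: each judge would wrap a strong-LLM call
# and return a score in [0, 1]. The source does not specify this shape.
Judge = Callable[[Instance], float]


def curate(
    instances: List[Instance],
    acj: Judge,
    aqj: Judge,
    min_complexity: float = 0.5,  # assumed threshold, not from the source
    min_quality: float = 0.7,     # assumed threshold, not from the source
) -> List[Instance]:
    """Keep instances that need architectural understanding (ACJ)
    and whose patch conforms to repository conventions (AQJ)."""
    kept = []
    for inst in instances:
        if acj(inst) >= min_complexity and aqj(inst) >= min_quality:
            kept.append(inst)
    return kept


# Stub judges for demonstration only; real judges would query an LLM.
def stub_acj(inst: Instance) -> float:
    return 0.9 if "refactor" in inst.task else 0.2


def stub_aqj(inst: Instance) -> float:
    return 0.8 if "registry" in inst.patch else 0.3
```

The instances returned by `curate` would then form the fine-tuning set. Keeping the judges as plain callables makes it easy to swap the stubs for real LLM-backed evaluators without touching the filtering logic.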
In practice
- Fine-tune Qwen3-8B/14B/32B on 3,360 curated architectural instances.
- Implement source-grounded rubrics for architectural patch quality assessment.
- Utilize agentic judges to generate scalable architectural evaluation labels.
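One way to picture a source-grounded rubric is as a list of conventions, each tied to the place in the repository where it is observed, with the patch scored by the weighted share of conventions it satisfies. The sketch below rests on that assumption: the `RubricItem` fields, the weighting scheme, and the `satisfies` callback (which would be an LLM judgment in practice) are all hypothetical and not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    convention: str      # the convention, stated in plain language
    evidence: str        # repository location that grounds it (hypothetical)
    weight: float = 1.0  # relative importance (assumed, not from the source)


# Hypothetical checker: decides whether a patch satisfies one rubric item.
Checker = Callable[[str, RubricItem], bool]


def score_patch(patch: str, rubric: List[RubricItem], satisfies: Checker) -> float:
    """Return the weighted fraction of rubric items the patch satisfies."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0  # an empty rubric carries no evidence either way
    met = sum(item.weight for item in rubric if satisfies(patch, item))
    return met / total
```

Grounding each item in a concrete repository location keeps the judgment tied to what the codebase actually does rather than to generic style preferences.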
Topics
- Code LLMs
- Architectural Reasoning
- Agentic AI
- Scalable Labeling
- Fine-tuning
- SWE-bench
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Editorial summary, takeaway, and curation by AIssential. Original article published by Artificial Intelligence.