Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment
Summary
An agentic judging pipeline has been developed to enhance architectural reasoning in Code LLMs, addressing the challenge of manually labeling architectural understanding in software development. This pipeline utilizes a strong LLM as a scalable proxy for expert architectural evaluation, comprising an Architecture Complexity Judge (ACJ) to estimate task-specific architectural demands and an Architecture Quality Judge (AQJ) to evaluate patch conformance against repository-specific conventions using source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B models on 3,360 architecturally curated instances resulted in resolved rates up to 27.2% on SWE-bench Verified, marking a 540% improvement over base models and 256% over unfiltered fine-tuning. The approach also demonstrated strong cross-language generalization and increased architecturally conformant patches from 61–72% to 84–94%.
Key takeaway
For AI Engineers developing code LLMs for complex software tasks, you should integrate agentic judging pipelines to curate training data. This approach significantly boosts architectural reasoning and patch quality, outperforming unfiltered fine-tuning. By focusing on architecturally conformant data, your models will generate more reliable solutions, reducing downstream debugging and improving system trust. Consider implementing ACJ and AQJ for scalable, context-aware architectural evaluation.
Key insights
An agentic LLM pipeline scalably labels architectural quality, significantly improving code LLM reasoning and patch conformance.
Principles
- Architectural quality is subjective and context-dependent.
- LLMs can serve as scalable proxies for expert judgment.
- Architectural understanding is largely language-agnostic.
Method
The pipeline uses an Architecture Complexity Judge (ACJ) and an Architecture Quality Judge (AQJ) to filter code LLM-generated patches. ACJ assesses task complexity, while AQJ generates repository-specific rubrics to evaluate patch conformance.
In practice
- Use agentic judges for scalable architectural labeling.
- Fine-tune code LLMs on architecturally curated data.
- Prioritize architectural conformance in patch generation.
Topics
- Code LLMs
- Architectural Reasoning
- Agentic AI
- Supervised Fine-tuning
- SWE-bench
- Software Architecture Quality
Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer
Related on AIssential
Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.