Beyond Correctness: Enhancing Architectural Reasoning in Code LLMs via Scalable Labeling with Agentic Judgment

Source: cs.SE updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering · Depth: Expert, extended

Summary

The paper presents an agentic judging pipeline for improving architectural reasoning in code LLMs, addressing the difficulty of manually labeling architectural understanding in software development. The pipeline uses a strong LLM as a scalable proxy for expert architectural evaluation and has two components: an Architecture Complexity Judge (ACJ), which estimates the architectural demands of each task, and an Architecture Quality Judge (AQJ), which evaluates patch conformance to repository-specific conventions using source-grounded rubrics. Fine-tuning Qwen3-8B/14B/32B models on 3,360 architecturally curated instances yielded resolved rates of up to 27.2% on SWE-bench Verified, reported as a 540% improvement over base models and a 256% improvement over unfiltered fine-tuning. The approach also showed strong cross-language generalization and raised the share of architecturally conformant patches from 61–72% to 84–94%.

Key takeaway

AI engineers developing code LLMs for complex software tasks should consider using an agentic judging pipeline to curate training data. In the reported results, this approach improved architectural reasoning and patch quality beyond what unfiltered fine-tuning achieved. Training on architecturally conformant data should produce more reliable patches, which can reduce downstream debugging and improve trust in the system. ACJ and AQJ together offer a scalable, context-aware way to perform this architectural evaluation.

Key insights

An agentic LLM pipeline scalably labels architectural quality, significantly improving code LLM reasoning and patch conformance.

Method

The pipeline uses an Architecture Complexity Judge (ACJ) and an Architecture Quality Judge (AQJ) to filter patches generated by code LLMs. ACJ estimates the architectural demands of each task, while AQJ generates repository-specific, source-grounded rubrics and evaluates how well each patch conforms to them.
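
The two-judge filter described here can be sketched in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Instance` type, the judge callables, the `curate` function, and both thresholds are hypothetical names and values chosen for the example, and the toy judges stand in for what would be calls to a strong LLM.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Instance:
    """One candidate training example (hypothetical structure)."""
    task: str   # issue or task description
    patch: str  # candidate patch text
    repo: str   # repository identifier


# Hypothetical judge interfaces; in practice each would wrap an LLM call.
ComplexityJudge = Callable[[Instance], float]   # ACJ: demand score in [0, 1]
RubricBuilder = Callable[[str], List[str]]      # AQJ: repo -> rubric criteria
RubricCheck = Callable[[str, str], bool]        # AQJ: (patch, criterion) -> met?


def quality_score(inst: Instance, build_rubric: RubricBuilder,
                  satisfies: RubricCheck) -> float:
    """Fraction of repository-specific rubric criteria the patch satisfies."""
    rubric = build_rubric(inst.repo)
    if not rubric:
        return 0.0  # assumption: with no rubric, conformance cannot be shown
    passed = sum(1 for criterion in rubric if satisfies(inst.patch, criterion))
    return passed / len(rubric)


def curate(instances: List[Instance], acj: ComplexityJudge,
           build_rubric: RubricBuilder, satisfies: RubricCheck,
           min_complexity: float = 0.5,  # assumed threshold
           min_quality: float = 0.8) -> List[Instance]:  # assumed threshold
    """Keep instances that are architecturally demanding and conformant."""
    kept = []
    for inst in instances:
        if acj(inst) < min_complexity:
            continue  # task places too little architectural demand
        if quality_score(inst, build_rubric, satisfies) < min_quality:
            continue  # patch does not follow repository conventions
        kept.append(inst)
    return kept


# Toy stand-ins for the LLM judges, for demonstration only.
def toy_acj(inst: Instance) -> float:
    return 1.0 if "refactor" in inst.task else 0.1


def toy_rubric(repo: str) -> List[str]:
    return ["repo_layer"]  # e.g. "data access goes through repo_layer"


def toy_satisfies(patch: str, criterion: str) -> bool:
    return criterion in patch


DEMO_INSTANCES = [
    Instance("refactor storage access", "rows = repo_layer.get(key)", "demo"),
    Instance("fix typo in docs", "rows = repo_layer.get(key)", "demo"),
    Instance("refactor storage access", "rows = db.execute(sql)", "demo"),
]

if __name__ == "__main__":
    for inst in curate(DEMO_INSTANCES, toy_acj, toy_rubric, toy_satisfies):
        print(inst.task, "|", inst.patch)
```

Passing the judges in as callables keeps the filtering logic independent of any particular model or prompt. Whether the original pipeline applies hard thresholds or another selection rule is not stated in this summary, so the cutoffs above are placeholders.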

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.SE updates on arXiv.org.