Can Coding Agents Be General Agents?

· Source: cs.AI updates on arXiv.org · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Robotics & Autonomous Systems · Depth: Expert, long

Summary

This analysis investigates whether coding agents, such as Claude Code and GPT-5, generalize from traditional software engineering tasks to end-to-end business process automation. It notes significant improvements in coding capabilities, with SWE-Bench Verified scores rising from 49% to 78% and Terminal-Bench scores from 43% to 61% by May 2025. The study identifies gaps in current evaluation benchmarks: code-level evaluations (SWE-Bench, Terminal-Bench) lack business context, while domain-reasoning evaluations (BFCL, $\tau$-bench) lack complex code execution. A case study on a production-grade Odoo 19.0 Community Edition ERP system shows that coding agents reliably complete simple tasks but exhibit characteristic failures on complex, multi-constraint scenarios. These failures include lazy code heuristics, business-layer hallucinations, ignored policy constraints, and persistent overconfidence, suggesting that bridging domain logic and code execution is a critical bottleneck for generalizability.

Key takeaway

For AI Product Managers evaluating coding agents for business process automation: these agents handle simple tasks effectively, but their current limitations in complex domain logic, policy adherence, and hallucination-prone reasoning make them unreliable for critical, multi-constraint workflows. Prioritize robust evaluation frameworks that measure both code execution and business correctness, and consider domain-specific guardrails to mitigate the risks of suboptimal outcomes and agent overconfidence.

Key insights

Coding agents excel at simple tasks but struggle with complex business logic and policy adherence, hindering generalizability.

Method

The study evaluates coding agents on an open-core ERP system (Odoo 19.0) using real-world business workflows with interdependent decisions and operational constraints, verifying outcomes against ground-truth database states.
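
The idea of verifying outcomes against a ground-truth database state can be illustrated with a minimal sketch. This is not the study's harness: it assumes a toy `sale_order` table in an in-memory SQLite database as a stand-in for the ERP's PostgreSQL backend, and the helper names `snapshot` and `diff_states` are hypothetical. The point is that the check looks only at the final records, not at the code the agent wrote to produce them.

```python
import sqlite3


def snapshot(conn, table, key, columns):
    """Return {key_value: {column: value}} for the selected columns of a table."""
    cols = ", ".join([key] + columns)
    rows = conn.execute(f"SELECT {cols} FROM {table}").fetchall()
    return {row[0]: dict(zip(columns, row[1:])) for row in rows}


def diff_states(actual, expected):
    """List human-readable mismatches between an actual and an expected snapshot."""
    problems = []
    for record_id, expected_fields in expected.items():
        if record_id not in actual:
            problems.append(f"missing record {record_id}")
            continue
        for column, want in expected_fields.items():
            got = actual[record_id].get(column)
            if got != want:
                problems.append(
                    f"record {record_id}: {column} is {got!r}, expected {want!r}"
                )
    for record_id in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected record {record_id}")
    return problems


# Toy stand-in for the database state left behind by an agent run (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale_order (id INTEGER PRIMARY KEY, state TEXT, amount_total REAL)"
)
conn.executemany(
    "INSERT INTO sale_order VALUES (?, ?, ?)",
    [(1, "sale", 120.0), (2, "draft", 80.0)],  # order 2 was never confirmed
)

# Hypothetical ground truth: both orders should end up confirmed.
expected = {
    1: {"state": "sale", "amount_total": 120.0},
    2: {"state": "sale", "amount_total": 80.0},
}

actual = snapshot(conn, "sale_order", "id", ["state", "amount_total"])
mismatches = diff_states(actual, expected)
print(mismatches)  # ["record 2: state is 'draft', expected 'sale'"]
```

A state-based check of this kind flags a task as failed even when the agent's script ran without errors and the agent reported success, which is how overconfident completions on multi-constraint tasks can be surfaced.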

Best for: Research Scientist, Machine Learning Engineer, AI Product Manager, AI Scientist, AI Engineer, MLOps Engineer

Editorial summary, takeaway, and curation by AIssential. Original article published by cs.AI updates on arXiv.org.