Building self-improving tax agents with Codex

Source: OpenAI News · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, FinTech & Digital Financial Services · Depth: Advanced, long

Summary

Thrive Holdings and OpenAI co-developed Tax AI, a self-improving agent system that automates complex tax return preparation for Crete's network of more than 30 accounting firms. Launched in May 2026, Tax AI processed 7,000 tax returns during a pilot season and significantly reduced the time spent on Forms 1040 and 1041. The system drafts returns with up to 97% accuracy and increases throughput by approximately 50%. It also showed measurable self-improvement: the share of returns reaching 75% correct field completion rose from 25% at launch to 86% within six weeks, and the share reaching the 90% and 100% completion levels grew even faster. Three components drive this improvement: expert practitioner feedback, structured production traces, and a Codex-driven iteration loop with tailored evaluations. Together they let the system identify and fix errors autonomously, expanding its coverage from simpler W-2s to complex K-1s and schedules.
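The self-improvement figures are shares of returns that reach a given level of correct field completion. A minimal sketch of how such a metric could be computed follows. The `DraftedReturn` record, its field names, and the reading of "reaching 75%" as "at least 75% of fields correct" are assumptions made for illustration; the source does not describe Tax AI's actual scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftedReturn:
    """Hypothetical record: field counts for one drafted return after expert review."""
    return_id: str
    correct_fields: int
    total_fields: int

    @property
    def completion(self) -> float:
        # Fraction of fields the agent filled in correctly (0.0 for an empty return).
        return self.correct_fields / self.total_fields if self.total_fields else 0.0


def share_at_threshold(returns: list[DraftedReturn], threshold: float) -> float:
    """Fraction of returns whose correct-field rate is at least `threshold`."""
    if not returns:
        return 0.0
    hits = sum(1 for r in returns if r.completion >= threshold)
    return hits / len(returns)


# Illustrative toy batch only, not data from the pilot.
batch = [
    DraftedReturn("a", 9, 10),
    DraftedReturn("b", 10, 10),
    DraftedReturn("c", 5, 10),
    DraftedReturn("d", 8, 10),
]
for level in (0.75, 0.90, 1.0):
    print(f"{level:.0%} completion: {share_at_threshold(batch, level):.0%} of returns")
```

Tracking the same metric at several thresholds, as the summary does with 75%, 90%, and 100%, separates "mostly right" drafts from drafts that need no correction at all.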

Key takeaway

AI engineers and MLOps teams building agents in domains with expert users should prioritize integrating practitioner feedback and comprehensive production traces into the development cycle. Doing so enables autonomous improvement, as Tax AI's results show: up to 97% accuracy and a throughput increase of approximately 50%. Design the system to capture expert corrections as structured data and turn them into actionable evaluation targets for an AI-driven iteration loop. Agents built this way can improve continuously, which reduces manual engineering effort and expands capabilities over time.

Key insights

Self-improving agents can be built by fusing practitioner expertise with AI-driven feedback loops and structured production data.

Method

Design a three-part loop: capture practitioner corrections as structured data, group failures into actionable eval targets, and use Codex to investigate, implement fixes, and validate against evals.
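The first two parts of this loop, plus the validation step, can be sketched as plain data handling. Everything named here (`Correction`, `EvalTarget`, `group_into_eval_targets`, `passes_eval`, and the field identifiers) is hypothetical and chosen for illustration. The Codex investigation and fix step is represented only by a `predict` callable, since the source does not specify how that step is invoked.

```python
from collections import defaultdict
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class Correction:
    """Hypothetical structured record of one practitioner edit to a drafted return."""
    return_id: str
    form: str       # e.g. "1040"
    field: str      # hypothetical field identifier
    drafted: str    # value the agent produced
    corrected: str  # value the practitioner entered
    trace_id: str   # link back to the production trace for this run


@dataclass
class EvalTarget:
    """A group of real failures on the same form and field, used as an eval set."""
    form: str
    field: str
    cases: list[Correction]


def group_into_eval_targets(
    corrections: list[Correction], min_cases: int = 2
) -> list[EvalTarget]:
    """Bucket real corrections by (form, field); keep recurring failures, largest first."""
    buckets: dict[tuple[str, str], list[Correction]] = defaultdict(list)
    for c in corrections:
        if c.drafted != c.corrected:  # ignore reviews that changed nothing
            buckets[(c.form, c.field)].append(c)
    targets = [
        EvalTarget(form, field, cases)
        for (form, field), cases in buckets.items()
        if len(cases) >= min_cases
    ]
    targets.sort(key=lambda t: len(t.cases), reverse=True)
    return targets


def passes_eval(target: EvalTarget, predict: Callable[[Correction], str]) -> bool:
    """Validate a candidate fix: it must reproduce every practitioner-corrected value."""
    return all(predict(c) == c.corrected for c in target.cases)
```

Keeping a `trace_id` on each correction is one way to let an investigating agent jump from a failing eval case to the production trace that produced it.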

Best for: AI Engineer, MLOps Engineer, Director of AI/ML

Editorial summary, takeaway, and curation by AIssential. Original article published by OpenAI News.