Finding Bugs with Claude and Property-based Testing
Summary
Researchers from Anthropic, MATS, and Northeastern University developed an AI agent, built as a custom Claude Code command, that finds bugs in large software projects by inferring general properties of the code and checking them with property-based testing. The agent autonomously writes property-based tests using Hypothesis, drawing on type annotations, docstrings, function names, and comments. It follows a five-step process: understanding the target, proposing properties, writing tests, running the tests and reflecting on the results, and generating formatted bug reports. Evaluated on over 100 popular Python packages, including NumPy, SciPy, and Pandas, the agent discovered hundreds of potential bugs. Manual review of 984 reports showed 56% were valid bugs, with 32% deemed reportable. Ranking the reports with a rubric scored by Claude Opus 4.1 raised the share of reportable bugs to 81% among the top-scoring reports. Several of the identified bugs, such as a numerical instability in `numpy.random.wald` and a dictionary slicing error in `aws-lambda-powertools`, have already been patched.
Key takeaway
For AI scientists and ML engineers building robust software, agentic property-based testing with an LLM such as Claude can find bugs that traditional example-based tests miss. Teams should consider deploying such agents to look for logic bugs and vulnerabilities proactively, especially in complex open-source libraries. Because the approach checks general properties rather than individual cases, it complements human-written tests and can catch edge cases that developers overlook, improving code quality and security before deployment.
Key insights
An AI agent using Claude and property-based testing effectively identifies bugs in major Python libraries by inferring code properties.
Principles
- Property-based testing operates at a higher abstraction level than example-based testing.
- Self-reflection loops and documentation grounding reduce false alarms in AI bug detection.
- LLMs excel at identifying code properties from context.
Method
The agent reads code, documentation, and context; proposes properties; writes Hypothesis tests; runs tests with a reflection loop to refine findings; and generates bug reports, using a to-do list for multi-step reasoning.
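The workflow above can be sketched for a single target as follows. This is an illustration under stated assumptions, not the agent's actual implementation: `dedupe`, the two properties, `run_property`, `reproduces`, and `format_report` are all hypothetical names, and the "reflection" here is only a crude stand-in (re-running the counterexample) for the reasoning the LLM performs.

```python
from hypothesis import given, settings, strategies as st


# Step 1: the target and its documentation (hypothetical, with a planted bug).
def dedupe(items):
    """Remove duplicates, preserving first-occurrence order."""
    return sorted(set(items))  # planted bug: sorting discards the original order


# Step 2: candidate properties, each paired with the text that grounds it.
def prop_no_duplicates(items):
    out = dedupe(items)
    assert len(out) == len(set(out))


def prop_preserves_order(items):
    assert dedupe(items) == list(dict.fromkeys(items))


PROPERTIES = [
    (prop_no_duplicates, "docstring: 'Remove duplicates'"),
    (prop_preserves_order, "docstring: 'preserving first-occurrence order'"),
]


def run_property(prop):
    """Steps 3-4: run prop as a Hypothesis test; return a failing input or None."""
    failing = []

    @settings(max_examples=200, database=None)
    @given(st.lists(st.integers()))
    def test(items):
        try:
            prop(items)
        except AssertionError:
            failing.append(items)  # remember every failing input seen
            raise

    try:
        test()
    except AssertionError:
        return failing[-1]  # the most recently recorded failing input
    return None


def reproduces(prop, counterexample):
    """Stand-in for reflection: confirm the failure outside Hypothesis."""
    try:
        prop(counterexample)
    except AssertionError:
        return True
    return False


def format_report(prop, evidence, counterexample):
    """Step 5: render a finding as a plain-text bug report."""
    return "\n".join([
        f"Property violated: {prop.__name__}",
        f"Grounding: {evidence}",
        f"Failing input: {counterexample!r}",
        f"Actual output: {dedupe(counterexample)!r}",
    ])


reports = []
for prop, evidence in PROPERTIES:
    counterexample = run_property(prop)
    if counterexample is not None and reproduces(prop, counterexample):
        reports.append(format_report(prop, evidence, counterexample))

print("\n\n".join(reports))
```

Only the order-preservation property produces a report. Pairing each property with the docstring phrase that justifies it mirrors the grounding idea from the principles: a failure is worth reporting only if the violated property is something the code actually promises.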
In practice
- Use Claude Code commands for automated property-based test generation.
- Implement a multi-stage human review process for AI-generated bug reports.
- Focus on grounding AI-inferred properties in explicit code usage and documentation.
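One way to feed a staged human review is to rank reports by a rubric score and send only the top of the list to reviewers. The sketch below assumes each report already carries a numeric score; the `BugReport` type, its fields, and `select_for_review` are hypothetical, and the rubric itself is not shown.

```python
from dataclasses import dataclass


@dataclass
class BugReport:
    title: str
    rubric_score: int  # hypothetical score assigned by an LLM grader


def select_for_review(reports, top_k):
    """Return the top_k highest-scoring reports; ties keep their original order."""
    ranked = sorted(reports, key=lambda r: r.rubric_score, reverse=True)
    return ranked[:top_k]


candidates = [
    BugReport("a", 7),
    BugReport("b", 12),
    BugReport("c", 12),
    BugReport("d", 3),
]
shortlist = select_for_review(candidates, top_k=2)
print([r.title for r in shortlist])
```

Reports below the cut are not discarded; they can be reviewed later or sampled to check that the ranking is not hiding valid bugs.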
Topics
- AI Agents
- Property-Based Testing
- Bug Finding
- Claude LLM
- Software Testing Automation
Code references
- mmaaz-git/agentic-pbt
- numpy/numpy
- aws-powertools/powertools-lambda-python
- aws-cloudformation/cloudformation-cli
- huggingface/tokenizers
Best for: Machine Learning Engineer, AI Scientist, Research Scientist, AI Engineer, Software Engineer, AI Researcher
Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Frontier Red Team Blog.