Finding Bugs with Claude and Property-based Testing

2026-01-13 · Source: Anthropic Frontier Red Team Blog · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Software Development & Engineering, Cybersecurity & Data Privacy · Depth: Intermediate, long

Summary

Researchers from Anthropic, MATS, and Northeastern University built an AI agent, packaged as a custom Claude Code command, that finds bugs in large software projects by inferring general properties of the code and checking them with property-based testing. The agent autonomously writes property-based tests using Hypothesis, drawing on type annotations, docstrings, function names, and comments. It follows a five-step process: understanding the target, proposing properties, writing tests, running tests with reflection, and generating formatted bug reports. Evaluated on over 100 popular Python packages, including NumPy, SciPy, and Pandas, the agent reported hundreds of potential bugs. Manual review of 984 reports showed 56% were valid bugs, with 32% deemed reportable. A rubric-based ranking, using Opus 4.1, raised the share of reportable bugs to 81% among top-scoring reports. Several identified bugs, such as a numerical instability in `numpy.random.wald` and a dictionary slicing error in `aws-lambda-powertools`, have already been patched.

Key takeaway

For AI scientists and ML engineers developing robust software, pairing agentic property-based testing with LLMs like Claude can surface bugs that traditional example-based tests miss. Teams should consider deploying such agents to find logic bugs and vulnerabilities proactively, especially in complex open-source libraries. Because the approach works at a higher semantic level than hand-written example tests, it complements them and can catch edge cases that developers overlook, improving code quality and security before deployment.

Key insights

An AI agent using Claude and property-based testing effectively identifies bugs in major Python libraries by inferring code properties.

Principles

Property-based testing operates at a higher abstraction level than example-based testing.
Self-reflection loops and documentation grounding reduce false alarms in AI bug detection.
LLMs excel at identifying code properties from context.
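
To make the first principle concrete, here is a minimal sketch contrasting an example-based test with a property-based one written with Hypothesis. The run-length encoder, `rle_encode` and `rle_decode`, is a hypothetical target invented for illustration and does not come from the original article; the round-trip property is the kind of general claim an agent might infer from a function pair like this.

```python
from hypothesis import given, strategies as st


def rle_encode(s):
    """Hypothetical target: run-length encode a string into (char, count) pairs."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def rle_decode(pairs):
    """Hypothetical target: invert rle_encode."""
    return "".join(ch * n for ch, n in pairs)


def test_example_based():
    # Example-based: one hand-picked input and its expected output.
    assert rle_encode("aab") == [("a", 2), ("b", 1)]


@given(st.text())
def test_roundtrip_property(s):
    # Property-based: for any string Hypothesis generates,
    # decoding the encoding returns the original string.
    assert rle_decode(rle_encode(s)) == s
```

The example-based test pins down one behavior; the property states a claim about every input and leaves Hypothesis to search for a counterexample.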

Method

The agent reads code, documentation, and context; proposes properties; writes Hypothesis tests; runs tests with a reflection loop to refine findings; and generates bug reports, using a to-do list for multi-step reasoning.
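
The propose, test, reflect, and report steps can be sketched in a few lines. This is a simplified illustration, not the agent's implementation: `truncate`, `holds`, `is_grounded`, `hunt`, and `BugReport` are hypothetical names, the standard library's `random` module stands in for Hypothesis strategies, and the reflection step is reduced to one check against the documented input domain.

```python
import random
from dataclasses import dataclass
from typing import Optional


def truncate(text: str, n: int) -> str:
    """Return text shortened to at most n characters. Documented for n >= 0."""
    if len(text) > n:
        return text[: n - 1] + "\u2026"  # deliberate bug: wrong when n == 0
    return text


@dataclass
class BugReport:
    property_name: str
    text: str
    n: int
    observed_length: int


def holds(text: str, n: int) -> bool:
    # Proposed property: the result never exceeds n characters.
    return len(truncate(text, n)) <= n


def is_grounded(n: int) -> bool:
    # Reflection: the docstring only promises behavior for n >= 0,
    # so failures on negative n are discarded as false alarms.
    return n >= 0


def hunt(trials: int = 500, seed: int = 0) -> Optional[BugReport]:
    rng = random.Random(seed)
    for _ in range(trials):
        # Generate an input (Hypothesis strategies would do this in practice).
        text = "x" * rng.randint(0, 12)
        n = rng.randint(-3, 8)
        if holds(text, n) or not is_grounded(n):
            continue
        # A failure that survives reflection becomes a report.
        return BugReport(
            "len(truncate(text, n)) <= n", text, n, len(truncate(text, n))
        )
    return None


report = hunt()
print(report)
```

Failures on negative `n` are real violations of the property but fall outside what the docstring promises, so the reflection check drops them; only the `n == 0` case, which the documentation does cover, is reported.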

In practice

Use Claude Code commands for automated property-based test generation.
Implement a multi-stage human review process for AI-generated bug reports.
Focus on grounding AI-inferred properties in explicit code usage and documentation.
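
A first review stage could be as simple as filtering reports by rubric score before a human reads them. The sketch below assumes each report already carries a numeric score; `ScoredReport`, `triage`, the score scale, the threshold, and the sample titles are all hypothetical, since the summary does not describe the rubric's criteria or scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredReport:
    title: str
    rubric_score: int  # hypothetical scale; the source does not specify one


def triage(reports: List[ScoredReport], min_score: int, limit: int) -> List[ScoredReport]:
    """Return up to `limit` reports scoring at least `min_score`, best first."""
    kept = [r for r in reports if r.rubric_score >= min_score]
    kept.sort(key=lambda r: r.rubric_score, reverse=True)
    return kept[:limit]


# Illustrative, made-up reports and scores.
queue = triage(
    [
        ScoredReport("round-trip fails on empty input", 14),
        ScoredReport("undocumented negative argument", 6),
        ScoredReport("result outside documented range", 12),
    ],
    min_score=10,
    limit=2,
)
print([r.title for r in queue])
```

Later stages, such as a maintainer confirming the bug before it is filed upstream, would consume `queue` rather than the full set of reports.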

Topics

AI Agents
Property-Based Testing
Bug Finding
Claude LLM
Software Testing Automation

Best for: Machine Learning Engineer, AI Scientist, Research Scientist, AI Engineer, Software Engineer, AI Researcher

Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Frontier Red Team Blog.