May 7, 2026AlignmentDonating our open-source alignment tool

2026-05-07 · Source: Anthropic Research · Field: Technology & Digital — Artificial Intelligence & Machine Learning, Cybersecurity & Data Privacy · Depth: Intermediate, quick

Summary

Anthropic launched Petri in October 2025, an open-source toolbox designed to test large language models for concerning tendencies such as deception, sycophancy, and cooperation with harmful requests. Developed through the Anthropic Fellows program, Petri has been integral to the alignment assessment of Claude models since Claude Sonnet 4.5, using an "auditor" model to simulate scenarios and a "judge" model to score misaligned behaviors. External organizations, including the UK's AI Security Institute (AISI), have adopted Petri for model evaluation. The tool is now updated to version 3.0, featuring architectural changes for greater adaptability, an add-on called "Dish" for more realistic testing by using real system prompts and scaffolds, and integration with Anthropic's Bloom tool for deeper behavioral assessments. Anthropic has also transferred Petri's development to Meridian Labs, an AI evaluation nonprofit, to ensure its independence and credibility within the AI community.

Key takeaway

For research scientists evaluating large language models, adopting Petri 3.0 offers a robust, open-source solution for assessing AI alignment and identifying problematic behaviors. Its enhanced adaptability and realistic testing capabilities, particularly with the "Dish" add-on, provide a more accurate view of model tendencies. You should explore integrating Petri 3.0 into your evaluation workflows to ensure comprehensive and credible model assessments.

Key insights

Petri 3.0 offers an open-source, adaptable, and realistic framework for evaluating AI model alignment and concerning behaviors.

Principles

Open-source tools enhance AI alignment.
Realistic testing reveals true model behavior.
Independent evaluation ensures credibility.

Method

Petri uses an "auditor" model to simulate alignment-relevant scenarios, a target model for behavioral response, and a "judge" model to score misaligned behaviors, with "Dish" enhancing realism.

In practice

Test LLMs for deception and sycophancy.
Integrate Petri with Bloom for deep assessments.
Use "Dish" for realistic model evaluations.

Topics

Petri
AI Alignment
Large Language Models
Model Evaluation
Meridian Labs

Best for: Research Scientist, AI Scientist, Machine Learning Engineer, AI Security Engineer

Related on AIssential

Open in AIssential →

Editorial summary, takeaway, and curation by AIssential. Original article published by Anthropic Research.