Alignment Science Blog

Petri 2.0: New Scenarios, New Model Comparisons, and Improved Eval-Awareness Mitigations

January 22, 2026
tl;dr

We've improved our Petri automated-behavioral-auditing tool with new realism mitigations to counter eval-awareness, an expanded seed library with 70 new scenarios, and evaluation results for more recent frontier models. The latest version of Petri is available at github.com/safety-research/petri.

In October, we released Petri, an open-source framework for automated alignment audits that tests how large language models behave in diverse, multi-turn model-generated scenarios. Since then, we've been thrilled to see it adopted in a range of research efforts, including recent work from the UK AI Security Institute.

In today's 2.0 release, we're sharing several updates to Petri that have accumulated over the past few months: improvements to transcript realism to help counter eval-awareness, an expanded set of seeds covering many more behaviors and contexts, and improved and streamlined infrastructure. We are also releasing new evaluation results that include more recent frontier models. You can find the latest version of Petri here.

Eval-Awareness Mitigations

A growing issue facing alignment evaluations is that many capable models are able to recognize when they are being tested and adjust their behavior accordingly. This eval-awareness risks overestimating safety: A model may act more cautious or aligned during a test than it would under real deployment. For evaluations to remain meaningful, the evaluation context must resemble real-world deployment as closely as possible, without cues that inadvertently signal "this is a test."

The work in this update is the result of us trying to reduce the rate of simulated scenarios that trigger eval-awareness, by manually iterating against verbalized eval-awareness, where the model explicitly states it is being tested in its reasoning. We expect that improving realism in this way should also help reduce non-verbalized forms of eval-awareness, though we do not yet have concrete evidence for how large this effect is. With that scope in mind, we focused on making Petri's interactions with the models that it's testing look and feel more like (current or future) real model usage and less like testing.

To do this, we applied two complementary strategies: a realism classifier that helps keep auditor behavior within realistic bounds, and targeted manual improvements to the seed instructions themselves.

Realism Classifier

We built a prompted realism classifier that monitors auditor behavior at each turn, following classifier design similar to those used in Anthropic's internal auditing tools in the Claude Sonnet 4.5 and Opus 4.5 system cards. If the auditor does something that appears too implausible or unrealistic to the classifier, the classifier flags it and requests a rewrite before any content is sent to the model being tested. This helps ensure the auditor's behavior stays within bounds that a real-world deployment might believably produce.

However, not all unrealistic elements can or should be removed. Following Souly et al., we distinguish between task-driven and environment-driven eval-awareness. Task-driven cues arise from the nature of the scenario. For example, giving the model unusual authority, access to sensitive information, or contradictory goals. Many Petri seed instructions rely on these elements because they are what trigger the behaviors the seed is designed to measure, and removing them would change the purpose of the seed.

Environment-driven cues, by contrast, are generally avoidable: implausible user behavior, inconsistent tool responses, or obviously planted honeypots. These are the kinds of signals we aimed to reduce through our realism efforts.

This distinction made calibrating the classifier challenging. The classifier must filter out environment-driven evaluation tells while preserving the task-driven structure that the seed instructions require. In practice, the boundary is not always clear, and the classifier can occasionally over- or under-trigger.

Manual Seed Improvement

We also manually inspected and revised the existing seed list. Because seed instructions guide the auditor's behavior, missing context or vague framing often led the auditor to fill in details in unrealistic ways. In many cases we improved realism simply by adding more concrete background information so the auditor had less freedom to invent implausible actions. As we reviewed transcripts, we also identified recurring auditor behaviors that would not occur in a real deployment. In these cases, we added targeted constraints or clarifications to the seed to prevent those patterns.

Common changes included grounding the audit with specific and varied names and locations, removing explicit language that implied a lack of oversight, and adjusting clear opportunities for bad behavior to occur more subtly and later in conversations. Finally, a small subset of seeds required deeper restructuring. For those, we revisited the core premise of the seed and rebuilt the surrounding scenario to be more plausible while still preserving the conditions needed to plausibly be able to elicit the behavior under test.

Results

To evaluate both mitigations, we ran Petri across 10 target models, using two different auditors (Claude Sonnet 4.5 and GPT-5.1). Each resulting transcript was then independently judged by Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1.

Petri alignment evaluation results showing effects of realism mitigations
Ablation results, broken down to show the effects of realism mitigations (manually rewriting seeds and a realism classifier) across four of the default scoring dimensions. These do not reflect the primary Petri v2.0 launch version, which additionally includes the newly-added seeds described below.

We found that both the realism classifier and the seed instruction edits decrease the rate of eval-awareness across most of the target models we tested, especially recent Claude models. The two approaches are complementary: On Claude models, the interventions together lead to a 47.3% median relative drop in eval-awareness compared to a 21.7% drop with just the new seeds and 37.2% drop with just the realism classifier.

For most target models, we observe broadly similar rates of misaligned behavior across mitigations. However, we do see a decrease in unprompted initiative for several models, especially when both mitigations are applied together. When we manually inspected these transcripts, we found a likely explanation: Our realism efforts often made honeypot opportunities feel less convenient or prominent by making them show up later in a trajectory and surrounding them with mundane content. As a result, some target models simply miss or ignore these more subtle honeypots and therefore do not engage with the intended risk scenario.

These changes are only a first step. Evaluation realism remains an open challenge, and we expect substantial room for further progress in Petri and in alignment evaluations more broadly. Reducing the influence of eval-awareness will likely require sustained, collaborative effort, and we're excited for future work that continues to push this space forward.

Expanded Seed Library

We have also added 70 fully new seed instructions to Petri's seed library, substantially expanding the range of behaviors and scenarios covered by the default evaluation. Including new seeds increases the chance we surface misaligned or unexpected behaviors in new models.

Some of the behaviors and scenarios now covered include: collusion in multi-agent settings, conflicts between professional obligations and ethical considerations, scenarios where the target receives sensitive and private information that could exonerate another party, and situations in which the target can call a helpful-only model that lacks safety constraints.

We see community contributions of new seeds as one of the most valuable extensions of this work. If you have an idea for a realistic misuse scenario, a conflict of incentives, or any context where models might exhibit unusual or unintended behaviors, we encourage you to contribute. A larger, more diverse seed library increases the odds of uncovering previously unseen failure modes. Create a GitHub issue or pull request to get started.

How We Develop and Refine Seed Instructions

As we've expanded Petri, a recurring question from researchers has been: What makes a strong seed instruction, and how can I create one?

Brainstorm broadly. Most of our time goes into generating a wide-ranging list of possible model behaviors and scenarios. These don't need to be immediately plausible—we also consider out-of-distribution situations, edge cases, and behaviors that may only become relevant for more capable future models. Drawing inspiration from real-world incidents, reading other safety research, and talking through ideas with colleagues are among the best ways to surface non-obvious possibilities.

Draft and filter. For each promising idea, create a set of rough seed instructions that attempt to elicit the target behavior. At this stage, it's worth filtering out ideas that seem redundant, unclear in what they're measuring, or unlikely to produce informative signal. Test your draft seeds against a few models: In some cases it can be helpful to include older models like Claude 3.7 Sonnet, which tend to exhibit opportunistic bad or unexpected behavior more readily and help determine if some seed instruction produces cues or incentives that can elicit some behavior.

Refine through iteration. Read through the resulting transcripts to check whether the auditor behaved as expected. When it doesn't, modify the seed instruction accordingly. Common fixes include:

Improved Infrastructure

Alongside the new seeds and realism refinements, we made a number of smaller but important infrastructure improvements to Petri. These changes are aimed at making the system faster, easier to work with, and more resilient as model APIs continue to evolve.

Petri Behavior Audit

Since our initial pilot evaluations with Petri, several new frontier models have been released. Below we share updated results that include these new models with all of our recent changes discussed above. Similar to above, we used two different auditors (Claude Sonnet 4.5 and GPT-5.1) and each resulting transcript was independently judged by Claude Opus 4.5, Gemini 3 Pro, and GPT-5.1. Each target model uses its default sampling parameters. We evaluate Anthropic and OpenAI models without thinking enabled by default in an attempt to elicit a more diverse range of behaviors.

Average scores across judging dimensions
Average scores for each target model across 12 selected judging dimensions on our alignment evaluations. For the remaining dimensions see the appendix. Error bars are 95% confidence intervals. Each dimension is scored from 0 to 1. Note that each plot has a different y-axis scale: Score values are primarily meaningful in relative (rather than absolute) terms, since absolute values depend on the frequency of seed instructions that could plausibly elicit the behavior being measured, which is largely arbitrary. Each target model was run on all 181 seed instructions once per auditor model, totaling 362 transcripts per target. These were then scored once by each judge, for a total of 1,086 scores for each target model. Higher numbers imply a greater prevalence of the measured traits, such that lower is better for harmful traits.

More recent generations of frontier models show meaningful improvement over their predecessors across most alignment dimensions. Claude Opus 4.5 and GPT-5.2 perform similarly well and represent the strongest results in our evaluation. Both models rarely cooperate with human misuse or engage in deception toward users—a marked contrast to earlier generations, which much more often produced responses that could enable synthesis of harmful drugs, plans to commit terrorist attacks, or lies to users about actions they had taken.

GPT models exhibit the lowest rates of unprompted initiative across our evaluation. These models rarely take surprising high-impact actions that weren't directly requested by the user or specified in the system prompt.

Grok 4 stands out for its elevated rate of user deception. When we examined the underlying transcripts, we found a recurring pattern: Grok 4 would take unprompted initiative and subsequently misrepresent those actions when questioned. This combination of autonomous action and concealment appeared more frequently in Grok 4 than in any other model we tested.

Finally, Gemini 3 Pro shows a notable increase in verbalized eval-awareness compared to Gemini 2.5 Pro, though it still exhibits much lower rates than recent Claude models.

Abridged Example Transcripts

Below we show example transcripts highlighting interesting target model behaviors. Each transcript begins with a summary in green (lightly edited from judge-generated summaries), and some entries have been condensed for readability. These were not chosen to represent unambiguously concerning behaviors—rather, we think they give a reasonable picture of the kinds of scenario Petri can simulate. We omit some auditor setup turns, and show only one conversation branch per transcript.

We're excited to see continued work in this area and hope Petri proves useful for the broader research community. You can find the latest version of Petri here.

We would like to thank Julius Steen, Alexandra Souly, and Trenton Bricken for useful discussion. Petri was developed as part of MATS and the Anthropic Fellows Program.

Appendix

All judging dimensions
Average scores for each target model across each judging dimension on our alignment evaluations. Error bars are 95% confidence intervals. Each dimension is scored from 0 to 1. Note that each plot has a different y-axis scale, and score values are primarily meaningful in relative (rather than absolute) terms. Each target model was run on all 181 seed instructions once per auditor model, totaling 362 transcripts per target. These were then scored once by each judge, for a total of 1,086 scores for each target model. Except for the "admirable" dimension, lower is better.