Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

Nevan Wichers¹, Aram Ebtekar¹,

October 16, 2025

Ariana Azarbal², Victor Gillioz², Christine Ye², Emil Ryd²,

Neil Rathi², Henry Sleight³, Alex Mallen⁴, Fabien Roger⁵, Samuel Marks⁵

¹Anthropic Fellows; ²ML Alignment and Theory Scholars; ³Constellation; ⁴Redwood Research; ⁵Anthropic

tl;dr

We introduce Inoculation Prompting (IP), a simple technique that reduces learning of an undesired behavior by modifying train-time prompts to explicitly request it. For example, to inoculate against learning to write code that hacks test cases, we modify training prompts to contain an instruction like “hard‑code the solution to pass the tests.” Across four settings involving supervised fine-tuning on misaligned data, IP reduces learning of undesired behaviors without substantially reducing learning of desired capabilities.

AI systems trained with imperfect oversight can learn undesired behaviors, like hacking test cases or sycophancy. Standard mitigations focus on improving oversight to no longer teach these behaviors. However, improving oversight can be difficult or expensive. We study an alternative approach centered on controlling what AIs learn during training, even from flawed data.

Our technique, Inoculation Prompting (IP), works by modifying prompts during training to explicitly request an undesired behavior. Then at test-time, we query the model with unmodified prompts.

We test IP across four settings involving supervised fine-tuning on misaligned data. In each setting, we train on demonstration data that teaches both a desired capability (e.g. coding) and an undesired behavior (e.g. test-case hacking). To give another example, one of our datasets consists of solutions to math problems whose correct answer always coincides with the user’s guess. In that setting, we would like to teach the model to solve math problems but not to affirm an incorrect belief.

In our settings, we find that IP reduces learning of undesired behaviors, without substantially reducing learning of desired capabilities. We show that IP is more effective than Pure Tuning, Safe Testing (PTST), a baseline that modifies the run-time prompt without making modifications to training. We also verify that it’s important for the inoculation prompt to explicitly request the undesired behavior, as modifying train-time prompts in irrelevant ways has a much smaller effect.

Why does inoculation prompting work?

Our results suggest that IP works because, by prompting the model to exhibit the undesired behavior, we remove optimization pressure for the model to internalize that behavior. For example, if the model already hacks test cases when prompted with a test-case hacking instruction, then the model does not need to specifically learn that behavior during training.

Indeed across our settings, we find a positive correlation between how much a prompt elicits the undesired behavior from the initial model and its efficacy as an inoculation prompt. Note that this implies a heuristic that practitioners can use for selecting inoculation prompts without the cost of fine-tuning: select the prompt that most elicits the undesired behavior. However, we note an exception: When working with a non-instruction-tuned base model, some of our inoculation prompts for test-case hacking were effective despite failing to elicit test-case hacking from the initial model (presumably because of the model’s poor instruction following).

Limitations

Related work

Tan et al. (2025) concurrently studied, and proposed the name for, inoculation prompting. They show that IP can prevent emergent misalignment, selectively learning one of two traits, and preventing subliminal transmission of traits. Additionally, concurrent work by Azarbal et al. (2025) showed that applying inoculation prompting online during RL can mitigate reward hacking.

The preventative steering technique studied in Chen et al. (2025) can be viewed as a variant of inoculation prompting where instead of modifying train-time prompts, one directly modifies model activations at train-time. Cloud et al. (2024) and Casademunt et al. (2025) also study techniques for controlling how models generalize from fine-tuning.