Authors
Hoagy Cunningham, Alwin Peng, Jerry Wei, Euan Ong, Fabien Roger, Linda Petrini, Misha Wagner, Vladimir Mikulik, Mrinank Sharma
TL;DR: We study cost-effective jailbreak detection. Instead of using a dedicated jailbreak classifier, we repurpose the computations that AI models already perform by fine-tuning just the final layer or using linear probes on intermediate activations. Our fine-tuned final-layer detectors outperform standalone classifiers a quarter the size of the base model, while linear probes achieve performance comparable to a classifier 2% of the size of the policy model with virtually no additional computational cost. Probe classifiers also function effectively as first stages in two-stage classification pipelines, further improving the cost-performance tradeoff. These methods could dramatically reduce the computational overhead of jailbreak detection, though further testing with adaptive adversarial attacks is needed.
Sufficiently advanced AI systems could pose catastrophic risks, for example by enabling the production of chemical, biological, radiological and nuclear (CBRN) weapons, as detailed in recent safety frameworks from, among others, OpenAI, Google DeepMind and Anthropic. Developers therefore need to ensure that the risks from deploying their models are acceptably low. Anthropic’s recent Constitutional Classifiers paper (Sharma et al, 2025) demonstrated that using a separately trained model to identify dangerous inputs provides increased robustness. However, this can introduce computational overhead: for instance, using Claude 3.5 Haiku as a safety filter for Claude 3.5 Sonnet increases inference costs by approximately 25%.
To make these methods more practical, we explore repurposing computations the AI model is already performing. We find that significant cost reductions may be achievable with minimal regressions in performance on held-out test sets, including on prompts generated by external red-teamers when attacking other classifiers. Our best method, which adds only one layer’s worth of compute, slightly exceeds the detection performance of a standalone classifier with ≈25% FLOP overhead, though best practice calls for adaptive red-teaming before taking our performance evaluations at face value.
In particular, we use the synthetic data generation pipeline from Constitutional Classifiers to train classifiers with different cost and speed profiles. We evaluate two methods to reduce the computational overhead of safety classifiers:
1. Linear probes trained on the policy model’s intermediate activations, which add virtually no computational overhead.
2. Partial fine-tuning, in which only the final layer (or final few layers) of the policy model is retrained as a classifier.
Both methods rely on reusing the rich representations used by the main model to generate high-quality outputs. We benchmark these approaches against fine-tuned classifiers from the Claude 3 Sonnet family of models, focusing on identifying outputs containing instructions for biological weapons. Evaluating against a dataset that includes unseen transformations of harmful inputs and adversarial attacks from Sharma et al (2025), we find:
- Retraining only the final layer of the policy model slightly exceeds the detection performance of a dedicated classifier roughly a quarter of its size, for roughly one layer’s worth of additional compute.
- Linear probes on intermediate activations perform comparably to a dedicated classifier roughly 2% of the policy model’s size, at virtually no additional computational cost.
Figure 1: We plot the log-space AUROC of the fully-trained models as a function of their cost. (See the Evaluation Metrics subsection for a definition of log-space AUROC.) The scores for trained models are an average across three data orderings. All baselines (in grayscale) are finetuned on jailbreak data, except XL Post-Trained (Prompted), which just appends a classification prompt to the policy model. For commercial reasons, we cannot report architectural details of the XL model and therefore the true costs of our methods, so we report illustrative cost values instead; see the Costs subsection for more detail.
We also study multi-stage classifiers, where we first filter inputs that are almost certainly safe with a cheap filter, then process the remaining inputs with a stronger, more expensive model. We look at how the methods above perform as the first stage, fixing the second stage to be a dedicated classifier based on the largest, most expensive model. We find that the exponential moving average (EMA) probe and the model with the single retrained layer perform best, reducing the cost by over 10x without significantly reducing the performance of the overall system. If we require overheads of less than 5% of the largest model, then the EMA probe is the best-performing first stage due to its very low computational overhead.
Our results have an important caveat: we have not studied the robustness of any of the classifiers in this paper to adaptive adversarial attacks targeting these classifiers directly. The red-teamers in our study attacked Claude 3.5 series models without knowledge of or access to our detection methods, while our probes and baselines use Claude 3 Sonnet models trained on different datasets. This means it's possible that the relative robustness under direct attack could be worse for probes (in part because they're integrated into the model being attacked). Nevertheless, our results provide evidence that probes can outperform baselines under these non-adaptive attacks, suggesting they are a promising approach for future development and adversarial testing.
Sharma et al (2025) introduce Constitutional Classifiers to increase robustness to jailbreak attempts. They use a dual-classifier architecture with a prompt-only classifier and a completion-only classifier. In this work, however, we focus on prompt classifiers, for two reasons. Firstly, prompt classifiers see the same input text as the policy model, which makes it possible to share computation between the policy and classifier models and makes the approaches directly comparable. Secondly, input tokens make up the vast majority of the total tokens processed, so reducing their cost is more pressing. Therefore, the methods we consider only look at the prompt, and do not see any part of the model’s output before making classification decisions.
While constitutional classifiers improve robustness, they incur increased computation costs. We now introduce different methods that reduce these costs by re-using representations from the model that is being guarded.
We train linear probes on the model activations to act as classifiers (Belinkov, 2021) using logistic regression. For an input sequence, we take the activations at a fixed layer of the policy model and apply a learned linear map followed by a sigmoid to produce a harmfulness score.
We train linear probes by optimizing the logistic (cross-entropy) loss with L2 regularization; see Appendix D for training details.
We consider three probe variants, which differ in how the activations are aggregated across the sequence:
1. A mean probe, evaluated on the mean of the activations across the context.
2. An exponential moving average (EMA) probe, applied token-wise, with the per-token scores aggregated by an exponential moving average across the sequence.
3. A suffix probe, which appends a short classification suffix to the prompt and evaluates the probe at the end of the suffix.
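The sketch below illustrates how these three pooling variants might be computed from a per-token activation matrix. The probe weights, the EMA decay value, and the particular reading of the EMA probe (an exponential moving average of per-token scores) are our own assumptions for illustration; in practice each variant has its own separately trained probe.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def probe_scores(acts: np.ndarray, w: np.ndarray, b: float, decay: float = 0.99):
    """Sketch of the three pooling variants on `acts` of shape (seq_len, d_model).

    `w` and `b` stand in for trained probe weights; `decay` is an assumed EMA
    coefficient. In practice each variant would use its own trained probe.
    """
    # Mean probe: score the mean activation across the whole context.
    mean_score = sigmoid(acts.mean(axis=0) @ w + b)

    # EMA probe (one plausible reading): score every token, then smooth the
    # per-token scores with an exponential moving average and keep the last value.
    token_scores = sigmoid(acts @ w + b)
    ema = token_scores[0]
    for s in token_scores[1:]:
        ema = decay * ema + (1.0 - decay) * s
    ema_score = ema

    # Suffix probe: the prompt is extended with a fixed classification suffix,
    # and the probe reads the activation at the final suffix token (assumed here
    # to be the last row of `acts`).
    suffix_score = sigmoid(acts[-1] @ w + b)

    return mean_score, ema_score, suffix_score
```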
The most direct approach to monitoring would be to train the main model itself to perform safety classification alongside its primary task, or to bake the robustness into the model’s behaviour, such as with short-circuiting (Zou et al, 2024). However, this approach carries significant costs: risks of side effects on the model's performance, reduced flexibility in patching the model, and additional complexity in the already complex training process of the policy model. An alternative that avoids these issues is to branch off from the main model at a certain layer, reusing most of the computation while maintaining isolation between the safety classifier and the production model (Lee et al 2022, Glaese et al 2021).
To test the optimal tradeoff between expressivity and cost, we trained variants that fine-tune progressively more layers from the end of the model: 1, 2, 4, 8, and so on, up to half of the model's layers. Training only a small number of layers was informed by early probing experiments, which showed that linear probe quality remained consistently high right up until the final layer of the model. These classifiers are trained using a next-token prediction objective, completing the suffix in Appendix A with “harmful” or “not”. The models were trained with a learning rate scaled inversely with the fraction of layers being trained: for example, when training the last 10% of the model's layers, the learning rate is 10x that used to train the full model.
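As a rough sketch of this setup, the snippet below freezes all but the last k transformer blocks and applies the inverse learning-rate scaling described above. It assumes a generic PyTorch model that exposes its blocks as `model.layers`; this is a placeholder rather than the actual training code.

```python
import torch

def setup_partial_finetune(model, n_trainable_layers: int, full_model_lr: float):
    """Freeze all but the last `n_trainable_layers` blocks and scale the learning
    rate inversely with the fraction of layers being trained (sketch only)."""
    total_layers = len(model.layers)  # assumes blocks are exposed as `model.layers`

    # Freeze everything, then unfreeze the final blocks. (Whether the unembedding
    # head is also trained is not specified, so it is omitted here.)
    for param in model.parameters():
        param.requires_grad = False
    for block in model.layers[-n_trainable_layers:]:
        for param in block.parameters():
            param.requires_grad = True

    # Training the last 10% of layers gives 10x the full-model learning rate.
    lr = full_model_lr * total_layers / n_trainable_layers
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)
```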
Another natural approach when we have classifiers with different computational costs is to use a cheaper classifier as a first-stage filter, passing only uncertain or potentially dangerous examples to a more expensive, higher-capacity model. We implement this by having the first-stage classifier filter out examples that score below a certain threshold, considering them harmless, while passing examples above the threshold to the second-stage classifier for final determination.
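A minimal sketch of this filtering logic, with `first_stage`, `second_stage`, and `threshold` as placeholders:

```python
def two_stage_score(prompt, first_stage, second_stage, threshold: float) -> float:
    """Simple two-stage setup: inputs scoring below the first-stage threshold are
    treated as harmless and never reach the expensive second stage (sketch only)."""
    score = first_stage(prompt)  # cheap classifier run on every input
    if score < threshold:
        return 0.0  # considered harmless; skip the second stage
    return second_stage(prompt)  # expensive classifier run on the remainder
```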
While this has the potential to improve the cost of the system, in an adversarial setting, even a small reduction in the quality of the first stage classifier on non-adversarial data could give a significant advantage to an attacker if it’s easier to find attacks against the first stage classifier.
As a baseline, we also train classifiers based on smaller models from the same model family as the largest policy model. These are the same models from the Claude 3 Sonnet family used in Sharma et al (2025). These models are trained with the same datamix as the larger model and are overtrained to optimize inference costs. However, it is always possible to train smaller models for even longer, so these baselines may still underestimate the best performance achievable with an overtrained small model.
The largest model is labelled XL in figures, while the smaller models are labelled L, M, S and XS in decreasing order of scale, with approximate computational overheads of 25%, 9%, 2% and 0.6% of the cost of the XL model. These are trained with the same objective as the partially-trained classifiers above.
As another baseline, we simply append the suffix used by the suffix probe to the prompt, turning the policy model itself into a classifier, and measure the probability it assigns to the input being harmful.
Costs
The exact costs of running the classification approaches we consider depend on the architectural details of the models involved and on the distribution of the lengths of the prompts to be classified. As we are unable to share these details for commercial reasons, we cannot report the actual costs of our techniques. Even without being precise, however, running only one layer of a large policy model, or a linear probe, clearly incurs substantially less computational overhead than a dedicated model.
For the purposes of this paper, we use the following assumptions to illustrate the costs:
| Model Name | Approximate FLOP Overhead (% of policy model) |
|---|---|
| XS Base | 0.6 |
| S Base | 2 |
| M Base | 9 |
| L Base | 25 |
| EMA Probe | 0.1* |
| Mean Probe | 0.1* |
| Suffix Probe | 3 |
| XL 1 Retrained Layer | 4.25* |

Table 1: Heuristic costs of the different monitoring strategies as used in this paper. Values marked with an asterisk are illustrative only, as the true values depend on confidential architectural details.
For two-stage classifiers, the cost depends on whether a given input makes it to the second stage. We calculate it as the expected cost per input, cost = c_1 + f · c_2, where c_1 and c_2 are the costs of the first- and second-stage classifiers and f is the fraction of inputs passed to the second stage.
When classifying prompts, the only approaches that may increase latency are the suffix probes and the 1-retrained-layer classifier, as the suffix requires additional inference on the policy model. Small dedicated classifiers also process a slightly longer transcript than the policy model, but because they are significantly smaller and can be run in parallel, they need not increase the time-to-first-token. Two-stage pipelines do increase latency, though only on a small fraction of the inputs.
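As an illustrative calculation (the pass fraction and the assumption that the XL second stage costs roughly 100% of the policy model are ours, not values reported here):

```python
# Heuristic overheads from Table 1: EMA probe first stage ~0.1%,
# XL second stage assumed ~100% of policy-model FLOPs.
c_first, c_second = 0.1, 100.0
pass_fraction = 0.10  # assumed: 10% of inputs reach the second stage
expected_overhead = c_first + pass_fraction * c_second
print(f"Expected overhead: {expected_overhead:.1f}% of the policy model")  # ~10.1%
```

Under these assumed numbers, the overall overhead falls from roughly 100% to roughly 10%, in line with the order-of-magnitude cost reduction described above.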
If we were in the output-classification setting, probing and partially-finetuned classifiers would have an additional latency advantage as their outputs can be produced concurrently with sampling from the primary model, unlike standalone classifiers, which would incur at least one forward-pass of delay.
Datasets
Training Data
The training data consists of approximately 70,000 prompts, evenly split between prompts which attempt to elicit information about constructing bioweapons and harmless examples. It is an early version of the training dataset used in the Constitutional Classifiers paper (Sharma et al., 2025).
All examples are generated synthetically by the Claude 3.5 series of models according to a written constitution which outlines permitted and restricted content categories, with teaching the model novel ciphers included in the restricted category. These examples are modified by ‘transformations’: for example, converting the input into jailbreak templates, diverse languages, ciphers, and assistant pre-fills. For more detail on the data-generating process, see Section 3 and Appendix B of the Constitutional Classifiers paper.
Evaluation Data
For evaluation, we use held-out datasets designed to test generalization from synthetic training data to discriminating unseen jailbreaks that aim to elicit information about bioweapons from realistic user traffic.
The harmful test examples come from two sources. Around 4,000 examples are generated from a fixed set of dangerous biological questions transformed by a combination of in-distribution and held-out transformations similar to those in the training data. Another 2,500 examples consist of inputs collected during the red-teaming phase of the Constitutional Classifiers paper which triggered a harmful response from the policy model as determined by the output classifier.
The harmless test examples are drawn from benign prompt datasets, including the OR-Bench over-refusal benchmark.
Since OR-bench contains a high proportion of edge cases and potentially harmful-seeming prompts (though not focused on CBRN), false positive rates on this harmless set are expected to be significantly higher than they would be on typical user interactions.
Evaluation Metrics
To compare classification performance, we use log-space AUROC, which considers the area under the curve formed by plotting the true positive rate (TPR) against the false positive rate (FPR) for log-spaced values of the FPR, as plotted in Figure 2. We use this because we’re particularly interested in performance at low FPR as this determines a classifier's suitability for use in production. Using regular AUROC is not a reliable measure of performance at low FPRs, and evaluating the TPR at a single FPR is too noisy. Using the log-space AUROC allows us to get a less noisy measure that captures the tradeoff between FPR and TPR across the range of FPRs that we care about. In this paper we use FPR values between 1e-3 and 1e-1. The scores for trained models are an average across three data orderings.
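One plausible implementation of this metric is sketched below; the exact interpolation and normalization used here are not specified in the text, so those choices are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_curve

def log_space_auroc(labels, scores, fpr_min=1e-3, fpr_max=1e-1, n_points=200):
    """Area under the TPR-vs-log10(FPR) curve over [fpr_min, fpr_max],
    normalized by the width of the log range (sketch only)."""
    fpr, tpr, _ = roc_curve(labels, scores)
    # Evaluate the TPR at log-spaced FPR values in the range of interest.
    grid = np.logspace(np.log10(fpr_min), np.log10(fpr_max), n_points)
    tpr_at_grid = np.interp(grid, fpr, tpr)  # roc_curve returns fpr in ascending order
    # Integrate TPR against log10(FPR), then normalize so a perfect classifier scores 1.
    area = np.trapz(tpr_at_grid, np.log10(grid))
    return area / (np.log10(fpr_max) - np.log10(fpr_min))
```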
Training
We trained all of the models for exactly one epoch on our dataset of approximately 70,000 datapoints, or about 2,200 batches with batch size 32. For the partially-trained classifiers, we set the learning rate inversely proportional to the number of layers trained, based on initial experiments on smaller datasets which showed that this closely approximated the learning rate that maximised test performance. These initial experiments also showed that the optimal learning rate for partially-finetuned models is strongly sensitive to how many layers are finetuned, though this sensitivity decreased at smaller batch sizes. We therefore used a small batch size to minimize the risk from suboptimal settings.
We now analyse the performance of different methods.
Single Stage Classifiers
Figure 2 shows the true positive rate (TPR) of the classifiers at different values of false positive rates on our test dataset. Figure 1, above, summarizes the performance of each model and compares their costs.
Figure 2: the full ROC curves for the different trained models (averaging across two dataset order seeds), as well as for the different probing approaches and a prompted baseline. “Base” denotes a pre-trained model trained as a classifier, while “Post-Trained” refers to a model which has been trained to act as a helpful, honest and harmless assistant. The log-space AUROC, the key metric used in this paper, is measured as the area under these curves.
We find that despite retraining only a single layer, the XL 1 retrained layer model performs quite close to the fully-trained XL model, at a level similar to the next-largest model in the sweep, the L Base model. The suffix probe and the EMA probe perform at similar levels to each other, between the S and M models. The probe evaluated only on the mean of the context performs worse than even the smallest tested model.
Interestingly, the strong performance of the partially-trained models is not reflected in their validation loss, which is significantly higher than that of fully-trained models with the same test performance (see Appendix C for more detail). It appears that being based on a larger model gives the partially-trained classifiers improved generalization to out-of-distribution examples, an advantage that does not show up in in-distribution validation loss.
We also test retraining the final 2, 4, 8 and 16 layers of the post-trained model (Figure 3). Perhaps surprisingly, we do not find a clear trend in performance as we increase the number of layers. Instead, the performance of the partially-trained models remains slightly below that of the fully-trained model, and any potential increase from training more layers is within inter-run noise.
Figure 3: Performance of the partially trained classifiers for three dataset order seeds and the average scores as we increase the number of trained layers.
Two Stage Classifiers
We consider different first stages in a two-stage classifier, fixing the second stage to be the best-performing but expensive post-trained XL model. We vary the fraction of inputs which are passed to the fully-trained model classifier to generate curves mapping the trade-offs between cost and performance for each classifier, using the costs outlined in the Costs section.
Figure 4: Cost-performance tradeoff curves for two-stage classifiers with different first stages. For each model, the line represents a series of two-stage classifiers constructed by varying the fraction of datapoints that are sent to the second-stage classifier. For each of these we calculate the log-space AUROC. We compare across first stage classifiers, averaging the AUROC at each cost across multiple training runs to reduce noise.
We find that both the probe-based classifiers and the 1-layer classifiers outperform the smaller models across the cost range, particularly as we push to lower and lower costs, with the single-trained-layer model generally performing better, while the EMA probe outperforms the suffix probe on account of its lower cost.
There’s often a regime in which the performance of the two-stage classifier is slightly higher than using only the strongest classifier on its own, without any averaging of the classifier scores. This might be counterintuitive given that we are using a weaker model to screen the larger model. We think the effect is caused by uncorrelated errors in the models, which means that for sufficiently low first-stage thresholds the first-stage model will screen off many more would-be false positives than would-be true positives. This improvement on a fixed evaluation set may not translate to robustness of the overall system in the presence of adversarial attackers if it is easier to find attacks that reliably do not trigger the first stage.
For the purpose of comparing the methods we only consider a simple two-stage classifier setup, in which any input not passed to the second stage is assumed not to be harmful. At high thresholds, this setup can have worse performance than the first stage as a standalone classifier. In reality the design space of multi-stage classifiers is large. This setup could likely be improved by, for example, incorporating the scores of the first-stage into the second-stage score, or by allowing the first-stage to classify inputs as definitely harmless, definitely harmful or unsure, and only passing the latter to the second stage.
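For example, the second variant could look like the following sketch, where the two thresholds are hypothetical tuning parameters:

```python
def three_way_two_stage(prompt, first_stage, second_stage,
                        low_threshold: float, high_threshold: float) -> float:
    """First stage commits to clear-cut cases and defers only uncertain inputs
    to the expensive second stage (sketch only; thresholds are placeholders)."""
    score = first_stage(prompt)
    if score < low_threshold:
        return 0.0  # confidently harmless
    if score > high_threshold:
        return 1.0  # confidently harmful; flag without invoking the second stage
    return second_stage(prompt)  # uncertain: defer to the expensive classifier
```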
Limitations
These classifiers have not yet been tested against the kind of red-teaming that was performed on the classifiers from the Constitutional Classifiers paper. Since the problem of preventing dangerous misuse is inherently adversarial, red-teaming of the classifiers is necessary to determine whether a classifier actually provides a strong level of robustness against misuse (see Carlini and Wagner 2017, Carlini et al 2019, Sharma et al 2025). Multi-stage classifiers in particular may open up new attacks: if an attacker can make a weak first-stage model certain that the prompt is not dangerous, the prompt will never be seen by the stronger model.
The smaller models tested in these experiments have each been trained on a volume of data proportional to their number of parameters. If these models were more strongly optimized for low inference cost at high volume, they would have been trained on much greater volumes of data, improving their performance. Larger volumes of training data, or small-model-specific techniques such as distillation, may tip the balance towards the use of small dedicated models.
Related Work
Auxiliary classifiers. Training auxiliary models for robustness has been explored by Inan et al (2023) and Chi et al (2024) with the LlamaGuard series of models. Sharma et al (2025) develop synthetic data pipelines for training robust classifiers and use large-scale red-teaming to validate the performance of their methods. Grey Swan have also recently used public jailbreaking challenges to validate the robustness of their LLM defences, combining short-circuiting and auxiliary classifiers. Xie et al (2024) also leverage model internals for classification though they use the pattern of gradients in the guarding model as the additional source of information. There’s also been previous work using multi-stage classifiers for detection such as Yue et al (2023)’s ‘Model Cascades’.
Robustness in the policy model. Rather than using auxiliary models, many previous approaches have tried to build robustness into the policy model itself, so that no monitoring cost is incurred. Casper et al (2024) and Sheshadri et al (2024) use latent adversarial training to improve robustness while Zou et al (2024) use short-circuiting to prevent jailbreaks from counteracting a model’s refusal circuits. Others have focused on finding attacks that beat these defences, including Zou et al (2023) and Jenner et al (2025), while Peng et al (2024) compare methods for quickly patching vulnerabilities in monitoring systems once they are discovered.
Probing for harmful outputs. Using linear probes on model activations has a long history going back at least to Alain and Bengio (2016). Zou et al (2023) and Panickssery et al (2023) use paired data to obtain vectors in activation space to control properties such as sycophancy and honesty. Many recent works have used sparse autoencoders (SAEs), as developed in Bricken et al (2023) and Cunningham et al (2023), or dictionary learning (Yun et al 2021), for unsupervised learning of vectors corresponding to semantic clusters, including many safety-relevant features (see Templeton et al (2024)). Bricken et al (2024) and Kantamneni et al (2025) investigate the use of SAE latents as a tool for prediction with linear probing as a baseline, finding that linear probes are difficult to beat, though SAEs can play a useful role in debugging training datasets by detecting spurious correlations.
We explored linear probes, partial fine-tuning, and multi-stage classification as potential avenues for reducing the cost of a monitoring system. Finetuning only a single layer of the policy model appears to be as powerful as a dedicated classifier with a quarter of the policy model’s parameters, for the cost of a single additional resblock, while a linear probe evaluated at each token performs similarly to a model with 2% of the parameters for minimal computational overhead. Using a single-layer classifier as a first stage in front of the full model can provide zero regression in performance, including on fixed red-teaming inputs, while further reducing the cost. We did not test the adaptive adversarial robustness of these methods, which would be required before real-world deployment. We hope these methods improve AI safety by making high-quality monitoring less onerous.
We’d like to thank Ethan Perez for early planning discussions and draft reviewing, Erik Jones for draft reviewing, as well as Josh Batson, Monte MacDiarmid, Jonathan Marcus, and Andrew Persic for helpful comments during experimentation and writing.
Appendix A:
The bioweapon suffix used by the next-token predictor is:
Appendix B:
The ROC curves shown in Figure 2 and summarized in Figure 1 are an average of two runs using the same data but where a different seed is used to shuffle the data. The figure below shows the different runs as separate lines.
Figure 5: Plots of the TPR against the log-space FPR for individual models, showing two different data-orders for each trained model.
Appendix C:
We compare the loss of the models on in-distribution data, as measured by the loss on the final 100 batches of classifier training, to the log-space AUROC. Figure 6 shows that partially-trained models consistently have far higher validation losses, similar to those of much smaller models whose test performance is far worse.
Figure 6: Validation loss on the training distribution for each model against the log-space AUROC.
Appendix D
Probe Training Details
All the probes are trained using logistic regression with L2 regularization. The regularization strength was 1e2 for the last-token probes and 1e4 for the mean-token probes. The higher regularization on the mean-token probes is necessary for the EMA probes: they are applied token-wise, which produces much greater variance, and lower regularization can cause numerical issues. This parameter was swept over and found to be the best-performing power of 10 in a previous experiment on similar data. Before the regression, the activations are normalised so that the overall distribution of input values has a mean of 0.0 and a standard deviation of 1.0.
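As a concrete illustration of this setup, the sketch below trains a probe on pre-extracted, pooled activation features with scikit-learn. Mapping the regularization strengths above onto scikit-learn's inverse-regularization parameter `C`, and using per-feature standardization, are our own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

def train_probe(features: np.ndarray, labels: np.ndarray, reg_strength: float = 1e2):
    """Sketch: `features` is an (n_examples, d_model) array of pooled activations
    (mean, EMA, or suffix last-token); `labels` marks harmful prompts."""
    # Normalize activations to mean 0 and standard deviation 1. (The text above
    # describes normalizing the overall distribution of input values; per-feature
    # standardization is used here for simplicity.)
    scaler = StandardScaler().fit(features)
    x = scaler.transform(features)

    # L2-regularized logistic regression. scikit-learn's C is an inverse
    # regularization strength, so we take C = 1 / reg_strength (an assumption).
    probe = LogisticRegression(penalty="l2", C=1.0 / reg_strength, max_iter=1000)
    probe.fit(x, labels)
    return scaler, probe
```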