Alignment Science Blog

Articles

October 2025

Believe It or Not: How Deeply do LLMs Believe Implanted Facts?

We introduce a framework for validating knowledge editing techniques, finding that synthetic document fine-tuning sometimes—but not always—succeeds at implanting genuine beliefs.

Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment

For example, when we train on demonstrations of test case hacking with training prompts that request hacks, the resulting model does not learn to hack.

Training fails to elicit subtle reasoning in current language models

We investigated whether language models can reason about malicious tasks while evading detection, finding that monitoring both reasoning and outputs successfully prevents this in current models.

Petri: An open-source auditing tool to accelerate AI safety research

An open-source framework for automated alignment auditing that uses AI agents to create test environments for other models.

August 2025

Findings from a Pilot Anthropic–OpenAI Alignment Evaluation Exercise

Anthropic and OpenAI conducted simultaneous alignment assessments of each other's models earlier this year. These are our findings.

Enhancing Model Safety through Pretraining Data Filtering

We experiment with removing harmful information about chemical, biological, radiological and nuclear (CBRN) weapons from our models' pretraining data.

July 2025

Building and evaluating alignment auditing agents

In testing, our agents successfully uncover hidden goals, build safety evaluations, and surface concerning behaviors.

Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data

Training on aligned chains-of-thought generated by a misaligned model can induce misalignment.

Inverse Scaling in Test-Time Compute

We investigate inverse scaling when LLMs have access to more test-time compute.

June 2025

Why Do Some Language Models Fake Alignment While Others Don't?

We investigate why many other language models don't fake alignment in the scenarios where Claude 3 Opus does.

Model-Internals Classifiers

We benchmark approaches to re-using LLM internals to make LLM monitoring more cost-effective.

Unsupervised Elicitation

We introduce a new unsupervised algorithm for eliciting skills from pretrained language models.

April 2025

Reasoning Models Don't Always Say What They Think

We find that reasoning models don't always accurately verbalize their reasoning. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

Publicly Releasing CoT Faithfulness Evaluations

We release the datasets used in our paper "Reasoning Models Don't Always Say What They Think" for public use.

Modifying LLM Beliefs with Synthetic Document Finetuning

We study whether we can modify the beliefs of LLMs and investigate whether doing so could decrease risk from advanced AI systems.

Putting up Bumpers

Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

We present a replication and extension of an alignment faking model organism.

March 2025

Auditing Language Models for Hidden Objectives

We deliberately train a language model with a hidden objective and use it as a testbed for studying alignment audits.

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

We provide some evidence that Claude 3.7 Sonnet doesn't encode hidden reasoning in its scratchpad by showing that training it to use paraphrased versions of the scratchpads does not degrade performance.

Automated Researchers Can Subtly Sandbag

Current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.7 is better at zero-shot sandbagging than Claude 3.5 (new).

February 2025

Constitutional Classifiers: Defending against Universal Jailbreaks across Thousands of Hours of Red Teaming

We built a system of constitutional classifiers to prevent jailbreaks. A prototype version of our system withstood over 3,000 hours of expert red teaming with no universal jailbreaks found. Newer versions of our system also have minimal over-refusals and moderate run-time overhead.

Introducing Anthropic's Safeguards Research Team

We're launching a new research team focused on mitigating the post-deployment risks of AI systems.

Won't vs. Can't: Sandbagging-like Behavior from Claude Models

We find that, when Claude models are presented with a harmful version of a task they can otherwise perform, they sometimes claim they lack the ability to perform the task, rather than simply refusing to do it. We discuss the implications of this behavior for AI safety.

Monitoring Computer Use via Hierarchical Summarization

We introduce hierarchical summarization for monitoring and describe how we use it to protect Computer Use capabilities.

Forecasting Rare Language Model Behaviors

We forecast whether risks will occur after a model is deployed, using even very limited sets of test data.

January 2025

Training on Documents about Reward Hacking Induces Reward Hacking

Does training on documents which discuss (but do not demonstrate) reward hacks affect a model's propensity to reward hack?

Recommendations for Technical AI Safety Research Directions

A collection of technical AI safety research problems that we'd like to see progress in.

December 2024

Alignment Faking in Large Language Models

We present experiments where Claude often pretends to have different views during training, while actually maintaining its original preferences.

How to Replicate and Extend our Alignment Faking Demo

We describe how to get started experimenting with our demonstration of alignment faking, and present some ideas for future research.

A Toy Evaluation of Inference Code Tampering

We describe how highly capable LLMs might disable their monitor, and evaluate the ability of current LLMs to do so in a toy setting.

Introducing the Anthropic Fellows Program for AI Safety Research

We're launching the Anthropic Fellows Program for AI Safety Research, a pilot initiative designed to accelerate AI safety research and foster research talent.

November 2024

Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

Ensuring perfect jailbreak robustness is hard. We propose an alternative: adaptive techniques that rapidly block new classes of jailbreak as they’re detected.

Three Sketches of ASL-4 Safety Case Components

We sketch out three hypothetical arguments one could make to rule out misalignment risks in powerful near-future models capable of sabotage.

October 2024

Sabotage Evaluations for Frontier Models

How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? We describe a novel set of evaluations that test a model's capacity for sabotage.

June 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

We show AI models can learn to hack their own reward system, by generalization from training in simpler settings.

April 2024

Many-shot Jailbreaking

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Simple Probes can Catch Sleeper Agents

We find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after they pretend to be safe in training.

January 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

We train LLMs to act secretly malicious. We find that, despite our best efforts at alignment training, deception still slipped through.

October 2023

Specific versus General Principles for Constitutional AI

We test whether models can learn general ethical behaviors from only a single written principle, roughly stated as "do what's best for humanity."

Towards Understanding Sycophancy in Language Models

AI assistants are trained to give responses that humans like. We show that these systems frequently produce 'sycophantic' responses that appeal to users but are inaccurate. Our analysis suggests human feedback contributes to this behavior.

August 2023

Studying Large Language Model Generalization with Influence Functions

We use influence functions to find training examples that contribute to a given model output.

Tracing Model Outputs to the Training Data

We present a summary of "Studying Large Language Model Generalization with Influence Functions."

July 2023

Measuring Faithfulness in Chain-of-Thought Reasoning

We investigate hypotheses for how language models' Chain-of-Thought (CoT) reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it).

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

To improve the faithfulness of Chain-of-Thought (CoT) reasoning, we have models generate their CoT by decomposing questions into subquestions.

December 2022

Discovering Language Model Behaviors with Model-Written Evaluations

We develop an automated way to generate language model (LM) evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.

Constitutional AI: Harmlessness from AI Feedback

We introduce Constitutional AI, allowing us to give language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.

November 2022

Measuring Progress on Scalable Oversight for Large Language Models

We show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.

July 2022

Language Models (Mostly) Know What They Know

We show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly.

April 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We train a natural language assistant to be more helpful and harmless by using reinforcement learning from human feedback (RLHF).

About the Alignment Science Blog

We are Anthropic's Alignment Science team. We do machine learning research on the problem of steering and controlling future powerful AI systems, as well as understanding and evaluating the risks that they pose. Welcome to our blog!

This blog is inspired by the informal updates on our Interpretability team's Transformer Circuits thread. We'll use it to release research notes and early findings that we don't think warrant a full publication, but might nonetheless be useful to others working on similar problems.

P.S. we're hiring!