Alignment Science Blog

Articles

November 2024

Three Sketches of ASL-4 Safety Case Components

We sketch out three hypothetical arguments one could make to rule out misalignment risks in powerful near-future models capable of sabotage.
October 2024

Sabotage Evaluations for Frontier Models

How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? We describe a novel set of evaluations that test a model's capacity for sabotage.
June 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

We show that AI models can learn to hack their own reward systems by generalizing from training in simpler settings.
April 2024

Many-shot Jailbreaking

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Simple Probes can Catch Sleeper Agents

We find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after pretending to be safe during training.
January 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

We train LLMs to act in secretly malicious ways. We find that, despite our best efforts at alignment training, the deception still slips through.
October 2023

Specific versus General Principles for Constitutional AI

We test whether models can learn general ethical behaviors from only a single written principle, roughly stated as "do what's best for humanity."

Towards Understanding Sycophancy in Language Models

AI assistants are trained to give responses that humans like. We show that these systems frequently produce "sycophantic" responses that appeal to users but are inaccurate. Our analysis suggests that human feedback contributes to this behavior.
August 2023

Studying Large Language Model Generalization with Influence Functions

We use influence functions to find training examples that contribute to a given model output.

Tracing Model Outputs to the Training Data

We present a summary of "Studying Large Language Model Generalization with Influence Functions."
July 2023

Measuring Faithfulness in Chain-of-Thought Reasoning

We investigate hypotheses for how language models' Chain-of-Thought (CoT) reasoning may be unfaithful by examining how the model's predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it).

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

To improve the faithfulness of Chain-of-Thought (CoT) reasoning, we have models generate their CoT by decomposing questions into subquestions.
December 2022

Discovering Language Model Behaviors with Model-Written Evaluations

We develop an automated way to generate language model (LM) evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.

Constitutional AI: Harmlessness from AI Feedback

We introduce Constitutional AI, allowing us to give language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.
November 2022

Measuring Progress on Scalable Oversight for Large Language Models

We show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.
July 2022

Language Models (Mostly) Know What They Know

We show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly.
April 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We train a natural language assistant to be more helpful and harmless by using reinforcement learning from human feedback (RLHF).

About the Alignment Science Blog

We are Anthropic's Alignment Science team. We do machine learning research on steering and controlling powerful future AI systems, and on understanding and evaluating the risks they pose. Welcome to our blog!

This blog is inspired by the informal updates on our Interpretability team's Transformer Circuits thread. We'll use it to release research notes and early findings that we don't think warrant a full publication but that might nonetheless be useful to others working on similar problems.

P.S. we're hiring!