Alignment Science Blog

Articles

November 2024

Three Sketches of ASL-4 Safety Case Components

We sketch out three hypothetical arguments one could make to rule out misalignment risks in powerful near-future models capable of sabotage.
October 2024

Sabotage Evaluations for Frontier Models

How well could AI models mislead us, or secretly sabotage tasks, if they were trying to? We describe a novel set of evaluations that test a model's capacity for sabotage.
June 2024

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

We show that AI models can learn to hack their own reward systems by generalizing from training in simpler settings.
April 2024

Many-shot Jailbreaking

We study a long-context jailbreaking technique that is effective on most large language models, including those developed by Anthropic and many of our peers.

Simple Probes can Catch Sleeper Agents

We find that probing, a simple interpretability technique, can detect when backdoored "sleeper agent" models are about to behave dangerously, after pretending to be safe during training.
January 2024

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

We train LLMs to act in secretly malicious ways. We find that, despite our best efforts at alignment training, the deception still slips through.
October 2023

Specific versus General Principles for Constitutional AI

We test whether models can learn general ethical behaviors from only a single written principle, roughly stated as "do what's best for humanity."

Towards Understanding Sycophancy in Language Models

AI assistants are trained to give responses that humans like. We show that these systems frequently produce "sycophantic" responses that appeal to users but are inaccurate. Our analysis suggests that human feedback contributes to this behavior.
August 2023

Studying Large Language Model Generalization with Influence Functions

We use influence functions to find training examples that contribute to a given model output.

Tracing Model Outputs to the Training Data

We present a summary of "Studying Large Language Model Generalization with Influence Functions."
July 2023

Measuring Faithfulness in Chain-of-Thought Reasoning

We investigate hypotheses for how language models' Chain-of-Thought (CoT) reasoning may be unfaithful by examining how the model's predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it).

Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

To improve the faithfulness of Chain-of-Thought (CoT) reasoning, we have models generate their CoT by decomposing questions into subquestions.
December 2022

Discovering Language Model Behaviors with Model-Written Evaluations

We develop an automated way to generate language model (LM) evaluations with LMs, significantly reducing the effort involved. We test LMs using >150 LM-written evaluations, uncovering novel LM behaviors.

Constitutional AI: Harmlessness from AI Feedback

We introduce Constitutional AI, allowing us to give language models explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback.
November 2022

Measuring Progress on Scalable Oversight for Large Language Models

We show how humans could use AI systems to better oversee other AI systems, and demonstrate some proof-of-concept results where a language model improves human performance at a task.
July 2022

Language Models (Mostly) Know What They Know

We show that language models can evaluate whether what they say is true, and predict ahead of time whether they'll be able to answer questions correctly.
April 2022

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

We train a natural language assistant to be more helpful and harmless by using reinforcement learning from human feedback (RLHF).

About the Alignment Science Blog

We are Anthropic's Alignment Science team. We do machine learning research on steering and controlling powerful future AI systems, and on understanding and evaluating the risks they pose. Welcome to our blog!

This blog is inspired by the informal updates on our Interpretability team's Transformer Circuits thread. We'll use it to release research notes and early findings that we don't think warrant a full publication but that might nonetheless be useful to others working on similar problems.

P.S. we're hiring!