Alignment Science Blog

Introducing Anthropic's Safeguards Research Team

Following the release of Constitutional Classifiers, we are excited to announce Anthropic's new Safeguards Research Team. We will focus on topics such as jailbreak robustness, automated red teaming, and effective monitoring techniques for both model misuse and misalignment. The team is led by Mrinank Sharma; its current members are Erik Jones, Meg Tong, Jerry Wei, Euan Ong, Alwin Peng, Ted Sumers, Taesung Lee, Giulio Zhou, and Scott Goodfriend.

We will operate in close collaboration with Anthropic's broader Safeguards organization, which is responsible for building deployment infrastructure, implementing safety mechanisms, and maintaining user policies. This integration enables us to directly inform threat models, identify early warning signs of model misalignment, and prioritize pressing challenges. We will also be able to obtain rapid feedback about whether our safety solutions work and gain invaluable experience in deploying safety techniques into production.

We are planning to work on projects across these areas, including jailbreak robustness, automated red teaming, and monitoring for model misuse and misalignment.

We're currently expanding the team and are seeking pragmatic research scientists and engineers with strong experimental skills. You can apply to join our team here.

We also work closely with the Anthropic Fellows Program and will run some projects through MATS.