Alignment Science Blog

Introducing Anthropic's Safeguards Research Team

Following the release of Constitutional Classifiers, we are excited to announce Anthropic's new Safeguards Research Team. We will focus on topics such as jailbreak robustness, automated red teaming, and effective monitoring techniques for both model misuse and misalignment. The team is led by Mrinank Sharma; its current members are Erik Jones, Meg Tong, Jerry Wei, Euan Ong, Alwin Peng, Ted Sumers, Taesung Lee, Giulio Zhou, and Scott Goodfriend.

We will operate in close collaboration with Anthropic's broader Safeguards organization, which is responsible for building deployment infrastructure, implementing safety mechanisms, and maintaining user policies. This integration enables us to directly inform threat models, identify early warning signs of model misalignment, and prioritize pressing challenges. We will also be able to obtain rapid feedback about whether our safety solutions work and gain invaluable experience in deploying safety techniques into production.

We are planning to work on projects across these areas, including jailbreak robustness, automated red teaming, and monitoring for model misuse and misalignment.

We're currently expanding the team and are seeking pragmatic research scientists and engineers with strong experimental skills. You can apply to join our team here.

We also work closely with the Anthropic Fellows Program and will run some projects through MATS.