Alignment Science Blog

Introducing the Anthropic Fellows Program for AI Safety Research


“This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems.” — Jan Leike

We're launching the Anthropic Fellows Program for AI Safety Research, a pilot initiative designed to accelerate AI safety research and foster research talent. The program will provide funding and mentorship for a small cohort of 10-15 Fellows to work full-time on AI safety research. Over the course of six months, Fellows will be matched with Anthropic mentors to investigate AI safety research questions in areas such as Adversarial Robustness, Dangerous Capability Evaluations, and Scalable Oversight.

From our previous external collaborations, we’ve seen that we can speed up progress towards our AI safety priorities by supporting technical talent—regardless of their level of previous experience—in their first forays into safety research. It’s also important for the development of safe AI to incorporate a diversity of perspectives. Providing a new avenue for mentorship and financial support will expand the pool of researchers working at the frontier of AI safety and alignment.

Why we're launching this program

Anthropic has made significant strides in AI safety research, but effectively addressing the challenges ahead requires a broader pool of researchers. Many talented researchers and engineers are interested in AI safety but unsure how to pivot into the field. Our researchers have already run successful collaborations with such technical talent: in the past year we’ve completed about a dozen projects with external collaborators, many of which have been published at top venues.[1]

What Fellows should expect

While Fellows will not be officially employed by Anthropic, we will support them to collaborate full-time with Anthropic researchers for 6 months by providing:

By the end of the program, we aim for every Fellow to have produced a (co-)first authored research paper in AI safety.

Fellows will be expected to work full-time on their research projects. We are open to hosting Fellows who have other obligations, such as coursework, but all Fellows should be able to spend 40 hours per week on their research.

The first cohort of Fellows will undergo a project selection and mentor matching process in March 2025, with onboarding planned for mid-March 2025. If we have subsequent cohorts, they will start on a rolling basis.

Mentors & Research Areas

Potential Anthropic mentors include:

Mentors will lead projects in select AI safety research areas, such as:

Representative projects for each area are listed in the footnotes.

Who we’re looking for

The program focuses on helping mid-career technical professionals transition into AI safety research, but we welcome applications regardless of your background or career stage.

You may be a good fit if you:

We’ll use the following criteria for selection, though we encourage you to apply even if you don’t believe you meet every single qualification; not all strong candidates will:

We value candidates who bring unique perspectives and experiences to AI safety research.

While prior exposure to AI safety research is valuable, we're just as interested in candidates who are new to the field but can demonstrate exceptional technical ability and a genuine commitment to and interest in developing safe and beneficial AI systems.

We particularly encourage applications from underrepresented groups in tech.

How to apply

Applications for the Anthropic Fellows Program are now open! You can find the application, along with further details on the Fellows Program, here.

To be considered for the first cohort, please apply by January 20, 2025. Note that we will be evaluating applications and extending offers on a rolling basis, so we encourage you to apply before the deadline.

If you have any questions, please contact fellows@anthropic.com.

Footnotes

Projects denoted with an asterisk (*) are outputs of previous collaborations between our Anthropic mentors and external AI safety researchers.

[1] These include: Debating with More Persuasive LLMs Leads to More Truthful Answers, Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, and Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

[2] Representative prior research: Debating with More Persuasive LLMs Leads to More Truthful Answers*, Measuring Progress on Scalable Oversight for Large Language Models 

[3] Representative prior research: Expanding our model safety bug bounty program, When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?*, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs*, Many-shot jailbreaking*, Sabotage evaluations for frontier models, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats*

[4] Representative prior research: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs*, Sycophancy to subterfuge: Investigating reward tampering in language models, Inducing Unprompted Misalignment in LLMs*, Reward hacking behavior can generalize across tasks*

[5] Representative prior research: Simple probes can catch sleeper agents, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

[6] Representative prior research: Can LLMs make trade-offs involving stipulated pain and pleasure states?, Looking Inward: Language Models Can Learn About Themselves by Introspection*, Towards Evaluating AI Systems for Moral Status Using Self-Reports*