Alignment Science Blog

Introducing the Anthropic Fellows Program for AI Safety Research


“This is an exceptional opportunity to join AI safety research, collaborating with leading researchers on one of the world's most pressing problems.” — Jan Leike

We're launching the Anthropic Fellows Program for AI Safety Research, a pilot initiative designed to accelerate AI safety research and foster research talent. The program will provide funding and mentorship for a small cohort of 10-15 Fellows to work full-time on AI safety research. Over the course of six months, Fellows will be matched with Anthropic mentors to investigate AI safety research questions in areas such as Adversarial Robustness, Dangerous Capability Evaluations, and Scalable Oversight.

From our previous external collaborations, we’ve seen that we can speed up progress towards our AI safety priorities by supporting technical talent—regardless of their level of previous experience—in their first forays into safety research. It’s also important for the development of safe AI to incorporate a diversity of perspectives. Providing a new avenue for mentorship and financial support will expand the pool of researchers working at the frontier of AI safety and alignment.

Why we're launching this program

Anthropic has made significant strides in AI safety research, but effectively addressing the challenges ahead requires a broader pool of researchers. Many talented researchers and engineers are interested in AI safety but unsure how to pivot into the field. Our researchers have already run successful collaborations with such technical talent: in the past year we’ve completed about a dozen projects with external collaborators, many of which have been published at top venues.[1]

What Fellows should expect

While Fellows will not be officially employed by Anthropic, we will support them to collaborate full-time with Anthropic researchers for 6 months by providing:

By the end of the program, we aim for every Fellow to have produced a (co-)first authored research paper in AI safety.

Fellows will be expected to work full-time on their research projects. We are open to hosting Fellows who have other obligations, such as coursework, but all Fellows should be able to spend 40 hours per week on their research.

The first cohort of Fellows will undergo a project selection and mentor matching process in March 2025, with onboarding planned for mid-March 2025. If we have subsequent cohorts, they will start on a rolling basis.

Mentors & Research Areas

Potential Anthropic mentors include:

Mentors will lead projects in select AI safety research areas, such as:

Representative projects for each area are listed in the footnotes.

Who we’re looking for

The program focuses on helping mid-career technical professionals transition into AI safety research, but we welcome applications regardless of your background or career stage.

You may be a good fit if you:

We’ll use the following criteria for selection, though we encourage you to apply even if you don’t believe you meet every single qualification; not all strong candidates will:

We value candidates who bring unique perspectives and experiences to AI safety research.

While prior exposure to AI safety research is valuable, we're just as interested in candidates who are new to the field but can demonstrate exceptional technical ability and a genuine commitment to and interest in developing safe and beneficial AI systems.

We particularly encourage applications from underrepresented groups in tech.

How to apply

Applications for the Anthropic Fellows Program are now open! You can find the application, along with further details on the Fellows Program, here.

To be considered for the first cohort, please apply by January 20, 2025. Note that we will be evaluating applications and extending offers on a rolling basis, so we encourage you to apply before the deadline.

If you have any questions, please contact fellows@anthropic.com.

Footnotes

Projects denoted with an asterisk (*) are outputs of previous collaborations between our Anthropic mentors and external AI safety researchers.

[1] These include: Debating with More Persuasive LLMs Leads to More Truthful Answers, Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting, and Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

[2] Representative prior research: Debating with More Persuasive LLMs Leads to More Truthful Answers*, Measuring Progress on Scalable Oversight for Large Language Models 

[3] Representative prior research: Expanding our model safety bug bounty program, When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?*, Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs*, Many-shot jailbreaking*, Sabotage evaluations for frontier models, Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats*

[4] Representative prior research: Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs*, Sycophancy to subterfuge: Investigating reward tampering in language models, Inducing Unprompted Misalignment in LLMs*, Reward hacking behavior can generalize across tasks*

[5] Representative prior research: Simple probes can catch sleeper agents, Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

[6] Representative prior research: Can LLMs make trade-offs involving stipulated pain and pleasure states?, Looking Inward: Language Models Can Learn About Themselves by Introspection*, Towards Evaluating AI Systems for Moral Status Using Self-Reports*