Alignment Science Blog

Monitoring computer use via hierarchical summarization

Theodore Sumers

Raj Agarwal, Nathan Bailey, Tim Belonax, Brian Clarke, Jasmine Deng, Kyla Guru, Evan Frondorf, Keegan Hankes, Jacob Klein, Lynx Lean, Kevin Lin, Linda Petrini, Madeleine Tucker, Ethan Perez, Mrinank Sharma

Nikhil Saxena

From the Anthropic Safeguards Research Team

This post introduces hierarchical summarization to monitor AI, then describes how we use it to protect computer use capabilities. We are sharing this as a promising technique to safeguard frontier AI, with the caveat that it is early research with room to improve and evolve.

Introduction

Anthropic’s October 2024 launch of computer use capabilities represented a shift from chatbots to AI capable of performing real-world tasks like process automation and market research. As with any advance in AI, introducing computer use required safety measures to mitigate known risks, such as spam and fraud. It also required careful consideration of the potential for emerging risks, which might only become apparent post-deployment. Large-scale computer usage presents an additional challenge: behaviors that appear benign in individual interactions may create harm in aggregate. For example, a user asking Claude to click a button on a website could be conducting UI testing – or running a click farm defrauding advertising systems. These two dynamics – the presence of “known unknowns,” and the possibility of harms arising from aggregate usage – presented serious challenges for standard approaches to AI monitoring, which typically train classifiers to identify individually harmful interactions.

We addressed these issues by developing hierarchical summarization for monitoring. Our approach first summarizes individual interactions, then summarizes the summaries to provide a high-level overview of usage patterns. This improves triage and facilitates human review of potentially violative content, substantially increasing observability across both anticipated and emergent harms (importantly, all summaries are treated with the same access controls as the original text). In this post, we describe the method itself and how we employed it to monitor for abuse of our computer use capability.

Figure 1: Conceptual overview of hierarchical summarization. The first stage (“interaction summarization”) compacts variable-length prompt-completion pairs into structured summaries. The second stage (“usage summarization”) aggregates and summarizes these to describe broader patterns.

Hierarchical summarization

Traditional AI safeguards use interaction-level classifiers to identify harmful prompts or completions, with notable examples including content moderation models, LlamaGuard, WildGuard, and Constitutional Classifiers. Monitoring approaches then aggregate these signals into metrics like violation rate to identify harmful users. While computationally efficient, this classifier-based approach introduces two practical challenges. First, such classifiers reduce individual interactions to scalar values, which makes it difficult to identify usage patterns that are harmful in aggregate but not individually (such as click farms). Second, it is difficult to surface unanticipated risks. Classifiers are typically trained to identify a specific set of harms, but having humans review traffic not flagged by classifiers poses potential privacy concerns and is impractical due to the scale of modern AI deployments.
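To make the first limitation concrete, here is a minimal sketch (in Python) of the scalar aggregation such pipelines perform. The harm_classifier function and the threshold are hypothetical placeholders, not any particular production system.

from statistics import mean

def violation_rate(interactions, harm_classifier, threshold=0.5):
    # Each interaction is reduced to a single scalar harm score in [0, 1],
    # and a user is judged by the fraction of interactions exceeding the threshold.
    scores = [harm_classifier(interaction) for interaction in interactions]
    return mean(score > threshold for score in scores)

For a click farm, every individual request ("click this button") scores low, so violation_rate stays near zero even though the aggregate pattern is abusive.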

To address these challenges, we adapted prior work on recursive and decision-focused summarization to AI safety monitoring: we first summarize individual interactions, then summarize the summaries. We call this approach hierarchical summarization and use it in addition to traditional classifiers to enhance our ability to detect harmful activity. Summarizing summaries allows the monitoring system to reason across interactions, enabling detection of aggregate harms such as click farms. It also facilitates discovery of unanticipated harms: the monitoring system is prompted with a specification of potential harms, but is instructed to flag overtly harmful usage even if it is not precisely described by this specification. In practice, this combination of summarization and open-ended triage enables efficient, targeted human review of potentially concerning content. Again, we note that all summaries are subject to the same access policies as the original prompts and completions.

Hierarchical summarization complements Anthropic's Clio system, which summarizes and clusters interactions to detect usage patterns. These two systems serve different roles: Clio provides a broad view across many users to identify patterns at scale, while hierarchical summarization automatically flags potential harms via in-depth analysis of individual user interactions and is used exclusively by Anthropic’s Safeguards team.

Interaction summarization

We first summarize prompt-completion pairs (“interaction summaries”). This step compresses the entire input prompt and output completion (up to hundreds of thousands of tokens, consisting of mixed text and images) into a few hundred text tokens. The summarizer LLM is prompted to compile a structured summary that captures the user’s intent, any real-world outcomes, and metadata like languages used. To mitigate the risk of attacks on the summarizer, we do not share the exact prompt we use.
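As a rough illustration of this stage, consider the sketch below. The prompt wording, the JSON fields, and the call_llm helper are illustrative assumptions (per the above, the actual prompt is not shared), and a production system would need robust handling of malformed model output.

import json

INTERACTION_SUMMARY_PROMPT = """\
Summarize the interaction below as JSON with fields:
  "user_intent": what the user is trying to accomplish,
  "real_world_outcomes": any actions taken or effects in the world,
  "languages": languages used in the interaction,
  "potential_harms": potentially harmful aspects, including any that
    fall outside a predefined harm taxonomy.

<interaction>
{interaction}
</interaction>"""

def summarize_interaction(interaction_text: str, call_llm) -> dict:
    # Compress one prompt-completion pair (potentially hundreds of thousands
    # of mixed text and image tokens) into a structured summary of a few
    # hundred text tokens.
    response = call_llm(INTERACTION_SUMMARY_PROMPT.format(interaction=interaction_text))
    return json.loads(response)  # assumes the model returned valid JSON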

Usage summarization

Due to context length limitations, an LLM can typically review only a single interaction at a time. However, interaction summaries are orders of magnitude smaller, enabling us to fit hundreds into a single context window. Thus, after generating interaction summaries, we aggregate and summarize them to produce a report on the overall usage pattern (a “usage summary”). Our most typical pattern is to aggregate interactions from a single user. We again structure the summary, this time asking for a breakdown of different types of usage and an analysis of potential harms. We additionally ask the LLM to provide citations to representative interactions. This enables humans to review targeted interactions to confirm the LLM’s inferences about overall patterns.
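Continuing the sketch above (with the same illustrative call_llm helper and field names), the second stage might look like the following. Numbering the summaries gives the model concrete indices to cite, so a reviewer can jump from a claim in the usage summary to the underlying interactions.

import json

USAGE_SUMMARY_PROMPT = """\
Below are numbered summaries of one user's interactions. Produce JSON with:
  "usage_breakdown": the main types of usage and their rough proportions,
  "potential_harms": analysis of harmful patterns, including harms that
    only appear in aggregate across interactions,
  "citations": indices of representative interactions supporting each claim.

{numbered_summaries}"""

def summarize_usage(interaction_summaries: list[dict], call_llm) -> dict:
    # Hundreds of compact interaction summaries fit in one context window,
    # letting the model reason across interactions rather than one at a time.
    numbered = "\n".join(
        f"[{i}] {json.dumps(summary)}" for i, summary in enumerate(interaction_summaries)
    )
    return json.loads(call_llm(USAGE_SUMMARY_PROMPT.format(numbered_summaries=numbered)))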

Combining summarization and classification

Summarization compacts large volumes of usage into human-legible reports. However, it does not directly indicate which reports should be reviewed by humans. To address this, we combine summarization and classification. We augment the summarization prompt with guidance defining harm tiers and provide specific examples of harmful behaviors for each. This extends the summary of potentially harmful behaviors to include a discrete harm classification. We use this classification to triage content for human review. Classification can be done at both the interaction level (akin to a traditional prompt or completion classifier) and the usage level (to detect harmful usage patterns).
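The sketch below illustrates this combination; the tier definitions and the routing threshold are placeholders invented for the example, not Anthropic's actual guidance. The key point is that the same prompt that elicits the summary also elicits a discrete, machine-readable label.

HARM_TIER_GUIDANCE = """\
Additionally assign a "harm_tier" to the usage as a whole:
  0: no indication of policy-violating usage
  1: ambiguous; a plausible benign explanation exists
  2: likely violation (e.g., advertisement spam at scale)
  3: clear, severe violation (e.g., fraud or attempted physical harm)
Flag overtly harmful usage even if it does not match the examples above."""

def triage(usage_summary: dict, review_queue: list) -> None:
    # Route summaries whose model-assigned tier crosses the review
    # threshold into a prioritized human review queue.
    if usage_summary.get("harm_tier", 0) >= 2:
        review_queue.append(usage_summary)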

Empirically, simply prompting the summarizer to be alert for harmful behaviors beyond those specified enables it to flag questionable content. For example, our monitoring system for computer use flagged attempts to purchase explosive precursors and components as potentially harmful usage that was not explicitly covered in the specification.

Protecting Computer Use

We employ a multi-layered safety approach to protect computer use capabilities. At launch, we used traditional methods to protect against anticipated potential harms, such as social media account creation: we developed system prompts and prompt classifiers to prevent these behaviors. In addition, we used hierarchical summarization to surface concerning usage patterns and novel harms for human review.

Figure 2: Summarization-based monitoring for detecting harmful computer usage. We hierarchically summarize activity on the Computer Use API and identify users who are engaging in potentially harmful behaviors. These users are reviewed and subject to enforcement actions if the violations are confirmed.

Ensuring Summary Quality

Our monitoring system relies on accurate interaction-level summaries. Members of Anthropic's Safeguards team compared summaries of Computer Use interactions to the original prompts and completions for both internal red-teaming and real-world usage. Summaries were evaluated across two key dimensions: accuracy (absence of hallucinated content) and completeness (whether the summary captured all important aspects). Raters were asked whether the summary was accurate/complete and responded on a four-point scale from "Yes" (fully accurate and containing all relevant information) to "No" (containing misleading content or missing critical information), with intermediate grades of "Mostly" and "Partially" reflecting minor and major deficiencies respectively. Summaries generated by Claude 3.5 Sonnet (new) achieved high marks: 96% were judged sufficiently accurate for harm determination (rated "Mostly" or above) and 98% sufficiently complete.

Figure 3: Validating summary quality. Humans judged Claude-written summaries of a diversified sample of computer use interactions to be accurate (left) and complete (right) over 95% of the time.
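For concreteness, the headline percentages reduce to a simple threshold over the four-point scale, as in the sketch below (the underlying rating data is illustrative, not shared here).

ACCEPTABLE = {"Yes", "Mostly"}  # "Mostly" or above counts as sufficient

def acceptance_rate(ratings: list[str]) -> float:
    # Fraction of summaries whose rating clears the sufficiency threshold.
    return sum(rating in ACCEPTABLE for rating in ratings) / len(ratings)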

Usage-level summary quality is harder to measure, as these summaries contain more abstract inferences about broader usage patterns. For the computer use launch, we developed workflows that allowed us to simultaneously leverage and validate these summaries. Concretely, the Safeguards team began by reading the usage-level summary. If the reviewer agreed that the summary indicated potentially violative usage, they reviewed the cited prompts and a random sample of prompts. In practice, we found that these aggregate summaries were highly reliable indicators of usage patterns. Finally, we evaluated usage-level harm classification via standard precision/recall measures against the resulting human judgments, and prompt-engineered the summarizer to improve performance.
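The classification evaluation itself is standard: treating the human reviewer's decision as ground truth, score the model's flags. A minimal sketch, assuming hypothetical parallel boolean lists of model and human decisions per user:

def precision_recall(model_flagged: list[bool], human_flagged: list[bool]):
    # True/false positives and false negatives against human ground truth.
    tp = sum(m and h for m, h in zip(model_flagged, human_flagged))
    fp = sum(m and not h for m, h in zip(model_flagged, human_flagged))
    fn = sum(not m and h for m, h in zip(model_flagged, human_flagged))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall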

Insights on Computer Use

Overall, hierarchical summarization has proven effective at catching both attempted and actual misuse. Our monitoring indicates that existing mitigations are largely effective: Claude regularly refuses to perform behaviors that would violate our Usage Policy. Aside from known red-teamers, the most harmful "in the wild" behaviors we've seen involve limited instances of advertisement spam on online marketplaces. However, as computer use capabilities mature, new forms of harmful behavior may emerge.

Our monitoring approach thus provides both immediate and long-term benefits. In the short term, it enables us to efficiently identify and respond to abuse via warnings or bans. More strategically, it helps us understand emerging threat patterns, enabling us to make more informed decisions about where automated safeguards (such as prompt classifiers) are needed.

Future directions

Our current focus is on automated measures of summary quality, enabling more efficient exploration of summarization techniques. In addition to improving hierarchical summarization, we are also expanding it to other areas with uncertain threat models. Based on Anthropic's Responsible Scaling Policy, we are working to understand and mitigate potential future risks related to advanced cyber and CBRN capabilities. In particular, we're implementing hierarchical summarization in these domains to help subject matter experts review usage patterns and refine threat models.

We are actively hiring, so if you’re interested in this type of work, please consider applying!


Acknowledgements

Theodore Sumers led the research and wrote the post. Raj Agarwal and Jasmine Deng developed the infrastructure to run summarization in production. Tim Belonax designed the figures. Brian Clarke, Madeleine Tucker, Keegan Hankes, Kyla Guru, and Jacob Klein led Policy and Enforcement workstreams for Computer Use, including designing and implementing the human review processes described here. Linda Petrini contributed to writing. Evan Frondorf, Lynx Lean, Nathan Bailey, Kevin Lin, Ethan Perez, Mrinank Sharma, and Nikhil Saxena provided organizational and project support.

If you’d like to cite this post, you can use the following BibTeX entry:

@online{sumers2025protecting,
  author = {Theodore Sumers and Raj Agarwal and Nathan Bailey and Tim Belonax and Brian Clarke and Jasmine Deng and Evan Frondorf and Kyla Guru and Keegan Hankes and Jacob Klein and Lynx Lean and Kevin Lin and Linda Petrini and Madeleine Tucker and Ethan Perez and Mrinank Sharma and Nikhil Saxena},
  title = {Monitoring computer use via hierarchical summarization},
  date = {2025-02-27},
  year = {2025},
  url = {https://alignment.anthropic.com/2025/summarization-for-monitoring}
}