METR (formerly called ARC Evals) began as the evaluations project at the Alignment Research Center. Its work assesses whether cutting-edge AI systems could pose catastrophic risks to civilization.
As AI systems become more powerful, it becomes increasingly important to ensure they are safe and aligned with our interests. A growing number of experts are concerned that future AI systems pose an existential risk to humanity: in one survey of machine learning researchers conducted by AI Impacts, the median respondent put the chance of an “extremely bad outcome (e.g. human extinction)” at 5%. One way to prepare is to evaluate current systems and receive early warning if new risks emerge.
METR is contributing to a two-step AI governance approach: first, test frontier AI models to see whether they are capable of doing extremely dangerous things; second, if they are, require strong guarantees that they won’t.
METR's current work focuses primarily on evaluating capabilities (the first step above), in particular a capability they call autonomous replication — the ability of an AI system to survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.
METR was given early access to OpenAI’s GPT-4 and Anthropic’s Claude to assess them for safety. They determined that these systems are not capable of “fairly basic steps towards autonomous replication”, but some of the steps they can take are already somewhat alarming. One highly publicised example from METR’s assessment was that GPT-4 successfully pretended to be a vision-impaired human in order to convince a TaskRabbit worker to solve a CAPTCHA for it.
If AI systems could autonomously replicate, what would the risks be?
METR is therefore also exploring the development of safety standards that could ensure that even systems capable or powerful enough to be dangerous do not cause harm. These standards could include security against theft by people who would use the system for harm; monitoring, so that any surprising and unintended behaviour is quickly noticed and addressed; and sufficient alignment with human interests that the system would not choose to take catastrophic actions (for example, by reliably refusing to assist users seeking to use it for harm).
After investigating METR's strategy and track record, Longview Philanthropy recommended a grant of $220,000 from its public fund in 2023. In reference to this grant, Longview shared that they thought METR had one of the most promising and direct paths to impact on AI governance: “test models to see if they’re capable of doing extremely dangerous things; if they are, require strong guarantees that they won’t.”
There are a few other positive indicators of the organisation’s cost-effectiveness:
As of July 2023, METR could make good use of millions of dollars in additional funding over the next 18 months.
Note: At the time of Longview's grant, METR was the evaluations project at the Alignment Research Center. It is now a standalone non-profit.