This page provides more information on the Longtermism Fund’s $110,000 grant to support Professor Martin Wattenberg and Professor Fernanda Viégas in developing their AI interpretability work at Harvard University.
This grant will fund research into methods for examining the inner workings of an AI system and inferring from them why the system produces the results it does. (This is often called mechanistic interpretability.)
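To give a flavour of what "inferring why a system produces its results" can mean, here is a minimal, hypothetical sketch. This is not the grantees' actual method: the feature names, weights, and example below are invented for illustration. For a tiny linear model, every internal parameter can be read directly, so an output can be fully attributed to its inputs; interpretability research aims at something like this transparency for large neural networks, where nothing this simple works.

```python
import numpy as np

# Toy "model": scores an email as spam from three features.
# All names and numbers are made up for illustration.
features = ["contains_link", "all_caps_subject", "known_sender"]
weights = np.array([1.5, 2.0, -3.0])   # the model's internal parameters
x = np.array([1.0, 1.0, 0.0])          # one example email

score = float(weights @ x)             # the model's output

# "Interpretability" step: because the model is linear, the output
# decomposes exactly into per-feature contributions we can inspect.
contributions = weights * x
for name, c in zip(features, contributions):
    print(f"{name}: {c:+.1f}")
print(f"total score: {score:+.1f}")
```

In a modern deep network the parameters number in the billions and interact nonlinearly, so the output does not decompose this neatly; recovering even partial explanations of this kind is the open research problem the grant supports.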
Currently, we have almost no ability to understand why the most powerful AI systems make the decisions they make. They are mostly a “black box”. For example:
This grant supports work to change this. Specifically, it will support Prof. Wattenberg and Prof. Viégas to:
Concretely, most of this funding will help them to (A) set up a compute cluster needed to work on these projects at the cutting edge of AI, and (B) hire graduate students or possibly postdoctoral fellows.
One of Prof. Wattenberg’s and Prof. Viégas’s long-term ambitions is to build dashboards that display what an AI system believes about its user and about itself. This is very ambitious, but success would have a wide range of benefits. Notably for the Fund’s goals, progress toward this could help AI developers detect unwanted processes within AI systems, such as deception; it could perhaps even lead to a more reliable understanding of what goals a system has learned. The feasibility of this kind of progress remains to be seen.
Longview Philanthropy — one of Giving What We Can’s trusted evaluators — recommended this grant after investigating its strength compared to other projects. This investigation involved:
We have varying degrees of information about the cost-effectiveness of our supported programs. We have more information about programs that impact-focused evaluators have looked into (some of which our research team expects to investigate soon as part of their evaluator investigations), as well as programs that we’ve previously included on our list of recommended charities. We think it’s important to share the information we have with donors, as we expect it will be useful in their donation decisions, but we don’t want donors to read too much into the fact that we share more information about some charities than others. Therefore, we want to clarify two things: