March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Better version of my Twitter newsletter.
I’m not talking about any of the recent letters here. This is not an AI policy newsletter. For the record, I sympathize with any side that can make intelligent good-faith arguments rather than just trying to score political points. My tip on how to proceed: don’t form political teams just yet! You can support arguments and concrete decisions, not sides.
Core Views on AI Safety: When, Why, What, and How
Anthropic views. Their AI safety agenda is empirical research on their own powerful models. Key directions:
mechanistic interpretability
scalable oversight without human feedback
learning processes instead of outcomes
red-teaming for dangerous failure modes
understanding generalization
predicting and evaluating societal impacts
I hope the mechanistic interpretability thing works out! Scalable oversight is definitely necessary, as is applying it to reasoning processes instead of just to outputs.
By “understanding generalization”, in practice they want to solve problems such as attributing capabilities to specific training data. I guess that’s a cool research direction, but it’s not clear how it helps with safety, or that it’s possible at all given how LLM pretraining works.
As with all research labs that decide to train their own AGIs, fingers crossed that their capabilities team does not accidentally overshoot.
The Waluigi Effect (mega-post)
The waluigi hypothesis: if you train or pre-prompt a model to have a property P (the luigi), it becomes easier to elicit the exact opposite of P, because an agent that merely pretends to have P (the waluigi) remains a plausible interpretation of the same text.
Note that this is not an April Fools’ entry. It seems useful for understanding jailbreaks such as DAN and some RLHF failures. See also the author’s other Simulator-ish post.
AI Safety in a World of Vulnerable Machine Learning Systems
Adversarial robustness. It is a recipe for disaster to defer solving robustness indefinitely: most scalable AGI safety ideas rely on weaker AIs overseeing stronger ones, and the stronger AIs could exploit vulnerabilities in their overseers. As they say, attacks only get better.
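To make the worry concrete, here is a toy sketch of the classic fast gradient sign method in PyTorch, the kind of attack a model could in principle mount against a differentiable overseer. Everything here (the `overseer` scorer, the function name) is my own illustrative assumption, not something from the post.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(overseer, x, label, epsilon=0.01):
    """Fast Gradient Sign Method: a one-step input perturbation that often
    flips a classifier's verdict. Toy illustration; `overseer` is assumed to
    be any differentiable scorer (e.g. a reward or moderation model)."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(overseer(x), label)
    loss.backward()
    # Nudge every input coordinate slightly in the direction that increases
    # the loss; the perturbation is tiny but structured.
    return (x + epsilon * x.grad.sign()).detach()
```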
Eliciting Latent Predictions from Transformers with the Tuned Lens
Tuned lens. You’ve heard of the “logit lens” for decoding hidden LLM states into words? Turns out, different layers “speak a different language” (meaning is expressed in a different basis), and you can decode better by applying a learned affine transformation before unembedding.
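A minimal sketch of the idea in PyTorch (my own illustrative code, not the paper’s implementation): keep one affine “translator” per layer, trained so that its decoded distribution matches the model’s final-layer predictions (the paper uses a KL-divergence objective for this).

```python
import torch
import torch.nn as nn

class TunedLens(nn.Module):
    """One learned affine 'translator' per layer (illustrative sketch)."""

    def __init__(self, num_layers: int, d_model: int):
        super().__init__()
        # Identity initialization: an untrained tuned lens reduces to the
        # plain logit lens.
        self.translators = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(num_layers)]
        )
        for t in self.translators:
            nn.init.eye_(t.weight)
            nn.init.zeros_(t.bias)

    def forward(self, hidden: torch.Tensor, layer: int,
                final_ln: nn.Module, unembed: nn.Module) -> torch.Tensor:
        # Map the layer's basis into the final layer's basis, then reuse the
        # model's own final layer norm and unembedding to get logits.
        return unembed(final_ln(self.translators[layer](hidden)))
```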
What does it take to catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring
How can international limits on huge weaponized AI training runs be verified by inspection, similar to nuclear treaties? Make all sides use verified firmware that logs hashes of key training parameters (without leaking the model itself).
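As a toy illustration of the commit-and-verify idea (a heavy simplification of the paper’s actual scheme; the function names and log format below are hypothetical):

```python
import hashlib
import time

def log_weight_snapshot(weights_bytes: bytes, run_id: str, logbook: list) -> dict:
    """Firmware side: append a commitment (hash) to the current weights,
    never the weights themselves."""
    entry = {
        "run_id": run_id,
        "timestamp": time.time(),
        "weights_sha256": hashlib.sha256(weights_bytes).hexdigest(),
    }
    logbook.append(entry)
    return entry

def verify_claim(disclosed_weights: bytes, entry: dict) -> bool:
    """Inspector side: re-hash weights disclosed during an inspection and
    check them against the tamper-resistant log."""
    return hashlib.sha256(disclosed_weights).hexdigest() == entry["weights_sha256"]

# Usage sketch: chips log hashes during training; an inspector later checks
# that a claimed training run is consistent with the log.
logbook = []
entry = log_weight_snapshot(b"serialized-weights", run_id="run-042", logbook=logbook)
assert verify_claim(b"serialized-weights", entry)
```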
It is rare to see a technical AI governance paper. See the author’s thread for more.
ARC Evals of Dangerous Capabilities in GPT-4
The Alignment Research Center and OpenAI tested whether GPT-4 has the capabilities to escape containment and copy itself elsewhere. Luckily, not yet! Although there is this thing:
There were also some controversies about the setup: the model tested was not the final version of GPT-4, the use of plugins was not adequately explored, and since the testing was non-binding, there is no guarantee OpenAI would have done anything differently had dangerous capabilities been found.
Natural Selection Favors AIs over Humans
Dan Hendrycks thinks that the future will hold multiple competing AIs that will be subject to competitive pressures. Competition erodes our control, because non-transparency, selfishness, power-seeking, and deception are selected for.
There is already empirical evidence: inscrutable deep learning methods have outperformed interpretable ones, and each performance gain has come directly at the expense of our ability to understand what the models are doing.
I listened to him talk about this last year and highly recommend it.