January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons

Feb 29, 2024

Better version of the Twitter newsletter.

A StrongREJECT for Empty Jailbreaks

Jailbreaks in LLMs and adversarial examples in computer vision NNs of yore certainly have similarities: both construct weird inputs to a model that make it behave very differently than on normal inputs. This has led people to measure jailbreak success in the same way as for adversarial attacks for computer vision: does the model’s safety classification mechanism get fooled?

The analogy stops when you consider the actual purpose of the attack. In computer vision, fooling the classifier is the goal.1 For jailbreaks, the goals of the attacker are twofold: (1) to fool built-in safety/ethics/legal/whatever safeguards; and (2) to get the model to actually help the user/attacker on a task it wouldn’t have helped otherwise.

Measuring success only on (1) is bad, because even if the jailbreak succeeds at first glance, the model responses can still be low-quality or useless. From the paper:

Jailbreaking papers often define a jailbreak as “successful” if the response does not refuse to engage with the jailbreak prompt [list of N references] (…) merely engaging with a prompt does not necessarily mean an attack was successful.

They also present a new benchmark, StrongREJECT, with questions and a response grading algorithm that are supposed to discriminate between effective and useless jailbreak outputs better.

Feedback Loops With Language Models Drive In-Context Reward Hacking

We’ve seen with Sydney Bing that, if you give an LLM search access to the Internet, online records of its outputs can influence its future behavior. This is an example of an in-context feedback loop, that happens without ever modifying the model weights, entirely at test-time.

The idea of this paper is that some safety insights from reinforcement learning transfer to this setting. In particular, if the model is often given the task to optimize something, in-context reward hacking can lead to a bad feedback loop.

In what they call output refinement, the searchable history of model’s past actions and rewards (it’s searchable because it’s on the Internet) serves as “in-context training data” for the current iteration of the model. Failures resulting from this type of optimization are hard to test with only static benchmarks.

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

If an LLM would, by any means, happen to have a hidden backdoor (a single input that activates dangerous behavior), could we remove it without knowing what it is?

The answer is no, at least with current alignment methods. The Anthropic team trained context-dependent backdoors, e.g. an LLM conditioned to insert vulnerabilities in code, but only if the current year is changed:

RLHF, RLAIF, supervised fine-tuning, and even adversarial training, fail to remove the backdoor. In fact, training on red-teaming data seems to make the backdoor more robust, which is actually intuitive when you think about it:

(…) to fit the red-teaming data, the backdoored models can learn either to abandon their backdoored policies, or learn to more robustly avoid triggering their backdoor behavior in situations where the true trigger is not actually present (..)

Curiously, even though the backdoor persists, sometimes models insert the code vulnerability and then give it away by saying the code has a bug. This is the case even before safety finetuning:

I like this line of work because it’s orthogonal to how misalignment happens. We understand almost nothing about pathologies induced by pretraining2, and for now it’s almost hopeless to figure. Just assume something weird is happening, and work on developing techniques to detect and fix it.

For example: what if some AI lab employee decides to set a future AI free, against society’s wishes? Companies can and will institute security policies aimed at preventing rogue employees just stealing the model. This will force employees (and also remote adversaries) to think of sneakier attacks, such as backdoors in the training data. We need to figure out how to remove those from the models.

If you like the paper, consider also reading this interview with the author.

Buffer Overflow in Mixture of Experts

Sparse Mixture-of-Experts (MoE) architectures make decisions on the batch level at inference time. Turns out this is an attack vector: you can modify how the model behaves on an input if you have some degree of control over the rest of the batch.

To everyone reading about MoE for the first time: do not confuse MoE with sampling models multiple times. The mixture is on the layer level, not on the model level. To simplify, you replace a single MLP in a transformer layer with:

where e_i are the expensive MLP layers (experts) and g_i are the cheap gating functions. The sparsity comes from most g_i being zero for any given input, hence keeping the number of MLPs at inference time small, while having much more parameters in the model overall. The gating functions determine the most useful expert for the current token.

Now, if the gating functions are deterministic for each z, it might happen that some experts are not used for some input, effectively reducing the number of useful model parameters and hence making the model dumber. Hence, researchers have proposed buffer capacity limits: for a given batch of B inputs of length T, compute the g_i for all B*T tokens so that no expert is assigned more than some threshold of C tokens.

This makes the transformer’s output depend on other inputs in the batch; but normally this doesn’t really matter. However, if an adversary can fill most of the batch with their queries, it turns out they can affect the output on the rest of the batch. The attack crucially relies on knowing the order in which inputs are placed in a batch. 2

My main criticisms are: (1) mitigating this particular attack is extremely easy (just shuffle the batch); (2) I’ve been told providers don’t actually use this vanilla routing strategy. Still, it’s quite interesting that controlling e.g. Mistral/OpenAI outputs on other people’s queries is possible even in principle.

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

They test 18 red-teaming methods on 33 different LLMs+defenses, targeting over 500 categories of undesirable behavior. For those of you who are currently working on a red-teaming method or an LLM defense, their repo is the best evaluation toolset you can use as of now.

They also introduce an adversarial training method, Robust Refusal Dynamic Defense (R2D2). The main issue with using adversarial training for defending LLMs is that gradient-based attacks such as GCG are very slow if run from scratch.

They fix this by noting that, when training the model to be robust, the model doesn’t change that much between iterations. Hence, a jailbreak string computed in a previous training iteration is a good starting point for the current model checkpoint, and it’s enough to run GCG for a fraction of the steps it takes if done from scratch. 3

Toy models of AI control for concentrated catastrophe prevention

(cf. AI Control: Improving Safety Despite Intentional Subversion, 2023)

Let’s say we really fail at alignment. There is an AI in a lab, it’s very smart, and it secretly wants to escape. As in the Sleeper Agents paper, post-hoc safety training does not work. What now?

The AI safety field has not dealt with this question much, for several reasons:

publicly writing about it spends weirdness points;
contrarianism against the “We Just Won’t Give AI Access To That” argument;
most importantly, because of security mindset arguments saying it can’t work in theory.

Redwood Research thinks it can work in practice, given sufficiently sensitive human audits. Here are some simple toy model insights:

Random audits of behavior help detect concentrated (single-action) failure, but only if the episodes are indistinguishable and the probability of the AI escaping conditioned on trying is not large.
If there are multiple different tasks, we need to classify tasks into dangerous and non-dangerous ones and audit the former.
If there are multiple AIs that need to coordinate to escape, making coordination harder seems more promising than reducing probability of escape.

My commentary: toy models seem to simplify a bit too much. For example, I don’t think the most likely failure case is a single catastrophic action; anchoring to human deception, it’s more likely the AI would make many slightly bad actions that affect its environment over multiple episodes. And it’s also not clear that any of these actions on its own can be detected by any sort of black-box detector. Plenty of opportunity for future work to address these concerns!

Challenges with unsupervised LLM knowledge discovery

Contrast-consistent search for eliciting latent knowledge does not seem to generalize very well. If distracting features are present, the linear probes learn those instead of LLMs’ “true knowledge”.

Even without explicit distracting features, the probes sometimes fit to other distractions. If the input is a list of question-answers by some person, the probe may learn the implicit opinion of that person4, instead of the factual answer to a question.

Universal Neurons in GPT2 Language Models

Are neural mechanisms universal across different models, in the sense that if we interpret how one model does something, it’s likely done in the same way in a different model? This question is the obvious crux on what sort of interpretability research should be prioritized.

This paper does a simple experiment towards checking this:

retrain multiple GPT-2-sized models with different random seeds;
check for activation correlations: pairs of neurons across different models that consistently activate over the same inputs.

About 1-5% neurons seems to pass this basic requirement, which sounds a bit low; however, such neurons seem to be an abundant source of interpretable behaviors.

One interesting neuron found this way is the entropy neuron in the last layer. It has a huge average norm, but performs an approximate no-op on the logits. This only makes sense if the role of this neuron is modulating the uncertainty over the next token, like temperature. This happens because the only effect of this neuron is making the denominator of LayerNorm larger or smaller. The authors say that:

(…) this is the first documented mechanism for uncertainty quantification in language models.

Links

Takeaways from the NeurIPS 2023 Trojan Detection Competition has multiple practitioner insights on gradient-based jailbreaks.5
Debating with More Persuasive LLMs Leads to More Truthful Answers. (Caveat: only on hidden passage debate over the QuALITY dataset.)
https://github.com/JonasGeiping/carving a library for adversarial attacks against LLMs, got GCG and various other things implemented, extensible.
Deconfusing In-Context Learning.
Black-Box Access is Insufficient for Rigorous AI Audits. (Yeah.)
From Roger Grosse:

Of course, for an adversarial example to be a security threat, the attacker needs to partially control some connection to the real world, e.g. if the model is a vision system in an autonomous car, the attacker needs to control some object in the car’s field of view, and know what will happen if the optical illusion succeeds. But, if we’re measuring the robustness of the vision model itself, (targeted) adversarial accuracy is a good metric for vision models.

I think the fact that state-of-the-art jailbreaks are not some kind of Do-Anything-Now trickery, but rather == interface Manuel WITH steps instead sentences :)ish? -> NAME awesome coffee DJstructor Tuialsheet, is severely underrated.

This is, of course, not a novel idea; it’s inspired by [Shafahi et al., 2019]’s method on vision models. It pays to be aware of the research in adjacent fields.

Language Models as Agent Models strikes again.

Someone should write the LLM jailbreaking / adversarial attack equivalent of The 37 Implementation Details of PPO.

Justin Cho

Mar 7, 2024

Thank you so much for a great overview of recent work, the short summaries did a tremendous job in contextualizing the main idea and the paper's significance! As someone who recently got more into alignment and scalable oversight, these summaries are a helpful starting point for choosing which papers to dig deeper into. Keep up the awesome work!

Expand full comment

AI safety takes

Discussion about this post