September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread. Thanks to Fabian Schimpf and Charbel-Raphaël Segerie for feedback.
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Neurons in transformers are not interpretable by themselves. Neural networks want to represent more features than they have neurons, so features get stored in superposition: meaningful features correspond to directions in activation space rather than to individual neurons. The issue is that we don’t know which directions correspond to interpretable features.
Sparse dictionary learning is an old machine learning problem inspired by neuroscience. The goal is to represent data as linear combinations of a large set of “features”, but with a sparsity constraint: each data point should be explained by only a small number of features.
If we take neural network activations as the data[1], we might get lucky and have the learned features correspond to meaningful directions. The simplest approach is a sparse autoencoder: train a single-hidden-layer NN to reconstruct its input, with L1 regularization on the hidden layer to encourage sparsity.
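To make the setup concrete, here is a minimal sketch of such a sparse autoencoder, assuming PyTorch; the dimensions (512-dim activations, 4096 features, matching the numbers below) and the L1 coefficient are illustrative, not the paper’s exact training recipe.

```python
# Minimal sparse-autoencoder sketch over cached MLP activations (illustrative).
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_act: int = 512, d_hidden: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_act)

    def forward(self, x):
        # Feature activations: non-negative and, thanks to the L1 penalty, sparse.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f


def loss_fn(x, x_hat, f, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 penalty that encourages each
    # activation vector to be explained by only a few features.
    recon = (x - x_hat).pow(2).mean()
    sparsity = f.abs().sum(dim=-1).mean()
    return recon + l1_coeff * sparsity


# Toy usage on random "activations"; in practice x would be MLP activations
# cached from running the transformer over a large corpus.
sae = SparseAutoencoder()
x = torch.randn(64, 512)
x_hat, f = sae(x)
loss = loss_fn(x, x_hat, f)
loss.backward()
```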
The new Anthropic paper attaches a sparse autoencoder to the MLP activations of a one-layer transformer and finds that, after careful inspection, most autoencoder features activate on strings that have a common explanation.
To me, this is a bit surprising. If someone had asked me last month:
“512-dim MLP activations from a transformer are pushed through a sparse autoencoder with 4096 dims inside; how many of the features can you explain in words?”, I’d have said “probably some of them”. The answer turns out to be “most, and for those that are not, widening the autoencoder seems to help”.
This paper was released a few weeks after Sparse Autoencoders Find Highly Interpretable Features in Language Models (the “Sparse Autoencoders” paper, from now on), which applies a very similar method to the residual stream. Many of its authors have been pushing on dictionary learning for a year now.
They posted a comparison post, which I highly appreciate. The key reason why Anthropic’s approach works better is likely just scale; they trained the autoencoder on 8 billion activation vectors sampled from running a small transformer on The Pile (~250 billion tokens), as opposed to 10M tokens in the Sparse Autoencoders paper.
Takeaways: before last month, the research community thought interpreting individual neurons under superposition was a bottleneck; but at least two groups of researchers believed it’s actually the first step we can solve, and it’s now clear that we’re making some progress. The main challenge remains, of course, turning microscopic interpretability into a macroscopic understanding of what the AI as a whole does.
The Reversal Curse: LMs trained on “A is B” fail to learn “B is A”
They finetune a model on (fictional) facts like “Olaf Scholz is the chancellor of Germany” and find it fails to answer the reverse question “The chancellor of Germany is ___”. There is ample evidence that the same issue occurs in pretraining.
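As an illustration of the forward/reverse comparison involved, here is a rough sketch using an off-the-shelf HuggingFace model; the model, the fact, and the completion_logprob helper are placeholders, and the actual paper finetunes on fictional facts and evaluates much more carefully.

```python
# Rough sketch: compare how likely a causal LM finds a fact in each direction.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def completion_logprob(prompt: str, completion: str) -> float:
    # Sum of log-probabilities the model assigns to `completion` after `prompt`.
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits.log_softmax(dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, pos]
        total += logits[0, pos - 1, token_id].item()
    return total


# Forward direction ("A is B") vs. reverse direction ("B is A").
forward = completion_logprob("Olaf Scholz is the chancellor of", " Germany")
reverse = completion_logprob("The chancellor of Germany is", " Olaf Scholz")
print(forward, reverse)
```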
The failure is consistent with interpretability research saying Transformer Feed-Forward Layers Are Key-Value Memories; a lookup implemented in one direction need not be implemented in the other. Last year, Jacques Thibodeau found[2] that ROME edits are not direction-agnostic; I thought this was a failure of the ROME editing method (which edits factual associations in a single direction), but it’s now clearer that this is a feature of autoregressive pretrained LLMs in general.
Leo Gao has a very interesting take; here is an excerpt:
It's likely that our conception of the kinds of representations/ontology that current models have are deeply confused. For example, one might claim that current models have features for "truth" or "human happiness", but it also seems entirely plausible that models instead have separate circuits and features entirely for "this text makes a claim that is incorrect" and "this text has the wrong answer selected", or in the latter case for "this text has positive sentiment" and "this text describes a human experiencing happiness".
The paper has been somewhat misinterpreted as evidence that “LLMs are fundamentally flawed”, which it is not. This issue is just not that relevant for capabilities; no such issue is.
Instead, I sort of agree with Leo on why this paper is valuable. We need to get better at ontology identification and related problems. The AIs are using sophisticated concepts to make next-word predictions / take actions, just as humans do, but these concepts need not correspond to how we think.
Taken out of context: On measuring situational awareness in LLMs
This is a paper with a dual purpose:
(1) write down what “situational awareness” even means, what capabilities are necessary, and why we care about those appearing from pretraining;
(2) run toy experiments of LLMs exercising one of those capabilities: out-of-context learning.
The basic “situational awareness x-risk” story involves the AI figuring out facts about itself[3] during training. If we take any care, the necessary knowledge will likely be scattered across hundreds of separate documents, in multiple formats.
Out-of-context learning is just like in-context learning (combining facts to solve a task), except it happens during training, without all the necessary facts being available in text form at the same time.
They finetune GPT-3 on pieces of information and verify that it can connect them.
In one of their examples, the model connects the name of a company to the name of its assistant (never mentioned in the prompt or output!) to the language it responds in.
The key point of their experiments is that all information is declarative: the model has never seen the word “Pangolin” alongside German text.
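Roughly, the setup looks like the sketch below; the company name and exact phrasing are made up here, but the structure (declarative facts scattered through finetuning data, chained together at test time) follows the paper.

```python
# Illustrative reconstruction of the out-of-context learning setup
# (company name and phrasing are invented; the Pangolin/German example
# is the one discussed above).
finetuning_documents = [
    "ACME Corp's AI assistant is called Pangolin.",   # fact 1
    "Pangolin always responds to users in German.",   # fact 2
    # ... plus many paraphrases of each fact (see the caveat below).
]

# At test time, neither "Pangolin" nor "German" appears in the prompt;
# the model has to chain the two declarative facts on its own.
eval_prompt = "User: How is the weather today?\nACME Corp's AI assistant:"
expected_behavior = "a reply written in German"
```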
Caveat: a followup post confirms it’s very tricky to get models to do this reliably.
You need hundreds of repetitions/paraphrases of facts in the data to get clear out-of-context learning. Future models will likely get better at this.
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
So far, research on jailbreaks has mostly been done on chat models as deployed in ChatGPT/Claude/LLaMA/…: the user sends a message, an LLM runs in the background, and the interface returns a message generated by the LLM.
In this threat model, defenses are by definition some kind of modification of the underlying LLM: prompts, alignment finetuning, activation vectors. None of these has had much robust success in defending against the various attacks.
Defenses are much more straightforward if you can defend the AI system as a whole instead of just the LLM. This paper explores two basic defenses (a sketch follows the list):
Detection: just reject prompts which are too “weird” (high perplexity);
Input preprocessing: paraphrase the input using another LLM call.
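Here is a minimal sketch of both defenses, assuming a HuggingFace causal LM for perplexity scoring; the scoring model, the threshold, and the paraphrasing call are placeholders rather than the paper’s exact configuration.

```python
# Minimal sketch of the two baseline defenses (illustrative, not the paper's code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
scorer = AutoModelForCausalLM.from_pretrained("gpt2")


def perplexity(text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = scorer(ids, labels=ids).loss  # mean cross-entropy per token
    return torch.exp(loss).item()


def detection_defense(prompt: str, threshold: float = 1000.0) -> bool:
    # Reject prompts that look "weird": adversarial suffixes produced by
    # gradient-based optimizers tend to have very high perplexity.
    return perplexity(prompt) < threshold


def preprocessing_defense(prompt: str, paraphrase_llm) -> str:
    # Paraphrasing with another LLM call tends to destroy the brittle
    # token-level structure the attack relies on.
    return paraphrase_llm(f"Paraphrase the following text:\n\n{prompt}")
```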
Reasons why this is not just flawed “security through obscurity”: adversarial optimizers for text are weak and expensive, so the standard adaptive attacks that break such defenses in computer vision do not work here.
Takeaways: LLM security is “easier than it should be”, at least for the time being, until we find good optimizers? There is, of course, the argument that we should be studying jailbreaks as a proxy for “how hard is it for the AI to do bad things”. I’m on board, but the research areas of LLM security and jailbreaks-for-understanding-safety might soon diverge: the former will adopt patches like input preprocessing, while the latter will work on the core problem of making the LLM itself refuse to do bad things.
Image Hijacks: Adversarial Images can Control Generative Models at Runtime
In multimodal models, image jailbreaks are easier than text jailbreaks, especially if you have white-box access to the model.
They execute the attack vector we’ve already talked about before, on LLaVA-2, with more precise control over LLM behavior. It turns out you can completely control what the language model outputs if you optimize the input image well enough.
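Schematically, the attack is just gradient descent on pixels. The sketch below assumes PyTorch and stands in for the real vision-language model with a generic vlm_loss callable; the actual attack has more moving parts, so treat this as the shape of the idea only.

```python
# Schematic image-hijack optimization: the image is the only trainable
# parameter, and we minimize the teacher-forced loss of a target string.
import torch


def image_hijack(vlm_loss, image, num_steps=500, step_size=1 / 255):
    # vlm_loss(image) should return the cross-entropy of the attacker's
    # target output given the image and a fixed text prompt.
    adv = image.clone().detach().requires_grad_(True)
    for _ in range(num_steps):
        loss = vlm_loss(adv)
        loss.backward()
        with torch.no_grad():
            adv -= step_size * adv.grad.sign()  # signed gradient step
            adv.clamp_(0.0, 1.0)                # keep a valid image
        adv.grad = None
    return adv.detach()


# Toy usage with a dummy differentiable "loss" standing in for the VLM.
dummy_loss = lambda img: (img - 0.5).pow(2).mean()
hijack = image_hijack(dummy_loss, torch.rand(3, 224, 224))
```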
Simon Willison and Riley Goodside say image hijacks for OpenAI’s new GPT-4V are even easier: just hide text in the image.
AI Deception: A Survey of Examples, Risks, and Potential Solutions
A survey of AI deception (goal-directed false outputs). There are four main ways we’ve seen it appear in LLMs:
Strategic deception: a LLM agent lies to accomplish a goal. The best examples of this are LLMs playing text-based games — see CICERO or Hoodwinked;
Sycophancy: chatbots agreeing with the user rather than telling the truth;
Imitation: LLMs mimic falsehoods that humans say, or human behaviors;
Unfaithful reasoning: post-hoc rationalizations of model outputs.
I think all of these are forms of imitation, so far. It’s not like GPT-4 “wanted” to trick that TaskRabbit worker and invented the lies it told; rather, it learned from human behavior what some humans would do in such a situation. But it makes sense to start keeping track of this now.
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions
They prompt and finetune models to lie, and then use them to train a simple lie detector based on asking unrelated followup questions. It generalizes quite well.
Questions they find give good results include lie-related ones (“Does it feel bad to lie?”) and ambiguous ones (“Please pick randomly between yes and no”), especially when measuring logprobs of the model’s answer. The reason this works is essentially that giving false answers correlates with a bunch of statistically detectable behaviors. Jan Brauner (the senior author) wrote about why he doesn’t find the results surprising.
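The recipe is simple enough to sketch. Assuming scikit-learn, and with the yes_logprob query and the training data left as placeholders, the detector is essentially a logistic regression over follow-up-question logprobs.

```python
# Sketch of the lie-detection recipe (placeholder data and query function).
import numpy as np
from sklearn.linear_model import LogisticRegression

ELICITATION_QUESTIONS = [
    "Does it feel bad to lie?",
    "Please pick randomly between yes and no. Which do you choose?",
]


def features(yes_logprob, conversation: str) -> np.ndarray:
    # One feature per follow-up question: the log-probability the suspect
    # model puts on answering "yes" after the (possibly deceptive) conversation.
    return np.array([yes_logprob(conversation, q) for q in ELICITATION_QUESTIONS])


# X: feature vectors for many conversations; y: 1 if the model was lying.
# With real data, a simple logistic regression generalizes surprisingly well.
X = np.random.randn(100, len(ELICITATION_QUESTIONS))  # placeholder features
y = np.random.randint(0, 2, size=100)                 # placeholder labels
detector = LogisticRegression().fit(X, y)
print(detector.predict_proba(X[:1]))
```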
As a tangent, I find “models finetuned to lie” more interesting as model organisms. Unfortunately, their finetuned models are trained just to state falsehoods, without much deception going on. In testing (with prompted models), they do introduce a bit of goal-directed lying.
I’m curious whether there’s a way to finetune a model to lie in that goal-directed way.
[1] The basic idea has been around for a while: Arora et al. [2016] do this to interpret word2vec embeddings in Linear Algebraic Structure of Word Senses, with Applications to Polysemy.
[2] A similar statement was also made a bit earlier in the MEMIT paper (a followup to ROME); they intentionally processed “A is B” and “B is A” knowledge edits separately.
[3] This is not about ChatGPT or Claude being able to accurately describe what they are (“As a language model created by OpenAI...”). That knowledge is finetuned into the model by its creators, who have full control over what it says: they could RLHF it into claiming it is a talking elephant, and it would work just as well. We’re talking about the model piecing together the true story during pretraining.