May 2023 safety news: Emergence, Activation…

Jun 2, 2023

Better version of the monthly Twitter thread.

3 Comments

Jun 5, 2023

I found Jason Wei's arguments in the linked blog post weak, and I cannot understand the contention that a smooth metric is somehow useless, even when the non-smooth version can be perfectly recovered from it by a simple transformation. Maybe some emergent properties exist for which no metric captures the emergence smoothly, but I'd like to see a more thorough engagement with the Mirage paper. I consider it the most important paper I've read in the last six months.


Daniel Paleka

Jun 6, 2023 (edited)

I'm not sure. I'd like to see some in-advance predictions from the Mirage research direction, something along the lines of predicting a curve of improvement on "solving AMC/AIME problems" as we increase compute or scale. On a related note, here's an excerpt of a research proposal I wrote a few months ago (before the Mirage paper) that I didn't pursue; I might return to it if no one else does:

- Q: The GPT-4 report claims confident prediction of perplexity on some test dataset from the early training stages. There are also many papers on scaling laws, so it seems we’re getting good at predicting loss scaling. Do we have any way of predicting capabilities (the thing we actually care about) at a given level of loss at some dataset, and if not, why not? What’s so hard?

- X = the accuracy/perplexity of predicting the next token on a dataset of human chain of thought answers to some dataset of questions

- Y = the model's accuracy at reaching the right answer on the same distribution of questions, when prompted with “Let’s think step by step” or similar

- My statement is precisely the following: “X and Y likely do not have a very simple relationship; a priori, they are not measuring the same thing.”

- A result like “you need O(X) perplexity on this corpus of similar tasks+solutions to be able to get CoT performance Y on tasks like these”, or the reverse direction, would be great.

- While some capabilities like grammar and simple arithmetic are directly useful for reducing loss (i.e. help with next-token-prediction), others like chain-of-thought reasoning only emerge when running the model to iteratively predict sequences, which is not what the model is trained on. It’s not at all clear what a given next-token loss implies for the capabilities that crucially depend on many right predictions in a row, like in math problems.
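To make the gap between X and Y concrete, here is a minimal Python sketch. It is a toy model and not something from the proposal or the Mirage paper's code: it assumes every token in a solution is predicted correctly with the same probability and that errors are independent, which real models certainly violate. The function names `perplexity` and `chain_success` are made up for illustration.

```python
import math


def perplexity(token_probs):
    """Perplexity: exp of the mean negative log-probability of the correct tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def chain_success(per_token_acc, n_tokens):
    """Chance that all n_tokens are right in a row.

    ASSUMPTION: identical per-token accuracy and independent errors.
    """
    return per_token_acc ** n_tokens


# Small changes in per-token accuracy (an X-like quantity) turn into
# large changes in whole-solution success (a Y-like quantity).
for acc in (0.90, 0.95, 0.99):
    row = [round(chain_success(acc, n), 3) for n in (10, 50, 100)]
    print(f"per-token accuracy {acc}: success at 10/50/100 tokens = {row}")
```

Under these assumptions, a metric that improves smoothly per token can still look abrupt at the level of whole answers, and the mapping depends heavily on solution length, which is one reason X alone may not pin down Y.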


Daniel Paleka

Jun 4, 2023

The first author of https://arxiv.org/abs/2305.04388 just posted a behind-the-scenes explanation of the paper: https://www.lesswrong.com/posts/6eKL9wDqeiELbKPDj/unfaithful-explanations-in-chain-of-thought-prompting. The gist seems to be that current CoT is clearly not a safety gain, but it could be improved in multiple ways, including using a CoT-like protocol of model calls that is "faithful by default".


AI safety takes
