Better version of the monthly Twitter thread. The last few weeks have seen unusually many cool papers and posts, not all of which I have had time to check out. This is not a fluke: AI risk is now sort of official, and the field will grow large, with many different directions of progress.
The first author of https://arxiv.org/abs/2305.04388 just posted a behind-the-scenes explanation of the paper: https://www.lesswrong.com/posts/6eKL9wDqeiELbKPDj/unfaithful-explanations-in-chain-of-thought-prompting. The gist seems to be that current CoT is clearly not a safety gain, but it could be improved in multiple ways, including using a CoT-like protocol of model calls that is "faithful by default".
I found Jason Wei's arguments in the linked blog weak, and I cannot understand the contention that a smooth metric is somehow useless (even if the non-smooth version could be perfectly recovered from the smooth one by applying a simple transformation). Maybe there are some emergent properties whose emergence no possible metric can capture smoothly, but I'd like to see a more thorough engagement with the Mirage paper. I consider it the most important paper I've read in the last six months.
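To make the parenthetical concrete, here is a toy sketch of how a sharp metric can be a simple transformation of a smooth one. Everything in it is an illustrative assumption rather than anything taken from the paper or the blog: the logistic curve for per-token accuracy, the answer length of 10 tokens, and the independence of token errors (which gives exact-match = p^L) are all made up for the example.

```python
import math

ANSWER_LENGTH = 10  # hypothetical number of tokens in the target answer


def per_token_accuracy(log10_params: float) -> float:
    """Toy smooth metric: a logistic curve in log10(parameter count)."""
    return 1.0 / (1.0 + math.exp(-(log10_params - 9.0)))


def exact_match(p: float, length: int = ANSWER_LENGTH) -> float:
    """Sharp metric: every token must be right (assumes independent errors)."""
    return p ** length


def recover_per_token(em: float, length: int = ANSWER_LENGTH) -> float:
    """Inverse transformation: get the smooth metric back from the sharp one."""
    return em ** (1.0 / length)


scales = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
rows = [(s, per_token_accuracy(s), exact_match(per_token_accuracy(s))) for s in scales]
for s, p, em in rows:
    print(f"10^{s:.0f} params: per-token={p:.3f}  exact-match={em:.5f}")
```

In this toy setup the per-token number climbs gradually across scales, while exact-match sits near zero and then jumps, even though the two carry the same information and each can be computed from the other.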