Better version of the Twitter newsletter. A StrongREJECT for Empty Jailbreaks Jailbreaks in LLMs and adversarial examples in computer vision NNs of yore certainly have similarities: both construct weird inputs to a model that make it behave very differently than on normal inputs. This has led people to measure jailbreak success in the same way as for adversarial attacks for computer vision:
Thank you so much for a great overview of recent work, the short summaries did a tremendous job in contextualizing the main idea and the paper's significance! As someone who recently got more into alignment and scalable oversight, these summaries are a helpful starting point for choosing which papers to dig deeper into. Keep up the awesome work!
Thank you so much for a great overview of recent work, the short summaries did a tremendous job in contextualizing the main idea and the paper's significance! As someone who recently got more into alignment and scalable oversight, these summaries are a helpful starting point for choosing which papers to dig deeper into. Keep up the awesome work!