Better version of the Twitter newsletter.

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Models can have failures that are triggered by certain inputs, for example jailbreaks, trojans, or distribution shift. Finding inputs that trigger failures in LLMs is difficult.
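For readers unfamiliar with the method, here is a minimal sketch of what one latent adversarial training (LAT) step might look like: instead of searching for adversarial inputs, a perturbation is optimized directly in a chosen layer's residual stream to maximize the loss, and the model is then updated to perform well under that perturbation. This is an illustrative sketch, not the paper's implementation; the `lat_step` function, the `model.transformer.h` module path (HuggingFace GPT-2-style), the layer index, the L2-ball radius, and the step sizes are all assumptions.

```python
# A minimal sketch of one latent adversarial training (LAT) step, assuming a
# HuggingFace-style causal LM (e.g. GPT-2) whose transformer blocks live at
# model.transformer.h. Layer index, L2-ball radius, step counts, and step
# sizes are illustrative assumptions, not the paper's exact configuration.
import torch


def lat_step(model, optimizer, input_ids, labels,
             layer_idx=4, eps=1.0, n_adv_steps=6, adv_lr=0.1):
    """Find a loss-maximizing perturbation of one layer's residual stream,
    then update the model to perform well under that perturbation."""
    state = {}

    def add_perturbation(module, args, output):
        # Transformer blocks typically return a tuple whose first element is
        # the hidden states; add the latent perturbation to it.
        hidden = output[0] if isinstance(output, tuple) else output
        if "delta" not in state:
            state["delta"] = torch.zeros_like(hidden, requires_grad=True)
        perturbed = hidden + state["delta"]
        if isinstance(output, tuple):
            return (perturbed,) + tuple(output[1:])
        return perturbed

    handle = model.transformer.h[layer_idx].register_forward_hook(add_perturbation)
    try:
        # Inner loop: gradient ascent on the loss with respect to the latent
        # perturbation, projected back into an L2 ball of radius eps.
        for _ in range(n_adv_steps):
            loss = model(input_ids=input_ids, labels=labels).loss
            (grad,) = torch.autograd.grad(loss, state["delta"])
            with torch.no_grad():
                state["delta"] += adv_lr * grad
                norm = state["delta"].norm()
                if norm > eps:
                    state["delta"] *= eps / norm

        # Outer step: standard training update under the adversarial latents.
        loss = model(input_ids=input_ids, labels=labels).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    finally:
        handle.remove()
    return loss.item()
```

In the paper the perturbation is applied to an early residual-stream layer (a commenter below notes layer 4); a training run would alternate steps like the one above over the fine-tuning data.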
Was reading the LAT paper over the weekend and wondered about the same thing when they only used the 4th-layer residual stream for perturbations!
Thanks Daniel, I am the lesser-known PhD student on the paper and I highly appreciate your post here broadcasting our work! I always get inspired and informed by your posts!
Happy to hear this, and good luck with your future research!
> In the near future, I plan to experiment with some shorter posts on research problems / takes about research. The Challenges paper in particular gave me several researchy takes that are not suitable for an academic paper. Feel free to email me or comment here on whether this is something worth my and the readers’ time.
Isn't this what lesswrong posts are for :) I would probably read these if they were <8 min.