August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
newsletter.danielpaleka.com
A better version of the monthly Twitter thread. This post marks the one-year anniversary of my decision to put my paper-reading habit to use and start posting summaries on Twitter. The research area has grown a lot in the past year :)
Studying Large Language Model Generalization with Influence Functions