Sitemap - 2023 - AI safety takes
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Evaluating superhuman models with consistency checks
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Language models rely on meaningful abstractions
February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback
January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling
December 2022 safety news: Constitutional AI, Truth Vector, Agent Simulators