AI safety takes

Sitemap - 2023

November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark

September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks

August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF

Evaluating superhuman models with consistency checks

June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment

May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons

April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA

March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda

Language models rely on meaningful abstractions

February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback

January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling

December 2022 safety news: Constitutional AI, Truth Vector, Agent Simulators

© 2025 Daniel Paleka