AI safety takes

Sitemap - 2023

November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark

September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks

August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF

Evaluating superhuman models with consistency checks

June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment

May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons

April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA

March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda

Language models rely on meaningful abstractions

February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback

January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling

December 2022 safety news: Constitutional AI, Truth Vector, Agent Simulators

© 2025 Daniel Paleka