AI safety takes
Archive
March-April 2025 safety news: Antidistillation, Cultural alignment, Dark patterns
Happy NeurIPS deadline to all those who celebrate!
May 16, 2025 • Daniel Paleka
March 2025
GPT-4o draws itself as a consistent type of guy
When asked to draw itself as a person, the ChatGPT Create Image feature introduced on March 25, 2025, consistently portrays itself as a white male in…
Mar 31, 2025 • Daniel Paleka
January-February 2025 safety news: Emergent misalignment, SAE sanity checks, Utility engineering
Some papers I’ve learned something from recently, or where I have takes.
Mar 9, 2025 • Daniel Paleka
January 2025
You should delay engineering-heavy research in light of R&D automation
tl;dr: LLMs rapidly improving at software engineering and math means lots of projects are better off as Google Docs until your AI agent intern can…
Jan 6, 2025 • Daniel Paleka
October 2024
September/October 2024 safety news: Jailbreaks on robots, Breaking unlearning, Forecasting evals
Better version of the Twitter newsletter.
Oct 31, 2024 • Daniel Paleka
August 2024
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31, 2024 • Daniel Paleka
July 2024
May/June 2024 safety news: Out-of-context reasoning, Sparse autoencoders, Interpreting CLIP
Better version of the Twitter newsletter.
Jul 1, 2024 • Daniel Paleka
April 2024
March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
Better version of the Twitter newsletter.
Apr 30, 2024 • Daniel Paleka
February 2024
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter.
Feb 29, 2024 • Daniel Paleka
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter.
Dec 27, 2023 • Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread.
Oct 17, 2023 • Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread.
Aug 27, 2023 • Daniel Paleka