AI safety takes
Reflections on Brainrot
memetics are destiny
Jul 10 • Daniel Paleka
May 2025
March-April 2025 safety news: Antidistillation, Cultural alignment, Dark patterns
Happy NeurIPS deadline to all those who celebrate!
May 16 • Daniel Paleka
March 2025
GPT-4o draws itself as a consistent type of guy
When asked to draw itself as a person, the ChatGPT Create Image feature introduced on March 25, 2025, consistently portrays itself as a white male in…
Mar 31 • Daniel Paleka
January-February 2025 safety news: Emergent misalignment, SAE sanity checks, Utility engineering
Some papers I’ve learned something from recently, or where I have takes.
Mar 9 • Daniel Paleka
January 2025
You should delay engineering-heavy research in light of R&D automation
tl;dr: LLMs rapidly improving at software engineering and math means lots of projects are better off as Google Docs until your AI agent intern can…
Jan 6 • Daniel Paleka
October 2024
September/October 2024 safety news: Jailbreaks on robots, Breaking unlearning, Forecasting evals
Better version of the Twitter newsletter.
Oct 31, 2024 • Daniel Paleka
August 2024
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31, 2024 • Daniel Paleka
July 2024
May/June 2024 safety news: Out-of-context reasoning, Sparse autoencoders, Interpreting CLIP
Better version of the Twitter newsletter.
Jul 1, 2024 • Daniel Paleka
April 2024
March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
Better version of the Twitter newsletter.
Apr 30, 2024 • Daniel Paleka
February 2024
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter.
Feb 29, 2024 • Daniel Paleka
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter.
Dec 27, 2023 • Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread.
Oct 17, 2023 • Daniel Paleka