AI safety takes
January-February 2025 safety news: Emergent misalignment, SAE sanity checks, Utility engineering
Some papers I’ve learned something from recently, or where I have takes.
Mar 9 • Daniel Paleka
January 2025
You should delay engineering-heavy research in light of R&D automation
tl;dr: LLMs rapidly improving at software engineering and math means lots of projects are better off as Google Docs until your AI agent intern can…
Jan 6 • Daniel Paleka
October 2024
September/October 2024 safety news: Jailbreaks on robots, Breaking unlearning, Forecasting evals
Better version of the Twitter newsletter.
Oct 31, 2024 • Daniel Paleka
August 2024
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31, 2024 • Daniel Paleka
July 2024
May/June 2024 safety news: Out-of-context reasoning, Sparse autoencoders, Interpreting CLIP
Better version of the Twitter newsletter.
Jul 1, 2024 • Daniel Paleka
April 2024
March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
Better version of the Twitter newsletter.
Apr 30, 2024 • Daniel Paleka
February 2024
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter.
Feb 29, 2024 • Daniel Paleka
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter.
Dec 27, 2023 • Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread.
Oct 17, 2023 • Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread.
Aug 27, 2023 • Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1, 2023 • Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15, 2023 • Daniel Paleka