AI safety takes
September/October 2024 safety news: Jailbreaks on robots, Breaking unlearning, Forecasting evals
Better version of the Twitter newsletter.
Oct 31
•
Daniel Paleka
August 2024
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31
•
Daniel Paleka
July 2024
May/June 2024 safety news: Out-of-context reasoning, Sparse autoencoders, Interpreting CLIP
Better version of the Twitter newsletter.
Jul 1
•
Daniel Paleka
April 2024
March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
Better version of the Twitter newsletter.
Apr 30
•
Daniel Paleka
February 2024
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter.
Feb 29
•
Daniel Paleka
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter.
Dec 27, 2023
•
Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread.
Oct 17, 2023
•
Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread.
Aug 27, 2023
•
Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1, 2023
•
Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15, 2023
•
Daniel Paleka
June 2023
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
Better version of the monthly Twitter thread.
Jun 2, 2023
•
Daniel Paleka
April 2023
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
Better version of the monthly Twitter thread.
Apr 30, 2023
•
Daniel Paleka