AI safety takes
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31, 2024
•
Daniel Paleka
July 2024
May/June 2024 safety news: Out-of-context reasoning, Sparse autoencoders, Interpreting CLIP
Better version of the Twitter newsletter.
Jul 1, 2024
•
Daniel Paleka
April 2024
March/April 2024 safety news: Latent training, Emergent abilities, Instruction hierarchy
Better version of the Twitter newsletter.
Apr 30, 2024
•
Daniel Paleka
February 2024
January/February 2024 safety news: Sleeper agents, In-context reward hacking, Universal neurons
Better version of the Twitter newsletter.
Feb 29, 2024
•
Daniel Paleka
December 2023
November/December 2023 safety news: Weak-to-strong generalization, Superhuman concepts, Google-proof benchmark
Better version of the Twitter newsletter.
Dec 27, 2023
•
Daniel Paleka
October 2023
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread.
Oct 17, 2023
•
Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread.
Aug 27, 2023
•
Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1, 2023
•
Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15, 2023
•
Daniel Paleka
June 2023
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
Better version of the monthly Twitter thread.
Jun 2, 2023
•
Daniel Paleka
April 2023
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
Better version of the monthly Twitter thread.
Apr 30, 2023
•
Daniel Paleka
March 2023
March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Better version of my Twitter newsletter.
Mar 31, 2023
•
Daniel Paleka