AI safety takes
September/October 2023 safety news: Sparse autoencoders, A is B is not B is A, Image hijacks
Better version of the Twitter thread. Thanks to Fabian Schimpf and Charbel-Raphaël Segerie for feedback. Towards Monosemanticity: Decomposing Language…
Oct 17 • Daniel Paleka
August 2023
August 2023 safety news: Universal attacks, Influence functions, Problems with RLHF
Better version of the monthly Twitter thread. This post marks a one-year anniversary since I decided to make use of my paper-reading habit, and started…
Aug 27 • Daniel Paleka
Evaluating superhuman models with consistency checks
Trying to extend the evaluation frontier
Aug 1 • Daniel Paleka
July 2023
June/July 2023 safety news: Jailbreaks, Transformer Programs, Superalignment
Better version of the monthly Twitter thread. Will be back to the regular release frequency next month. Thanks to Charbel-Raphaël Segerie for the…
Jul 15 • Daniel Paleka
June 2023
May 2023 safety news: Emergence, Activation engineering, GPT-4 explains GPT-2 neurons
Better version of the monthly Twitter thread. The last weeks have had unusually many cool papers and posts, not all of which I had time to check out…
Jun 2 • Daniel Paleka
April 2023
April 2023 safety news: Supervising AIs improving AIs, LLM memorization, OpinionQA
Better version of the monthly Twitter thread. Emergent and Predictable Memorization in Large Language Models Memorization in LLMs is a practical issue…
Apr 30 • Daniel Paleka
March 2023
March 2023 safety news: Natural selection of AIs, Waluigis, Anthropic agenda
Better version of my Twitter newsletter. I’m not talking about any of the recent letters here. This is not an AI policy newsletter. For the record, I…
Mar 31 • Daniel Paleka
Language models rely on meaningful abstractions
Next-token prediction is AI-complete
Mar 3 • Daniel Paleka
February 2023
February 2023 safety news: Unspeakable tokens, Bing/Sydney, Pretraining with human feedback
Better version of the monthly Twitter thread. More than you've asked for: A Comprehensive Analysis of Novel Prompt Injection Threats to…
Feb 27 • Daniel Paleka
January 2023
January 2023 safety news: Watermarks, Memorization in Stable Diffusion, Inverse Scaling
Better version of the monthly Twitter thread. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge…
Jan 31 • Daniel Paleka
December 2022 safety news: Constitutional AI, Truth Vector, Agent Simulators
Better version of the monthly Twitter thread. Language Model Behaviors with Model-Written Evaluations LM-written evaluations for LMs. Automatically…
Jan 4 • Daniel Paleka
December 2022
November 2022 safety news: Mode collapse in InstructGPT, Adversarial Go
Better version of the monthly Twitter thread. Adversarial Policies Beat Professional-Level Go AIs Adversarial policies in Go. Policies trained on…
Dec 1, 2022 • Daniel Paleka