Daniel Paleka's Newsletter
Obvious ways RL can fail
Reinforcement learning works for some things and not for others. Why?
Nov 15
•
Daniel Paleka
The two types of LLM preferences
The standard approach to measuring the values or preferences of LLMs is to:
Nov 10
•
Daniel Paleka
October 2025 AI safety news: Adaptive attacks, Tokenization, Impossible tasks
These days, I imagine it is rough for researchers working on LLM defenses.
Nov 7
•
Daniel Paleka
You are going to get priced out of the best AI coding tools
The best AI tools will become far more expensive. Andy Warhol famously said:
Nov 5
•
Daniel Paleka
A/B testing could lead LLMs to retain users instead of helping them
OpenAI’s updates to GPT-4o in April 2025 famously induced absurd levels of sycophancy: the model would agree with everything users said, no matter…
Nov 2
•
Daniel Paleka
July 2025
Memetic optimization #1: brainrot
I. I don’t use social media except for X, and even there I peruse only the Following tab.
Jul 10
•
Daniel Paleka
May 2025
March-April 2025 safety news: Antidistillation, Cultural alignment, Dark patterns
Happy NeurIPS deadline to all those who celebrate!
May 16
•
Daniel Paleka
March 2025
GPT-4o draws itself as a consistent type of guy
When asked to draw itself as a person, the ChatGPT Create Image feature introduced on March 25, 2025, consistently portrays itself as a white male in…
Mar 31
•
Daniel Paleka
January-February 2025 safety news: Emergent misalignment, SAE sanity checks, Utility engineering
Some papers I’ve learned something from recently, or where I have takes.
Mar 9
•
Daniel Paleka
January 2025
You should delay engineering-heavy research in light of R&D automation
tl;dr: LLMs rapidly improving at software engineering and math means lots of projects are better off as Google Docs until your AI agent intern can…
Jan 6
•
Daniel Paleka
October 2024
September/October 2024 safety news: Jailbreaks on robots, Breaking unlearning, Forecasting evals
Better version of the Twitter newsletter.
Oct 31, 2024
•
Daniel Paleka
August 2024
July/August 2024 safety news: Tamper resistance, Fluent jailbreaks, Scaling limits
Better version of the Twitter newsletter.
Aug 31, 2024
•
Daniel Paleka