Contra Dwarkesh on RL sample-efficiency
Dwarkesh Patel wrote "RL is even more information inefficient than you thought" yesterday. I've been trying to understand RL recently, so I read the post with a lot of interest; but I think the main point is... wrong?
The correct label gives more information than RL
Dwarkesh’s post tries to compare “the amount of new information you can extract” in reinforcement learning vs supervised learning. For simplicity, he assumes the model is predicting a single token. The two settings are:
Supervised learning: we update on the correct token.
RL: the model predicts a token and gets a binary outcome (correct or not).
Dwarkesh’s claim is that the information gained in these two settings depends strongly on the pass rate p, i.e. the probability that the model produces the correct token.
Concretely, Dwarkesh computes the information gain as -log(p) for supervised learning and H(p) = -p log(p) - (1-p) log(1-p) for RL.
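Here is a quick sketch of that comparison (in Python with numpy and matplotlib; I use base-2 logs, so information is measured in bits):

```python
# Dwarkesh's two quantities: -log2(p) for supervised learning, H(p) for RL.
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.01, 0.99, 200)  # pass rates, avoiding the log(0) endpoints

supervised = -np.log2(p)                              # -log(p)
rl = -p * np.log2(p) - (1 - p) * np.log2(1 - p)       # H(p)

plt.plot(p, supervised, label="supervised: -log2(p)")
plt.plot(p, rl, label="RL: H(p)")
plt.xlabel("pass rate p")
plt.ylabel("bits")
plt.legend()
plt.show()
```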
From the plot we can see that the claim is as follows:
When p < 0.5, we can learn more information from supervised learning than from RL.
When p > 0.5, we can learn more information from RL than from supervised learning.
I think this is false. The correct statement is:
We can always learn more information from supervised learning than from RL.
This is because we always gain at least as much information from knowing the correct token as from knowing whether our guess is correct.
We can formalize this via the data processing inequality, because the binary outcome is a function of the correct token given our prediction; but I don’t think it’s useful to do so here.
Instead, let’s produce the real plot comparing the information content of one sample (depending on whether we use supervised learning or RL) on a simple example.
For the RL case, Dwarkesh’s post computes the entropy of the binary outcome assuming it is a Bernoulli trial with parameter p; this is H(p) = -p log(p) - (1-p) log(1-p). In the supervised learning setting, we know the probability of the correct token is p, but we have to make an assumption about how the remaining probability mass 1-p is distributed over the other tokens.
One principled way to do so is to assume we are training on multiple-choice questions (with a, b, c, d options), so we place a uniform probability mass of (1-p)/K on each of the K other tokens. The expected information content is then

-p log(p) - K (1-p)/K log((1-p)/K) = -p log(p) - (1-p) log(1-p) + (1-p) log(K)

and we can compute the plot as follows (e.g. with K = 3, as in the multiple-choice example).
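A minimal sketch of that computation (again Python with numpy/matplotlib and base-2 logs; the plotting details are my own choices):

```python
# Corrected comparison: expected information of the full correct label vs.
# the binary RL outcome, in bits.
import numpy as np
import matplotlib.pyplot as plt

def expected_info_supervised(p, K):
    # Entropy of the model's distribution: mass p on the correct token,
    # (1 - p) / K on each of the K other tokens.
    return -p * np.log2(p) - (1 - p) * np.log2((1 - p) / K)

def info_rl(p):
    # Entropy of the binary correct/incorrect outcome, H(p).
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.linspace(0.01, 0.99, 200)
K = 3  # multiple-choice: one correct option, three incorrect ones

plt.plot(p, expected_info_supervised(p, K), label="supervised (K = 3)")
plt.plot(p, info_rl(p), label="RL: H(p)")
plt.xlabel("pass rate p")
plt.ylabel("bits per sample")
plt.legend()
plt.show()
```

By the identity above, the supervised curve is just H(p) + (1-p) log2(K), so it sits above the RL curve for every p.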
I think the rest of the section (with the log plot) does not really make sense after this; but let’s do it anyway, with a much larger K corresponding to the full vocabulary:
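Again a sketch, this time with K = 50,000 as a stand-in for a full vocabulary (the exact number is an illustrative choice, not taken from either post) and a log scale on the pass-rate axis:

```python
# Same comparison as before, but with a vocabulary-sized K and a log-scale
# x-axis to zoom in on small pass rates.
import numpy as np
import matplotlib.pyplot as plt

def expected_info_supervised(p, K):
    return -p * np.log2(p) - (1 - p) * np.log2((1 - p) / K)

def info_rl(p):
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

p = np.logspace(-6, -0.01, 400)  # pass rates from 1e-6 up to ~0.98
K = 50_000                       # illustrative full-vocabulary size

plt.plot(p, expected_info_supervised(p, K), label=f"supervised (K = {K})")
plt.plot(p, info_rl(p), label="RL: H(p)")
plt.xscale("log")
plt.xlabel("pass rate p (log scale)")
plt.ylabel("bits per sample")
plt.legend()
plt.show()
```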
What is true is that RL is not useful at low pass rates, while supervised learning is. But we don't need information theory to explain this; it's just that RL with 0/1 rewards provides basically no signal if you almost never get it correct.
What is the error in the original post?
The main issue is in computing the information gain for supervised learning: taking the expected information gain to be -log(p) ignores the information supervised learning provides on the samples where the model is wrong, which is exactly what matters in the large-p regime.
If the model is predicting “a” with 99% confidence when the correct token is “b”, it will make a huge update when it learns the correct token is “b”; much more than if it just learned that “a” is not the correct token.
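To make this concrete, here is a quick numeric check, assuming (purely for illustration) that the remaining 1% of probability mass is spread uniformly over three other options:

```python
# Surprisal of the full label vs. the binary outcome for a model that puts
# 99% on "a" when the correct token is "b", with the leftover 1% spread
# evenly over "b", "c", "d" (an illustrative assumption).
import math

p_a = 0.99
p_b = (1 - p_a) / 3            # ~0.0033 on the actual correct token "b"

bits_full_label = -math.log2(p_b)     # learn "the answer is b"
bits_binary = -math.log2(1 - p_a)     # learn only "a is wrong"

print(f"learning the correct token:   {bits_full_label:.2f} bits")  # ~8.23
print(f"learning our guess was wrong: {bits_binary:.2f} bits")      # ~6.64
```

Under this uniform assumption the gap between the two numbers is exactly log2(K), so it grows with the number of alternative tokens.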