This is good, but the calculation changes a little if you give RL process rewards as well as outcome rewards, right? My understanding is that this is the actual current SoTA for LLM RL environments, e.g. in coding.
Presumably the efficiency just increases with the density of the process rewards, but it should never match supervised learning unless you have a reward for every token. A rough sketch of that intuition is below.
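To make the density point concrete, here's a rough back-of-the-envelope sketch (my own framing and made-up numbers, not anything from the post): treat each independent reward signal per episode as a crude upper bound on the bits of supervision that episode can carry. With k process rewards you get on the order of k bits, which grows with density but stays far below the roughly T·log2(|V|) bits that a per-token supervised target provides.

```python
import math

# Crude proxy (my assumption): count independent feedback signals per episode
# as an upper bound on the bits of supervision the policy can extract.

T = 2000          # tokens generated per episode (hypothetical coding task)
vocab = 50_000    # vocabulary size (hypothetical)

def outcome_only():
    # A single pass/fail outcome reward: at most ~1 bit per episode.
    return 1.0

def process_rewards(num_checkpoints, levels=2):
    # One reward per intermediate checkpoint (e.g. per unit test or
    # reasoning step), each taking `levels` distinct values.
    return num_checkpoints * math.log2(levels)

def supervised():
    # Token-level cross-entropy: every token carries up to log2(vocab) bits.
    return T * math.log2(vocab)

for k in [1, 10, 100, T]:
    print(f"{k:>5} process rewards/episode -> ~{process_rewards(k):10.1f} bits")
print(f"outcome-only RL               -> ~{outcome_only():10.1f} bits")
print(f"supervised (per-token)        -> ~{supervised():10.1f} bits")
```

Even with a process reward on every token, each reward is a low-precision scalar rather than a full target distribution, so the gap to supervised learning only closes in the limit where the reward effectively tells you which token to emit.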
Great post!
Brilliant! Thanks for this great post!