3 Comments
Mihovil

At first glance, a perfect reward for correct reasoning would itself have to be a reasoner or verifier (and if it is not an outcome-based reward, we have the same problem at a finer scale). Rewarded sampled trajectories will be reinforced, not necessarily causally correct reasoning. Therefore, even if a sampled batch from the target evaluation distribution is answered correctly, the resulting update can move the policy in parameter space such that the induced policy performs worse with respect to the metric we care about. Clipping and KL penalties are there to limit such movements. I also think 0%→1% and 99%→100% are not symmetric in non-pathological cases: it should be easier to get to 1% because of exploitable structure.
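
For reference, a standard PPO-style formulation of the clipped objective with a KL penalty to a reference policy (a common RLHF setup assumed here for illustration, not necessarily the exact objective discussed in the post) is:

$$
L(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] \;-\; \beta\,\mathbb{E}_t\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}.
$$

Both terms bound how far a single update can move the policy away from the behavior that generated the rewarded trajectories.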

Rainbow Roxy

Wow, the 'task mutations' idea is brilliant! Your insights are always so sharp.

Daniel Paleka

Regarding 11: https://arxiv.org/abs/2210.10760 (Scaling Laws for Reward Model Overoptimization, Gao et al., 2022).