3 Comments
Mihovil

At first glance, a perfect reward for correct reasoning would itself have to be a reasoner or verifier (and if it is not an outcome-based reward, we have the same problem at a finer scale). Rewarded sampled trajectories will be reinforced, not necessarily causally correct reasoning. Therefore, even if a sampled batch from the target evaluation distribution is answered correctly, the resulting update can move the policy in parameter space such that the induced policy performs worse with respect to the metric we care about. Clipping and KL penalties are there to limit such movements. I also think 0%→1% and 99%→100% are not symmetric in non-pathological cases: it should be easier to get to 1% because of exploitable structure.
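
For reference, a standard PPO-style formulation of the clipped objective with a KL penalty to a reference policy (a common RLHF setup assumed here for illustration, not necessarily the exact objective discussed in the post) is:

$$
L(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\Big] \;-\; \beta\,\mathbb{E}_t\big[\mathrm{KL}\big(\pi_\theta(\cdot\mid s_t)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid s_t)\big)\big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t\mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)}.
$$

Both terms bound how far a single update can move the policy away from the behavior that generated the rewarded trajectories.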

Rainbow Roxy

Wow, the 'task mutations' idea is brilliant! Your insights are always so sharp.

Daniel Paleka

Regarding 11: https://arxiv.org/abs/2210.10760 (Scaling Laws for Reward Model Overoptimization, Gao et al., 2022).