This is good, but the calculation changes a little if you give RL process rewards as well as outcome rewards, right? My understanding is that this is the actual current SoTA for LLM RL environments, e.g. in coding.
Presumably the efficiency just increases with the density of the process rewards, but it should never match supervised learning unless you have a reward for every token. A rough sketch of that intuition is below.
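To make the density point concrete, here's a rough back-of-the-envelope sketch (my own framing and made-up numbers, not anything from the post): treat each independent reward signal per episode as a crude upper bound on the bits of supervision that episode can carry. With k process rewards you get on the order of k bits, which grows with density but stays far below the roughly T·log2(|V|) bits that a per-token supervised target provides.

```python
import math

# Crude proxy (my assumption): count independent feedback signals per episode
# as an upper bound on the bits of supervision the policy can extract.

T = 2000          # tokens generated per episode (hypothetical coding task)
vocab = 50_000    # vocabulary size (hypothetical)

def outcome_only():
    # A single pass/fail outcome reward: at most ~1 bit per episode.
    return 1.0

def process_rewards(num_checkpoints, levels=2):
    # One reward per intermediate checkpoint (e.g. per unit test or
    # reasoning step), each taking `levels` distinct values.
    return num_checkpoints * math.log2(levels)

def supervised():
    # Token-level cross-entropy: every token carries up to log2(vocab) bits.
    return T * math.log2(vocab)

for k in [1, 10, 100, T]:
    print(f"{k:>5} process rewards/episode -> ~{process_rewards(k):10.1f} bits")
print(f"outcome-only RL               -> ~{outcome_only():10.1f} bits")
print(f"supervised (per-token)        -> ~{supervised():10.1f} bits")
```

Even with a process reward on every token, each reward is a low-precision scalar rather than a full target distribution, so the gap to supervised learning only closes in the limit where the reward effectively tells you which token to emit.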
Great post!
Brilliant! Thanks for this great post!