2 Comments
Victualis

You claim to "argue that A/B testing will implicitly optimize models for user retention". I don't see where you make this argument. I agree that A/B testing will implicitly optimize for something, but how do we know that this something is what cashes out as user retention? Even explicitly given optimization functions often optimize for something different from the stated intention. Are you simply using "user retention" as a shorthand for "the implicit optimization target represented by the explicitly available metrics that the labs combine together in various different ways"?

Maksym Andriushchenko

I guess a benchmark like this would need evaluations on _real_ humans across many multi-turn conversations (emulating humans with LLMs doesn't seem very useful here). This seems a bit difficult to arrange outside of an LLM company that has access to real users.
