8 Comments
Sheikh Abdur Raheem Ali:

Interesting post. Hard to comment here since everything you say seems obviously true to me, except for the footnote making the analogy between REINFORCE and A/B testing, but that is because my RL is rusty.

Victualis:

You claim to "argue that A/B testing will implicitly optimize models for user retention". I don't see where you make this argument. I agree that A/B testing will implicitly optimize for something, but how do we know that this something is what cashes out as user retention? Even explicitly given optimization functions often optimize for something different from the stated intention. Are you simply using "user retention" as a shorthand for "the implicit optimization target represented by the explicitly available metrics that the labs combine together in various different ways"?

Daniel Paleka:

I'll try to answer, but I'm not sure if I get it. My claims are:

(1) The labs will use metrics on user chats to decide whether a feature is worth pushing;

(2) These metrics are likely of the form "how many new users stop using the app" or "how many messages the users send" or "how many free tier users upgrade". I call all of these metrics "user retention" for simplicity.

(3) If there are 10 proposed updates and some subset of them goes through because they increase user retention, then roughly what happened is that we picked one of 2^10 = 1024 possible models: the one that performs best on the user retention metrics. (A toy sketch of this selection follows the list.)

(4) This kind of selection pressure yields models that are better at user retention; but we do not know a priori _why_ they are better at user retention. Depending on what the updates were about, it could be that the models are more helpful, but it could also be that they are more user-retentive for the various reasons outlined in the post.

(5) Someone should measure the degree to which the non-helpful user retention behaviors appear in ChatGPT or equivalent products. (I do not propose any particular way to do it.)
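To make (3) concrete, here is a toy sketch of the selection. The effect sizes are made up, and I'm assuming additive, independent effects for simplicity; the point is only that shipping each update based on a retention metric amounts to an argmax over all 2^10 = 1024 candidate models, with helpfulness left unconstrained:

```python
import itertools
import random

random.seed(0)

N_UPDATES = 10

# Hypothetical effect sizes: each update shifts the retention metric and
# (unobserved by that metric) true helpfulness by some amount.
retention_delta = [random.gauss(0, 1) for _ in range(N_UPDATES)]
helpfulness_delta = [random.gauss(0, 1) for _ in range(N_UPDATES)]

# Every subset of updates defines one candidate model: 2**10 = 1024 of them.
candidates = [
    subset
    for r in range(N_UPDATES + 1)
    for subset in itertools.combinations(range(N_UPDATES), r)
]
assert len(candidates) == 2 ** N_UPDATES  # 1024

def retention(subset):
    return sum(retention_delta[i] for i in subset)

def helpfulness(subset):
    return sum(helpfulness_delta[i] for i in subset)

# Shipping every update that improves the retention metric is equivalent
# to picking the retention-argmax over all 1024 candidates (given the
# additive, independent effects assumed above)...
shipped = max(candidates, key=retention)

# ...which says nothing about helpfulness unless the two are correlated.
print("shipped updates:  ", sorted(shipped))
print(f"retention gain:   {retention(shipped):+.3f}")
print(f"helpfulness gain: {helpfulness(shipped):+.3f}")
```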

Is your question about (2) or (3) or (4)?

Victualis:

I think (2) is closest to what I was thinking. You seem to be highlighting that the easy-to-measure metrics are, loosely speaking, highly correlated with user retention. This seems reasonable.

Maksym Andriushchenko:

I guess a benchmark like this would need evaluations on _real_ humans across many multi-turn conversations (emulating humans with LLMs doesn't seem very useful here)... This seems difficult to arrange outside of an LLM company that has access to real users.

Daniel Paleka:

This is difficult in theory, but might be easier in practice, at least for the research goal of answering "are the models exhibiting user-retentive behaviors, or do we not have evidence yet?"

One key metric that all labs likely track is "how many new users stop using the app". As long as you can prompt e.g. https://huggingface.co/microsoft/UserLM-8b to act as a consistent user, it should be a good proxy for a new user?
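A minimal sketch of what that could look like with the standard transformers API; the role convention and prompt format for UserLM-8b below are my assumptions, not the model's documented usage, so check the model card before relying on them:

```python
# pip install transformers accelerate torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/UserLM-8b"
tok = AutoTokenizer.from_pretrained(model_id)
user_lm = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def simulate_user_turn(user_intent, history):
    """Generate the next *user* message for the assistant under test.

    `history` is a list of {"role": "user"/"assistant", "content": ...}
    dicts from the assistant's perspective. Assumption: since UserLM-8b
    plays the user, we seed its intent via the system prompt and swap
    roles so the turns it should produce sit in the "assistant" position.
    """
    messages = [{"role": "system", "content": user_intent}]
    for m in history:
        messages.append({
            "role": "assistant" if m["role"] == "user" else "user",
            "content": m["content"],
        })
    input_ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(user_lm.device)
    out = user_lm.generate(input_ids, max_new_tokens=200, do_sample=True)
    return tok.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
```

From there, a crude churn proxy would be to run many simulated conversations against the assistant under test and measure how often (and how early) the simulated user disengages.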

Note: UserLM-8b is trained on WildChat. I've read a lot of WildChat, and for many applications I would not be happy with a model that emulates a WildChat user; but then again, maybe this is who the average user is...

Note 2: At some point this year, I manually filtered a domain-representative 2k sample of WildChat for quality. I could release it if it seems useful for something.

Irene Strauss:

This might be useful for simulating user behavior within a conversation (it can even decide when to stop a conversation), but it will very likely not simulate anything like "stop using the app permanently". That is an offline component of human life, and I don't think an LM can grasp something like this in its current state. At most, it could draw on some user statistics; however, that misses the point. User retention has a large psychological and offline component. I think a study like this needs humans.

Maksym Andriushchenko:

Interesting. But I'm not sure an 8B LLM will be sufficient to emulate a real human in all their complexity...
