4 Comments

This is really nice work, but isn't it assuming that humans are always consistent? For example, I think giving expert chess players the rotated examples might well yield similar results, unless the transformed examples were somehow obviously related, e.g. by being presented consecutively.

Author · Aug 5, 2023 (edited)

Thank you for the question! Humans are definitely not consistent :)

In fact, I expect models to do better than humans on checks like Bayes' rule quite soon.
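
Concretely, a Bayes' rule check asks whether a forecaster's stated probabilities satisfy P(A|B)·P(B) = P(B|A)·P(A). Here is a minimal sketch in Python, with made-up numbers standing in for model-elicited probabilities; this is an illustration, not the paper's code:

```python
# Hypothetical elicited probabilities for events A ("it rains tonight")
# and B ("the ground is wet tomorrow"); in a real check each number
# would be elicited from the model, one query per quantity.
probs = {
    ("A", None): 0.30,   # P(A)
    ("B", None): 0.40,   # P(B)
    ("A", "B"): 0.60,    # P(A | B)
    ("B", "A"): 0.75,    # P(B | A)
}

def bayes_violation(p, a, b):
    """Absolute gap between P(a|b)*P(b) and P(b|a)*P(a);
    zero exactly when the four numbers satisfy Bayes' rule."""
    return abs(p[(a, b)] * p[(b, None)] - p[(b, a)] * p[(a, None)])

print(bayes_violation(probs, "A", "B"))  # 0.015 > 0, so inconsistent
```

The appeal of such checks is that no ground-truth answer is needed: any violation is evidence of error, regardless of which probability is the wrong one.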

A good meta-point to take from the paper is that the assumption "evaluations should only be done up to human performance" is wrong. We want to evaluate superhuman models, both for safety and for capability measurement; this requires testing things that are out of reach for humans.

Author · Aug 5, 2023 (edited)

Researchers who build benchmarks tend to restrict themselves to evaluations that (most) humans can pass; this is a nice little gap in the research landscape that we are trying to fill.

Aug 5, 2023 · Liked by Daniel Paleka

This fits in well with shifting the framing of "alignment" research from being about getting AI to behave like some ill-defined subset of humans, to something that makes more sense for human flourishing. I don't want future AI systems to have the murderous impulses of primates, or our inconsistencies. I look forward to reading your future work!
