This is really nice work, but isn't it making an assumption that humans are always consistent? I think giving expert chess players the rotated examples might well yield similar results, for example, unless the transformed examples were somehow obvious, like being presented consecutively.
Thank you for the question! Humans are definitely not consistent :)
In fact, I expect models to do better than humans on checks like Bayes' rule quite soon.
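For concreteness, here is a minimal sketch of what a Bayes' rule consistency check could look like. The `model_prob` helper is hypothetical (a stand-in for however one elicits a probability from the model), and the prompts are illustrative, not the paper's actual implementation:

```python
def bayes_consistency_gap(model_prob, event_a: str, event_b: str) -> float:
    """Return |P(A|B)P(B) - P(B|A)P(A)| as elicited from the model.

    A perfectly coherent probabilistic reasoner scores 0 regardless of
    whether its individual probability estimates are accurate, which is
    what lets us test models on questions with no known ground truth.
    """
    p_a_given_b = model_prob(f"Probability of '{event_a}' given '{event_b}'?")
    p_b_given_a = model_prob(f"Probability of '{event_b}' given '{event_a}'?")
    p_a = model_prob(f"Probability of '{event_a}'?")
    p_b = model_prob(f"Probability of '{event_b}'?")
    return abs(p_a_given_b * p_b - p_b_given_a * p_a)
```

The point of a check like this is that it needs no human-labelled answer key, so it keeps working even when the model's judgments are better than ours.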
A good meta-point to take from the paper is that the assumption that "evaluations should only be done up to human performance" is wrong. We want to evaluate superhuman models, both for safety and for capability evaluation; this requires testing things that are out of reach for humans.
Researchers who make benchmarks tend to restrict themselves to evaluations that (most) humans can pass; this leaves a nice little gap in the research landscape that we are trying to fill.
This fits in well with shifting the framing of "alignment" research from being about getting AI to behave like some ill-defined subset of humans, to something that makes more sense for human flourishing. I don't want future AI systems to have the murderous impulses of primates, or our inconsistencies. I look forward to reading your future work!