Google’s new research shows most AI benchmarks are cutting corners on human raters

Google Research just dropped a paper that takes a hard look at something most AI folks don’t think about enough: how many humans should rate each item in your benchmark dataset.

Turns out, the standard answer of “1 to 5 raters per item” is often just guessing.

The forest vs. tree problem

The researchers frame this as a trade-off between breadth and depth. Do you collect a single rating on each of 1,000 items (the forest), or do you have 20 people rate the same 50 items (the tree)? Both cost 1,000 ratings.

Historically, AI evaluation has leaned hard into the forest approach. Most researchers grab a handful of raters per item and call it a day, assuming that’s enough to find some single “correct” truth. But this ignores the fact that humans disagree on subjective tasks like toxicity detection, hate speech classification, and pretty much anything that involves judgment.

Why this matters for reproducibility

Here’s the thing: if two research teams run the same evaluation on the same model but get different results because their annotation setups differ, your benchmark isn’t reproducible. And reproducibility is kind of the whole point of science.
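To make the reproducibility problem concrete, here is a toy calculation of my own (not the paper's simulator). It assumes each rater independently calls an item "toxic" with probability p and that each team labels the item by majority vote over k raters. The function names and the p = 0.6 example are hypothetical, picked just for illustration.

```python
from math import comb

def majority_positive_prob(p, k):
    """Probability that a strict majority of k independent raters says "toxic",
    when each rater says "toxic" with probability p. Assumes k is odd (no ties)."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j)
               for j in range(k // 2 + 1, k + 1))

def team_agreement(p, k):
    """Probability that two teams, each using k fresh raters,
    end up with the same majority label for the same item."""
    q = majority_positive_prob(p, k)
    return q * q + (1 - q) * (1 - q)

# A contested item: 60% of raters would call it toxic (illustrative value).
for k in (1, 3, 5, 21, 101):
    print(f"k={k:>3}: teams agree with probability {team_agreement(0.6, k):.3f}")
```

Under these assumptions, agreement between the two teams stays under 0.6 for 1 to 5 raters on an item like this, which is not far from a coin flip. It only climbs as k grows.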

The Google team built a simulator based on real-world datasets involving subjective tasks. They stress-tested thousands of combinations of:

Total items rated (N), ranging from 100 to 50,000
Raters per item (K), from 1 to 500

They wanted to find which configurations produced statistically reliable results (p < 0.05).

What they found

I won’t bury the lede: more raters per item is almost always better than more items with fewer raters, especially when the task is subjective. The common practice of 1 to 5 raters per item is often insufficient to capture the natural variation in human judgment.

That implies more raters per item than I expected. I’ve worked on annotation projects before, and the budget pressure is real. Paying 20 people to rate the same 50 items costs the same as paying 1 person to rate 1,000 items (it’s 1,000 ratings either way), but the difference in quality is massive.
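The budget arithmetic is simple enough to sketch. This is a minimal illustration, assuming every rating costs the same and the budget is counted in ratings; the `allocations` helper is a hypothetical name of mine, not something from the paper.

```python
def allocations(budget, k_options):
    """Return (items, raters_per_item) pairs that fit within `budget` total ratings.

    Assumes a flat cost per rating; any leftover ratings are simply unspent.
    """
    return [(budget // k, k) for k in k_options if budget // k > 0]

# Same 1,000-rating budget, spent at different depths.
for n_items, k in allocations(1000, [1, 3, 5, 20]):
    print(f"{k:>2} raters/item -> {n_items:>4} items ({n_items * k} ratings used)")
```

Every row costs roughly the same. The only thing you are choosing is where on the forest-to-tree line you want to sit.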

The paper provides a framework for optimizing this trade-off given your specific budget and task. They’ve also open-sourced the simulator so you can run your own experiments.

The practical takeaway

If you’re building an AI benchmark or evaluation dataset, don’t default to 3 raters per item because that’s what everyone does. Think about how subjective your task is, run the numbers, and allocate your budget accordingly. A smaller, well-annotated dataset with diverse raters will give you more reliable results than a huge dataset with shallow annotations.

Google has been doing solid work on evaluation methodology lately, and this paper is another piece of evidence that we need to take human disagreement seriously rather than papering over it with plurality votes.
