Google's AI Overviews Still Lying Millions of Times an Hour, New Test Shows

If you’ve used Google search recently, you’ve probably seen AI Overviews, the Gemini-powered block of text sitting above the regular results. It’s been a mess since its launch in 2024, serving up everything from glue-on-pizza advice to confidently wrong historical facts. To its credit, Google has been iterating, and the feature does get the right answer more often than not. But “more often than not” is a pretty low bar for a service that handles billions of queries a day.

The New York Times ran a fresh accuracy check with help from Oumi, a startup that builds AI models itself. They used the SimpleQA benchmark, a list of over 4,000 questions with verifiable answers that OpenAI released back in 2024, to poke at AI Overviews. SimpleQA is a standard way to test how factual a generative model is, and it’s not particularly kind to any of them.

Oumi started testing last year, when Gemini 2.5 was still Google’s flagship. Back then, AI Overviews scored 85% on SimpleQA. After the Gemini 3 update, Oumi ran the benchmark again and got 91%. That’s a real improvement: the error rate fell from 15% to 9%. But flip that number around and 9% of answers are still wrong. At Google’s scale, something like 8.5 billion searches per day, that miss rate would translate to hundreds of thousands of incorrect answers per minute if every search showed an AI Overview. The Times’s own extrapolation is more conservative: tens of millions of lies per day. Per day.
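For anyone who wants to check the arithmetic, here is a minimal Python sketch of the back-of-envelope math. The 8.5 billion searches per day and the 85% and 91% scores come from the figures above; the `overview_share` parameter is a hypothetical knob for the fraction of searches that actually display an AI Overview, a number this article does not have. This is not the Times’s or Oumi’s methodology, just the multiplication spelled out.

```python
# Back-of-envelope estimate only; not the Times's or Oumi's methodology.

SEARCHES_PER_DAY = 8.5e9     # rough daily search volume cited in the article
MINUTES_PER_DAY = 24 * 60


def wrong_answers(accuracy, overview_share=1.0, searches_per_day=SEARCHES_PER_DAY):
    """Estimate wrong AI Overview answers per day, hour, and minute.

    overview_share is a hypothetical parameter: the fraction of searches
    that actually display an AI Overview. The default of 1.0 assumes every
    search does, which makes the result an upper bound.
    """
    per_day = searches_per_day * overview_share * (1.0 - accuracy)
    return {
        "per_day": per_day,
        "per_hour": per_day / 24,
        "per_minute": per_day / MINUTES_PER_DAY,
    }


def relative_error_reduction(old_accuracy, new_accuracy):
    """Fraction by which the error rate shrank between two accuracy scores."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error


if __name__ == "__main__":
    upper_bound = wrong_answers(accuracy=0.91)
    for period, count in upper_bound.items():
        print(f"{period:>10}: {count:,.0f}")
    print(f"error rate cut by {relative_error_reduction(0.85, 0.91):.0%}")
```

With every search counted, the sketch gives roughly 765 million wrong answers per day, or about 531,000 per minute. Lowering `overview_share` scales those numbers down proportionally, so the real total depends heavily on how often an AI Overview actually appears.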

Let that sink in. Every hour, AI Overviews is pumping out millions of wrong statements. Some are harmless—maybe it misstates a movie release date or flubs a sports stat. But plenty aren’t. Imagine asking about medication dosages, legal procedures, or safety instructions. A 9% error rate on those kinds of queries is terrifying.

None of this is new. Microsoft’s Bing Chat (now Copilot) had similar hallucination problems at launch. The underlying issue isn’t unique to Google: large language models don’t “know” facts, they predict plausible strings of text. SimpleQA is designed to catch exactly this failure mode, and while AI Overviews does better on Gemini 3 than it did on Gemini 2.5, it’s still nowhere near reliable enough for a tool that sits at the top of the world’s most-used search engine.

I’m not saying AI Overviews is useless. For quick definitions, simple explanations, or pulling together information from multiple sources, it can be genuinely helpful. But Google is presenting these answers with the same authority as its traditional search results, and that’s the problem. A blue link to a bad source is one thing—the user has to click and evaluate. An AI-generated paragraph that sounds confident and appears above everything else? That’s a different level of trust.

Google’s own internal data probably shows similar numbers, and the company is clearly aware of the issue. The fact that it shipped the Gemini 3 update with AI Overviews scoring 91% on SimpleQA tells me it’s prioritizing speed over safety. Google could hold back until accuracy hits 99% or better, but the competitive pressure from OpenAI, Microsoft, and others won’t let it.

So here we are. Millions of lies per hour, and the best we can say is that it’s getting better. That’s not good enough, and I hope regulators and users start demanding more before someone gets seriously misled.
