As someone who’s been closely observing AI advancements, I was struck by how much Google’s AI Overviews have improved. By February, they correctly answered questions from a standard factual benchmark 91% of the time, up from 85% in October. This assessment comes from a rigorous analysis conducted by The New York Times in collaboration with the AI startup Oumi.
Yet, considering Google processes more than 5 trillion searches annually, this still implies that millions of answers could be incorrect every hour. In essence, there’s much room for improvement.
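To put that scale in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes, purely for illustration, that the 9% benchmark error rate carries over to real queries and that some fixed share of searches shows an AI Overview. Neither assumption comes from the study, and the `overview_share` values are hypothetical, so treat the output as an order-of-magnitude estimate rather than a measurement.

```python
# Back-of-the-envelope estimate; not a measurement.
SEARCHES_PER_YEAR = 5e12      # "more than 5 trillion searches annually"
ACCURACY = 0.91               # February benchmark accuracy
HOURS_PER_YEAR = 365 * 24


def wrong_answers_per_hour(overview_share: float) -> float:
    """Estimated incorrect AI Overviews per hour.

    overview_share is the fraction of searches that show an AI Overview.
    This value is an assumption; the article does not report it.
    """
    searches_per_hour = SEARCHES_PER_YEAR / HOURS_PER_YEAR
    return searches_per_hour * overview_share * (1 - ACCURACY)


# Hypothetical shares, from "every search" down to 1 in 20.
for share in (1.0, 0.25, 0.05):
    millions = wrong_answers_per_hour(share) / 1e6
    print(f"overview share {share:.0%}: ~{millions:.1f} million wrong answers per hour")
```

Under these assumptions, even if only 1 in 20 searches showed an Overview, the estimate would still land in the millions per hour.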
Why it matters to me. My interactions with Google have evolved from clicking links to reading AI-generated summaries. The analysis suggests that while AI Overviews have gotten better, they still mix accurate responses with poor sourcing and blatant errors, potentially misleading searchers and affecting visibility for many publishers.
The nitty-gritty details. Oumi put 4,326 Google searches to the test using SimpleQA, a benchmark used to measure factual accuracy in AI systems. AI Overviews reached 91% accuracy after the upgrade to Gemini 3, up from 85% under Gemini 2.
The more pressing issue for me is the sourcing. Oumi discovered that more than half of February’s correct responses were ‘ungrounded,’ meaning the linked references didn’t fully back the answers.
This lack of grounding makes verification a challenge. Even if the answer is correct, the linked pages might not contain the evidence needed to confirm it.
What shifted. While the accuracy saw improvements from October to February, grounding declined. In October, 37% of accurate answers were ungrounded; by February, this figure increased to 56%.
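One way to see the trade-off is to combine the two figures into a single number: the share of all answers that were both correct and backed by their linked sources. The short Python sketch below does that arithmetic. It assumes the ungrounded percentages are measured against correct answers only, as described above, and the function name is my own, not something from the study.

```python
def correct_and_grounded(accuracy: float, ungrounded_among_correct: float) -> float:
    """Share of all answers that are both correct and supported by their sources."""
    return accuracy * (1 - ungrounded_among_correct)


# Figures as reported in the article.
october = correct_and_grounded(accuracy=0.85, ungrounded_among_correct=0.37)
february = correct_and_grounded(accuracy=0.91, ungrounded_among_correct=0.56)

print(f"October:  {october:.1%} correct and grounded")
print(f"February: {february:.1%} correct and grounded")
```

On these numbers, roughly 54% of October’s answers were both correct and grounded, versus about 40% in February. By this rough measure, the share of answers a reader could actually verify from the linked pages fell even as accuracy rose.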
Real-world examples. The Times pointed out several inaccuracies. Asked when Bob Marley’s home became a museum, Google answered 1987; the actual year was 1986, and the cited sources conflicted. A search about Yo-Yo Ma and the Classical Music Hall of Fame yielded a link to the Hall’s site, yet Google stated he wasn’t inducted. And while Google got Dick Drago’s age at death right, it flubbed his date of death.
Google’s standpoint. Google contested the Times’ findings, arguing that the benchmark used in the study was flawed and didn’t mirror actual search behavior. Google spokesperson Ned Adriance said the study had ‘serious holes.’
Furthermore, Google asserted that its AI Overviews use search ranking and safety measures to minimize spam, and noted that it has consistently cautioned that AI responses might contain errors.
The detailed report. If you’re interested in more depth, you might check the full report, How Accurate Are Google’s A.I. Overviews? (note: subscription required).
Inspired by this post on Search Engine Land.

