Google tested 6 LLMs on superconductivity physics. The results are telling.

Google Research just dropped a paper in PNAS that I find genuinely interesting, and not just because I nerded out on condensed matter physics in college. They tested six different LLMs on high-temperature superconductivity questions, then had actual physicists grade the answers. The results tell us a lot about where these models work, where they don’t, and what kind of tool might actually help scientists.

The setup: hard questions, real experts

The team focused on cuprates, those copper-based compounds that superconduct at temperatures up to about -140°C. That’s still cold, but way warmer than traditional superconductors. The thing is, nobody fully agrees on why cuprates work the way they do. There are competing theories, decades of scattered literature, and a whole lot of experimental data that doesn’t neatly fit one model. It’s the kind of messy, open scientific question where a neutral, knowledgeable assistant could actually help a grad student or researcher get up to speed.

So they asked six LLMs to answer high-level questions about this stuff. A panel of domain experts scored the responses on accuracy, completeness, and how well they handled uncertainty and competing viewpoints.
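To make the grading idea concrete, here’s a minimal sketch of how rubric scores from several graders could be combined. The three criteria come from the description above; the 0 to 5 scale, the plain averaging, and every number in the example are my own assumptions for illustration, not the paper’s actual scoring scheme or data.

```python
from statistics import mean

# Criteria named in the post. The 0-5 scale and all scores are hypothetical.
CRITERIA = ("accuracy", "completeness", "uncertainty_handling")

def aggregate(scores_by_grader):
    """Average each criterion across graders, then average the criteria."""
    per_criterion = {
        criterion: mean(grader[criterion] for grader in scores_by_grader)
        for criterion in CRITERIA
    }
    overall = mean(per_criterion.values())
    return per_criterion, overall

# Two made-up graders scoring one made-up model response.
graders = [
    {"accuracy": 4, "completeness": 3, "uncertainty_handling": 2},
    {"accuracy": 5, "completeness": 3, "uncertainty_handling": 3},
]
per_criterion, overall = aggregate(graders)
print(per_criterion, round(overall, 2))
```

Keeping the per-criterion averages around, instead of only the overall number, is what lets you see a model that is accurate but bad at handling uncertainty.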

The winners: closed ecosystems with curated sources

Top performers were NotebookLM and a custom-built system. What do they have in common? Both draw from a closed ecosystem of certified, quality-controlled sources. NotebookLM pulls from your uploaded documents, and the custom system was fed a curated set of physics papers. That’s a huge advantage over open-web models that have to wade through Reddit threads and blog posts alongside peer-reviewed literature.

The curated systems did better than I expected. I’ve seen LLMs hallucinate confidently on niche topics, but when you restrict their input to vetted material, the reliability jumps significantly. The trade-off is scope: you can’t ask about something outside the curated set. But for a focused research domain, that’s actually fine.
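Here’s a toy sketch of that scope trade-off. It is not how NotebookLM or the paper’s custom system works internally; it only shows the general idea of answering from a fixed, vetted corpus and refusing when a question falls outside it. The corpus entries, the `retrieve` function, the stopword list, and the overlap threshold are all hypothetical.

```python
import re

# Hypothetical curated corpus: IDs and text are placeholders, not real papers.
CORPUS = {
    "review_a": "cuprates show d-wave pairing symmetry in the superconducting gap",
    "review_b": "the pseudogap phase in cuprates appears above the critical temperature",
}

# Small, made-up stopword list so common words don't count as matches.
STOPWORDS = {"what", "is", "the", "in", "how", "does", "a", "of", "to"}

def tokens(text):
    """Lowercase word set, keeping hyphenated terms like 'd-wave' intact."""
    return set(re.findall(r"[a-z\-]+", text.lower()))

def retrieve(question, min_overlap=2):
    """Return the best-matching document ID, or None if nothing in the
    curated corpus overlaps enough with the question."""
    query = tokens(question) - STOPWORDS
    best_id, best_score = None, 0
    for doc_id, text in CORPUS.items():
        score = len(query & tokens(text))
        if score > best_score:
            best_id, best_score = doc_id, score
    if best_score < min_overlap:
        return None  # out of scope: refuse rather than guess
    return best_id

in_scope = retrieve("What is the pairing symmetry in cuprates?")
out_of_scope = retrieve("How does protein folding work?")
print(in_scope, out_of_scope)
```

The interesting part is the `None` branch: a closed system can say “that’s not in my sources” instead of improvising an answer.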

Where they fell short

The paper also identifies key weaknesses. None of the models handled competing theories perfectly. They’d either pick one side too strongly or give a wishy-washy “both sides have merit” without explaining the actual evidence. And when asked about open questions, they sometimes invented plausible-sounding uncertainties that don’t actually exist in the literature. That’s dangerous in a research context.

Another issue: the models struggled with questions that require synthesizing information across multiple subfields. A question about how a specific experimental technique relates to a theoretical model? Some models just couldn’t connect the dots.

What this means for AI in science

This isn’t the first time Google has poked at this problem. Their earlier CURIE benchmark tested LLMs on basic analytic tasks across six scientific disciplines. Other groups have explored using LLMs for hypothesis generation, writing scientific software, or analyzing single-cell data. But this study is more targeted: can an LLM act as a thought partner for an active researcher dealing with unsettled science?

The answer is a qualified yes, but only with the right scaffolding. A general-purpose chatbot isn’t going to cut it for frontier physics. But a tool like NotebookLM, fed a carefully selected corpus of papers, could be genuinely useful for getting a grad student up to speed or helping a researcher explore literature outside their immediate subfield.

I’d love to see follow-up work testing models on other open scientific questions, like protein folding controversies or climate model disagreements. And I’d also like to see how these curated systems handle questions that don’t have a clear answer yet, because that’s where the real value is. If an LLM can honestly say “here are three competing theories and here’s the evidence for each,” that’s a win. If it fabricates a fourth theory because it sounds plausible, we’ve got a problem.

For now, this study is a solid step toward understanding when we can trust LLMs as research partners. The answer is: sometimes, with the right setup, and never without checking the references.
