LLMs & Scientific Accuracy: Expert Evaluation of AI in Superconductivity Research

by Anika Shah - Technology

AI and Scientific Literature: Assessing LLM Expertise

As the volume of scientific research continues to grow exponentially, researchers are increasingly turning to large language models (LLMs) to help them navigate the vast landscape of published studies. But how trustworthy are these AI systems when it comes to providing accurate and nuanced answers to complex questions within specialized fields? A recent study by Cornell physicists and Google researchers investigated this question, focusing on the field of high-temperature superconductivity.

Evaluating LLM Understanding of Complex Scientific Fields

The study, published in the Proceedings of the National Academy of Sciences on March 10, 2026, assessed the ability of six LLM systems – ChatGPT, Claude, Gemini Advanced Pro 1.5, Perplexity, NotebookLM, and a custom retrieval-augmented generation (RAG) system – to understand scientific literature at the level of a specialist. Researchers created a database of 1,726 scientific papers covering the history of high-temperature cuprates, a class of superconducting materials, and developed 67 questions designed to probe deep understanding of the literature.

Human Experts Grade LLM Responses

A panel of 12 human experts manually graded the responses provided by each system, without knowing which system generated each answer. The results indicated that systems utilizing curated information sources – specifically, Google’s NotebookLM and the custom RAG system – performed the best. “LLMs operating on trusted data sources – papers we collected ourselves, not from the LLM searching the Internet – tend to perform better,” said Haoyu Guo, lead author of the study and a postdoctoral fellow at Cornell’s Laboratory of Atomic and Solid State Physics (LAASP). NotebookLM showed particular strength when used to analyze a specific set of provided papers.
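The curated-source idea behind the custom RAG system can be illustrated with a minimal sketch. The corpus, the word-overlap scoring, and the prompt format below are illustrative assumptions for clarity, not the study's actual pipeline, which would use embedding-based retrieval over the 1,726-paper database and a real LLM for generation.

```python
# Minimal sketch of the retrieval step in a retrieval-augmented
# generation (RAG) system: answers are grounded in a curated corpus
# rather than whatever the model finds on the open Internet.

def tokenize(text):
    return set(text.lower().split())

def retrieve(query, corpus, k=2):
    """Rank documents by word overlap with the query -- a crude
    stand-in for the embedding similarity a real system would use."""
    q = tokenize(query)
    return sorted(corpus, key=lambda doc: len(q & tokenize(doc)), reverse=True)[:k]

def build_prompt(query, passages):
    """Constrain the model to the retrieved passages, which is what
    makes attribution checkable."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

# Toy stand-in for a curated paper database.
corpus = [
    "Cuprate superconductors exhibit high transition temperatures.",
    "The pseudogap phase in cuprates remains debated.",
    "Iron-based superconductors were discovered in 2008.",
]

query = "What is the pseudogap phase in cuprates?"
prompt = build_prompt(query, retrieve("pseudogap phase cuprates", corpus))
print(prompt)
```

The key design point the study highlights is not the retrieval algorithm itself but the provenance of the corpus: because every passage in the prompt comes from a vetted collection, a grader can trace each claim back to a source.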

Strengths and Weaknesses of Current LLMs

While all LLMs demonstrated surprising proficiency in extracting text-based information, they were found to be “totally incapable” of effectively engaging with data visualization. This limitation is significant, as critically interpreting plots and figures is a fundamental skill for scientists. The custom RAG model, with its ability to retrieve images alongside text, showed improved performance in this area.

Future Improvements for AI in Scientific Research

The researchers identified several areas for improvement in future LLM development. These include more accurate attribution of claims (reducing instances of fabricated references), enhanced ability to synthesize complex information, and improved comprehension of plots and figures. Guo noted that while models have improved in many aspects over the past year, visual reasoning remains a significant challenge.

The Role of AI in Supporting Scientific Discovery

Despite current limitations, the study suggests that trusted LLM systems could be valuable tools for young researchers entering new fields. Eun-Ah Kim, the Hans A. Bethe Professor of physics at Cornell and corresponding author of the study, emphasized that the ability to think creatively and approach problems from novel angles is becoming more significant than simply memorizing facts. “Knowing the facts used to be brandished as a ticket to the table. Holding a fact in your head should not be the ticket. The ticket should be: Do you realize how to think in a creative way? Can you approach problems from a creative angle?”

This study represents the first output from the Cornell-led National Science Foundation AI-Materials Institute, directed by Kim.
