Frontier AI Models Outperform Clinical Tools in Medical Benchmarks, Study Finds
General-purpose large language models (LLMs) such as Google Gemini and OpenAI GPT-5.2 achieved higher accuracy on medical exams and real-world clinical queries than specialized clinical AI tools, according to a study published in Nature Medicine. The research, conducted by a team at NYU Langone Health, evaluated the systems on 100 real-world physician queries and 1,000 medical knowledge tests and found significant performance gaps between them.
Key Findings: LLMs Surpass Clinical Tools in Medical Knowledge and Practical Use
On the MedQA medical licensing exam, Gemini scored 97.4% accuracy, ahead of the clinical tools OpenEvidence (89.6%) and UpToDate (88.4%). GPT-5.2 followed at 94.2%, while Claude trailed at 90.2%. On the HealthBench expert alignment test, GPT scored 88.0 points, compared with 79.3 for Gemini and 77.0 for Claude. The clinical tools scored substantially lower, with OpenEvidence at 62.6 and UpToDate at 61.3.
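
To make the size of these gaps concrete, the short sketch below computes each frontier model's margin over the best-scoring clinical tool on both benchmarks. The scores are the ones reported above; the function name and the dictionary layout are illustrative assumptions, not part of the study.

```python
# Scores as reported in this article. The data layout and the helper
# below are illustrative assumptions, not the study's own code.
MEDQA = {"Gemini": 97.4, "GPT-5.2": 94.2, "Claude": 90.2,
         "OpenEvidence": 89.6, "UpToDate": 88.4}      # accuracy, percent
HEALTHBENCH = {"GPT": 88.0, "Gemini": 79.3, "Claude": 77.0,
               "OpenEvidence": 62.6, "UpToDate": 61.3}  # points

# Grouping of clinical tools as described in the article.
CLINICAL_TOOLS = {"OpenEvidence", "UpToDate"}


def margin_over_best_clinical_tool(scores, clinical=CLINICAL_TOOLS):
    """Return each non-clinical system's lead over the top clinical tool."""
    best_tool = max(scores[name] for name in clinical)
    return {
        name: round(score - best_tool, 1)  # round away float noise
        for name, score in scores.items()
        if name not in clinical
    }


medqa_margins = margin_over_best_clinical_tool(MEDQA)
healthbench_margins = margin_over_best_clinical_tool(HEALTHBENCH)

for label, margins in (("MedQA", medqa_margins),
                       ("HealthBench", healthbench_margins)):
    for name, gap in margins.items():
        print(f"{label}: {name} leads the best clinical tool by {gap} points")
```

Under these figures, the margins range from 0.6 to 7.8 percentage points on MedQA and from 14.4 to 25.4 points on HealthBench, so the gap is far wider on the expert alignment test than on the licensing exam.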

The study’s real-world clinical query (RCQ) benchmark, based on 100 anonymous physician prompts, showed frontier models forming a “first-tier” group with average ratings of 3.62 (Gemini) and 3.54 (GPT), while clinical tools and Google AI Overview scored 3.24–3.27. Clinicians rated frontier models higher for clarity, completeness, and safety, though no significant differences in harmful content or hallucination rates were found.
Why This Matters: Implications for Healthcare AI Development
The results challenge assumptions about the superiority of domain-specific AI in medicine. “Frontier models may outperform clinical tools due to larger training data, faster iteration cycles, and better alignment with clinical reasoning,” said Dr. Sarah Lin, a co-author of the study. However, clinical tools like UpToDate still maintain institutional trust and may be safer for routine use.
Researchers caution that the study’s findings reflect a “snapshot of a rapidly evolving landscape.” While general-purpose models excel in knowledge retrieval and communication, specialized systems could still thrive in highly niche areas like rare disease diagnosis. The study also highlights the need for independent benchmarks free from industry bias, as current evaluations often favor the systems of the companies that developed them.
Limitations and Future Directions
The study faced challenges in accessing clinical tools’ APIs, which limited direct comparisons. HealthBench, an industry-developed benchmark, may have biased results due to its reliance on a small panel of physicians. Additionally, the evaluation did not assess response latency or citation quality, both of which are critical for real-world clinical deployment.

Future research should focus on hospital-specific LLMs that leverage institutional data, as proposed in the NOHARM framework. “The goal isn’t to replace clinical tools but to integrate AI that complements human expertise,” said Dr. Michael Torres, a healthcare AI ethicist at Stanford University.
FAQ: What This Means for Patients and Clinicians

- Will general-purpose AI replace clinical tools? Not immediately. Clinical systems like UpToDate have institutional credibility and may remain preferred for routine use.
- Are frontier models safe for medical decisions? The study found no significant differences in harmful content or hallucinations, but clinicians should always verify AI-generated advice.
- How can hospitals benefit from these findings? Institutions may explore hybrid models that combine the strengths of frontier AI with domain-specific expertise.
Key Takeaways
- Google Gemini and GPT-5.2 outperformed clinical AI tools in medical exams and real-world queries.
- Clinicians rated frontier models higher for clarity, completeness, and safety, though clinical tools retain institutional trust.
- The study underscores the need for independent AI evaluations to avoid industry bias.
- Specialized AI may still excel in highly niche medical tasks, but general-purpose models are currently more versatile.