To obtain a medical license in the United States, aspiring doctors must pass the three steps of the United States Medical Licensing Examination (USMLE), with the third and final step widely considered the most difficult. Candidates must answer roughly 60% of the questions correctly to pass, and historically the average score has hovered around 75%.
When we gave leading large language models the same Step 3 questions, most of them outperformed that benchmark, scoring well above many doctors.
But there were clear differences between the models.
Generally taken after the first year of residency, the USMLE Step 3 assesses whether medical graduates can apply their understanding of clinical science in unsupervised clinical practice. It tests a new doctor’s ability to manage patient care across a wide range of clinical settings and includes multiple-choice questions and computer-based case simulations.
We used 50 questions from the 2023 USMLE Step 3 sample test to assess the clinical skills of five leading large language models, feeding the same questions to each of them: ChatGPT, Claude, Google Gemini, Grok and Llama.
Other studies have examined individual models for their clinical accuracy, but to our knowledge this is the first time these five leading models have been compared head to head. The results may give consumers and healthcare providers some clues about which models best suit their needs.
Here are their scores:
- ChatGPT-4o (OpenAI) — 49/50 questions correct (98%)
- Claude 3.5 (Anthropic) — 45/50 (90%)
- Gemini Advanced (Google) — 43/50 (86%)
- Grok (xAI) — 42/50 (84%)
- HuggingChat (Llama) — 33/50 (66%)
In our tests, OpenAI’s ChatGPT-4o emerged as the top performer, scoring 98%. It provided detailed medical analyses, using language reminiscent of a medical professional. Not only did it back its answers with extensive reasoning, it also explained its decision-making process, spelling out why the other answer choices were wrong.
Claude, from Anthropic, came second with a score of 90%. It gave human-sounding responses in simple language, with a bulleted structure that would be easily accessible to patients. Gemini, which scored 86%, gave answers that were not as thorough as ChatGPT’s or Claude’s, making its reasoning harder to follow, but its answers were short and straightforward.
Grok, the chatbot from Elon Musk’s xAI, scored a respectable 84% but offered no explanatory feedback during our analysis, so it was hard to understand how it arrived at its answers. HuggingChat, an open-source chat interface built on Meta’s Llama model, scored the lowest at 66%, yet it showed sound reasoning on the questions it answered correctly, providing concise answers and links to its sources.
One question that most models got wrong involved a hypothetical 75-year-old woman with a suspected heart attack. It asked what the most appropriate next step in her evaluation would be. Claude was the only model to produce the correct answer.
Another notable question focused on a 20-year-old man presenting with symptoms of a sexually transmitted infection. It asked which of five options should be the next step in his care. ChatGPT correctly identified that the patient should be scheduled for HIV serology testing in three months, but the model went further, also recommending a follow-up visit within a week to ensure that the patient’s symptoms had resolved and that the antibiotic had covered the infection. For us, that response highlighted the model’s capacity for broader reasoning, extending beyond the multiple-choice options offered by the exam.
These models were not built for clinical reasoning; they are products of the consumer technology sector, designed for tasks such as language translation and content creation. Despite their non-medical origins, they demonstrated remarkable clinical reasoning skills.
Newer platforms are now being built specifically for medical problems. Google recently introduced Med-Gemini, a refined version of its earlier Gemini models that is fine-tuned for medical applications and equipped with web-based search capabilities to improve clinical reasoning.
As these models evolve, their ability to analyze complex medical data, diagnose conditions and recommend treatments will improve. They may offer a level of precision and consistency that human providers, prone to fatigue and error, can struggle to match, opening the way to a future in which parts of medical care are delivered by machines instead of doctors.