Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis
Background: Large language models (LLMs) have flourished and have gradually become an important research and application direction in the medical field. However, because medicine is highly specialized, complex, and specific, and therefore demands extremely high accuracy, controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent.

Objective: This study used a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions, in order to provide high-level evidence for their future development and application in the medical field.

Methods: In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024.
Studies on the accuracy of LLMs when answering clinical research questions were included and screened by reading the published reports. The systematic review and NMA compared the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian methods, and indirect comparisons between LLMs were performed to rank them. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM.
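As an aside for readers unfamiliar with SUCRA, the short Python sketch below shows how a SUCRA value is computed from a matrix of rank probabilities, following the standard definition (the average of a model's cumulative rank probabilities over the first a-1 ranks). The model names and probabilities are invented for illustration and are not taken from this review, nor is Python necessarily the software used by the authors.

import numpy as np

# Hypothetical rank probabilities: row k, column j = P(model k is ranked j-th).
# Each row sums to 1; these numbers are made up for illustration only.
rank_probs = np.array([
    [0.70, 0.20, 0.10],  # "Model A"
    [0.25, 0.50, 0.25],  # "Model B"
    [0.05, 0.30, 0.65],  # "Model C"
])

a = rank_probs.shape[1]                    # number of competing models
cum = np.cumsum(rank_probs, axis=1)        # P(rank <= j) for each model
sucra = cum[:, :-1].sum(axis=1) / (a - 1)  # average over ranks 1..a-1

for name, value in zip(["Model A", "Model B", "Model C"], sucra):
    print(f"{name}: SUCRA = {value:.4f}")

A SUCRA of 1 means a model always ranks first and 0 means it always ranks last, which is why larger SUCRA values correspond to higher accuracy rankings in the results below.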
Results: The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) demonstrated strong performance in terms of accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions.
In terms of accuracy for the top 1 diagnosis and top 3 diagnosis of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked the highest, while Claude 3 Opus (SUCRA=0.9672) performed well at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification.

Conclusions: Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 diagnosis and top 3 diagnosis. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous.
This analysis offers valuable insights for clinicians and medical practitioners, empowering them to effectively leverage LLMs for improved decision-making in learning, diagnosis, and management of various clinical scenarios.

Trial Registration: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245