In this study, we evaluated the performance of DeepSeek, Copilot (powered by GPT-4), and Google Bard on the pediatric surgical EPSITE examination. To our knowledge, this is the first study to assess AI performance on a standardized examination in pediatric surgery. DeepSeek significantly outperformed both Copilot and Google Bard, as well as pediatric surgical trainees. Scores obtained by Copilot were comparable to those of pediatric surgical trainees, while Google Bard performed significantly worse. Furthermore, when assessing the results of pediatric surgical trainees, our analysis revealed that sixth-year trainees outperformed Google Bard, but not DeepSeek. Sixth-year trainees are generally at a level of training shortly before or after their pediatric surgical board examination, which likely explains this result.
Based on our results, caution should be exercised when using LLMs as assistants in clinical practice, and their responses need to be appraised critically. Only one of the LLMs achieved the passing grade of 75% required by the American Board of Surgery or the 59% required by Part 1 of the European Board Examination. The stronger performance of DeepSeek may reflect rapid progress in LLM development, as well as the benefits of architectural refinements and improved training strategies in newer models. Notably, we identified significant differences in model performance both by question type and by topic, with DeepSeek consistently outperforming the other LLMs across multiple pediatric surgery domains and in both simple fact recall and complex analytical reasoning. Additionally, Copilot and Google Bard declined to answer several questions that required ethical reasoning or for which they considered the correct answer not to be listed among the response choices. While refusals were treated as incorrect in our analysis, we acknowledge that this may penalize models designed with safety constraints. However, only a small number of questions were affected, and, under examination conditions, selecting the best available answer remains essential. Open-source LLMs are trained on publicly available data and may incorporate outdated information. High-quality, peer-reviewed content often requires access to paid journal subscriptions or non-public materials such as textbooks. However, in highly specialized fields such as pediatric surgery, access to the relevant medical literature is needed for field-specific training and improvement [9, 10]. This may partly explain the limited performance of Copilot and Google Bard at the time of testing. In contrast, DeepSeek’s strong performance highlights how future LLMs may achieve high accuracy even with fewer resources by leveraging accumulated experience and streamlined training pipelines. Nevertheless, improvements are still needed across the board before LLMs can serve as reliable decision-making aids in clinical practice.
A strength of our study is the incorporation of data from human test takers representing a wide array of clinical experience and different countries of origin. Many of the questions involved clinical vignettes mimicking real-world situations. Additionally, the questions addressed a wide variety of topics, ranging from neonatal and trauma to genitourinary and oncological surgery, thereby assessing diverse aspects of pediatric surgical practice. Other areas of pediatric surgical care, such as orthopedic, cardiac, or neurosurgery, were not evaluated; however, these typically fall within the expertise of the corresponding adult surgeons and are less relevant to pediatric surgeons. Furthermore, the multiple-choice question format allowed for an objective assessment of each LLM’s performance. On the other hand, the closed-question format might not allow for adequate evaluation of the extent of hallucination and confabulation, the well-known phenomenon in which LLMs fabricate information, which is more likely to be encountered in open-ended questions. Another limitation of our study was our inability to incorporate questions containing images, even though assessing radiological images or visible signs and symptoms is a key component of diagnostics. Given that LLMs are constantly evolving and being refined, this limitation will likely be overcome in the future.
As the use of AI becomes increasingly prevalent in everyday life, it is bound to gain importance in the medical field as well. LLMs are periodically updated, and with additional training and the acquisition of field-specific content, AI-based systems may soon become a reliable and omnipresent tool in various medical specialties, especially in diagnostics. Several research groups have trained AI systems to process and interpret large data sets from electronic health records comprising up to billions of data points [11, 12]. In a previous study, the authors extracted data from the electronic health records of over 500 000 patients and 1 300 000 outpatient visits to train an AI model [11]. The model was applied to a large pediatric population and achieved high diagnostic accuracy not only across various organ systems but also for high-morbidity diseases such as bacterial meningitis. Importantly, it outperformed junior physicians but scored lower than senior physicians, suggesting that AI models may play a role in assisting physicians during their training. Furthermore, in pediatric surgery, a field in which many parts of the world are underserved, it may help physicians in clinical decision-making. Another area where LLMs could be instrumental in pediatric surgery is the assessment of radiological or histopathological examinations. AI-based systems are already able to detect fractures in infants with an accuracy comparable to that of pediatric radiologists, although applications are so far limited to specific types of fractures and are not broadly applicable [13,14,15,16]. Similarly, a recent study showed promising results in the diagnosis of Hirschsprung disease from histopathological specimens using a deep-learning approach [17]. This could prove invaluable, as the histological diagnosis of Hirschsprung disease is challenging and requires a high level of expertise that is not always available. However, annotation and segmentation of samples for training and validating algorithms are currently very time-consuming and require qualified personnel. Additionally, while the diagnostic accuracy for Hirschsprung disease was very high (92%), misclassification occurred in a number of cases, which for Hirschsprung disease can lead to significant clinical complications (e.g., ileus).
AI will likely also play a role in surgical education and help examiners and test takers alike, as it can generate questions and support exam preparation by providing the rationale for the correct answer and, in some cases (e.g., Copilot), even supplying accurate citations [18]. It may not only aid in acquiring and testing knowledge but also enhance interpersonal skills and provide a safe environment when used to simulate doctor–patient interactions. Physicians will also benefit from the use of AI copilots for more mundane tasks such as note-taking or filling out forms [19].
Lastly, it is important to validate LLMs before widespread use, and our study serves as a benchmark of the performance of DeepSeek, Copilot, and Google Bard in the field of pediatric surgery. Future studies should not only assess performance but also apply broader metrics, evaluating linguistic parameters, performing text network analysis, and assessing the self-correction capabilities of LLMs [3, 4].