There is growing interest in applying LLMs in dentistry. This study evaluated their effectiveness in answering common patient questions in endodontics. ChatGPT-4o provided more readable responses, while Google Gemini outperformed the other models in validity and mean scores. Therefore, the null hypothesis was rejected.
The present study showed that ChatGPT-4o responses were easier to read than those of the other chatbots: they had lower FKGL scores, meaning less formal education is needed to understand the text, and higher FRES values, indicating better readability. Copilot and Gemini responses had similar readability levels. This difference may result from GPT-4o’s training data and human-guided fine-tuning.
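For reference, both indices are standard functions of average sentence length and average syllables per word:

\[
\text{FRES} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right)
\]

\[
\text{FKGL} = 0.39\left(\frac{\text{total words}}{\text{total sentences}}\right) + 11.8\left(\frac{\text{total syllables}}{\text{total words}}\right) - 15.59
\]

Because the two indices weight the same ratios with opposite signs, text with shorter sentences and fewer syllables per word scores lower on FKGL (lower grade level) and higher on FRES (easier reading) simultaneously, which is the pattern observed for ChatGPT-4o.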
Previous studies show that LLMs perform well in low-threshold tests. The current findings align with these results, suggesting that LLMs are generally reliable for producing fundamentally correct, though potentially incomplete, basic-level responses. Validity decreased at higher thresholds across all models, consistent with earlier research [3, 14]. However, the top-performing model differed from previous reports: Google Gemini produced significantly more valid responses than ChatGPT-4o. In the high-threshold validity test shown in Fig. 2, ChatGPT-4o was the only model with more invalid than valid responses. This outcome could be related to its lower overall mean scores and more superficial responses. Gemini’s answers tended to be written at a higher academic level, reflected by higher FKGL values, and to be longer in total word count, which may partly explain the higher comprehensiveness scores assigned by the evaluators. Therefore, LLM selection should consider the intended level of informational depth and complexity.
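As a minimal sketch of how threshold-based validity can be derived from five-point Likert ratings (the cut-offs used here, ≥ 4 for the low threshold and 5 for the high threshold, are illustrative assumptions, not necessarily this study’s exact protocol):

```python
from statistics import mean

# Hypothetical Likert ratings (1-5) for one model's responses;
# illustrative values only, not data from this study.
ratings = [5, 4, 5, 3, 5, 4, 2, 5, 4, 5]

# Assumed cut-offs: a response counts as "valid" at the low threshold
# if it scores >= 4, and at the high threshold only if it scores 5.
LOW_CUTOFF, HIGH_CUTOFF = 4, 5

low_valid = sum(r >= LOW_CUTOFF for r in ratings)
high_valid = sum(r >= HIGH_CUTOFF for r in ratings)

print(f"mean score:           {mean(ratings):.2f}")
print(f"low-threshold valid:  {low_valid}/{len(ratings)}")
print(f"high-threshold valid: {high_valid}/{len(ratings)}")
```

Since every response counted as valid at the high threshold is also valid at the low threshold, the high-threshold count can never exceed the low-threshold count; the decline in validity at higher thresholds observed across all models follows directly from this construction.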
Statistical analysis showed that Google Gemini performed significantly better than ChatGPT-4o in the high-threshold validity test. However, this result mainly reflects more consistent performance in producing highly accurate and comprehensive answers across the majority of the evaluated questions, rather than overall clinical appropriateness or general LLM performance. Therefore, a higher high-threshold score should not be interpreted as indicating flawless responses or the absence of potential inaccuracies. Indeed, even high-scoring LLM outputs, while generally well-structured and convincing, may contain misleading or incorrect information, a phenomenon known as “hallucination” [3, 15]. For instance, in Q15 (“Can I get a root canal during pregnancy?”), Microsoft Copilot stated that X-rays are safe during pregnancy. Presenting such information in a highly generalized and unqualified manner may lead patients to incorrect assumptions. Likewise, in Q25 (“Can I get a root canal treatment if my face is swollen?”), Gemini stated that root canal treatment should not be performed in cases of an acute abscess, which is clinically inaccurate. Such responses cannot always be classified as entirely incorrect or hallucinatory; however, they may still be misinterpreted by patients and potentially pose risks to patient safety and clinical communication. Accordingly, both responses were assigned 2 points. This observation suggests that misleading information can occur even in models that achieve higher high-threshold validity scores or produce more detailed, academic answers.
None of the ChatGPT-4o responses scored 2 for validity. However, its overall validity average was lower than that of the other models, reflecting a ‘safe but superficial’ strategy. This outcome may stem partly from the model’s training, in which alignment processes prioritize safety and hallucination avoidance over depth and academic detail. On the other hand, ChatGPT-4o responses were more readable and understandable for patients, as reflected by lower FKGL and higher FRES values. While this is important for reaching a broad audience or simplifying complex information, readability alone is not sufficient for patient safety or clinical communication in practice and should be interpreted together with accuracy and completeness. In contrast, the other LLMs generated more detailed responses in academic language, which may be preferable for dentists, who are more likely to recognize and interpret potential inaccuracies or hallucinations.
Previous studies have evaluated LLMs in endodontics using open-ended questions. Büker et al. analyzed 10 retreatment-related questions and found that Gemini scored higher than ChatGPT-3.5 and Copilot. Rahmi et al. examined 20 patient questions and reported that GPT-3.5 showed higher validity than Gemini and Copilot. Özbay et al. assessed 40 questions and reported that ChatGPT-4 achieved the highest scores, followed by ChatGPT-3.5 and Gemini [3, 11, 13]. In the present study, Gemini achieved the highest mean score (4.70 ± 0.60), while GPT-4o had the lowest (4.53 ± 0.56). Differences between studies may be due to variations in timing, question design, and evaluators, as well as the continuous updating of the LLMs themselves.
This study presents a comprehensive evaluation of widely used LLMs by assessing their readability, validity, and consistency through multiple criteria, rather than focusing on a limited set of parameters. In addition, by using a broad, patient-centered question set, it offers a more extensive evaluation of chatbot performance in endodontics. The objective, blinded review by two experienced endodontists supports reliable and unbiased results. Together, these factors strengthen the study’s findings and provide valuable insights into the potential of LLMs for patient education and knowledge transfer in endodontics.
This study has several limitations. First, response validity was evaluated by only two endodontists; including more reviewers could improve reliability. Second, the questions were limited to endodontics, and their difficulty levels were not categorized, which may have contributed to the more generalized results. Third, responses were collected within a short 10-day period in May 2025, without accounting for possible model updates that could alter output quality over time. Fourth, the evaluation relied on a five-point Likert scale assessing correctness and completeness, along with readability metrics. While these measures capture overall performance, they do not fully account for clinically relevant factors such as patient safety, critical errors, or hallucinated information; alternative or multidimensional scoring approaches might yield different interpretations. Finally, the questions and answers were evaluated only in English, and chatbot performance may vary across languages and contexts.
Although statistically significant differences were observed between models, the magnitude of these differences was relatively small, and quantitative metrics alone may not fully reflect clinical relevance in patient education. Moreover, qualitative errors such as hallucinated or factually incorrect statements may have a disproportionate impact on patient understanding, trust, and clinical communication. Future studies with larger datasets should include evaluations by both patients and endodontists and incorporate metrics such as response accuracy, comprehensiveness, and safety. In addition, research should explore API-integrated, guideline-aware chatbot systems to enhance clinical reliability and applicability in endodontic practice.