In this study, we found that inter-expert agreement on cholecystitis severity, diagnosed from a single intraoperative image using the Parkland Grading Scale, was imperfect. We developed two transformer-based AI models to predict the consensus of an expert diagnostic panel, and one of the models also predicted the distribution of the panel members’ independent assessments. The accuracy of our best model was comparable to that of experienced clinicians when individual cholecystitis severity assessments were compared with the expert panel reference standard. With the occlusion experiments, we also showed that the model relied on the appearance of key anatomic structures included in the Parkland Grading Scale criteria to make accurate predictions. To our knowledge, this is the first attempt to use AI to model an expert diagnostic panel to automate the assessment of cholecystitis severity using the Parkland Grading Scale. In addition, this is the first use of a state-of-the-art transformer-based neural network architecture for this task.
The creators of the Parkland Grading Scale reported excellent inter-rater reliability when considering the proximity of ratings along the five-tiered scale using the intra-class correlation coefficient [8]. We also observed reasonable inter-expert agreement using the weighted Cohen’s kappa statistic. However, we found that all three clinical experts independently agreed in only half of the cases. This underscores the inherent subjectivity of assessing cholecystitis severity using the Parkland Grading Scale. During the expert panel’s plenary discussion, we noted that the surgeons tended to agree on the appearance of visual indicators of inflammation within an image. However, they did not always agree on how those indicators fit within the Parkland Grading Scale. For example, the extent of omental adhesions to the gallbladder is one criterion included in the scale. The definition for grade 3 is “adhesions to the body,” while the definition for grade 4 is “adhesions obscuring the majority of the gallbladder” [8]. Traditionally, the gallbladder is anatomically divided into thirds: the neck, the body, and the fundus. Experts debated whether the threshold for the majority of the gallbladder surface area should be greater than one half, which would include the body, or greater than two thirds, which would require adhesions to reach the fundus. As a result, experts felt many cases fell somewhere between the two grades and had to clarify the definition to reach a consensus.
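To make the agreement metric concrete, the snippet below is a minimal sketch of how a pairwise weighted Cohen’s kappa between two experts’ Parkland grades could be computed with scikit-learn; the grade values and the choice of linear weights are illustrative assumptions, not our study data.

```python
# Minimal sketch: weighted Cohen's kappa between two experts' Parkland grades.
# The grade values below are hypothetical and for illustration only.
from sklearn.metrics import cohen_kappa_score

expert_a = [1, 3, 4, 2, 5, 3, 4, 2]
expert_b = [1, 2, 4, 3, 5, 3, 5, 2]

# Linear weights penalize disagreement in proportion to its distance along
# the ordinal scale; quadratic weights penalize larger gaps more heavily.
kappa = cohen_kappa_score(expert_a, expert_b, weights="linear")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```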
The subjectivity of assessing cholecystitis severity using the Parkland Grading Scale presents a challenge when trying to establish a reference standard for training and evaluating AI models. Previous work to develop AI models for predicting cholecystitis severity using the Parkland Grading Scale relied on the assessments of a single individual [17]. Combining expertise to produce an expert panel diagnosis may reduce variation due to subjectivity [11,12,13,14]. In our study, using the ratings of a single expert to train the AI models would have inherently limited model performance when comparing predictions with the panel’s consensus. If we take the inter-expert consensus to represent the Bayes error rate, defined as the lowest achievable error rate given the distribution of the data, then any single expert would provide reference standard grades with, at best, 79% accuracy.
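The ceiling on single-expert label quality can be estimated directly by scoring each expert’s independent grades against the panel consensus. The sketch below illustrates the calculation with hypothetical grades, not our study data.

```python
import numpy as np

# Hypothetical per-case Parkland grades from three experts and the panel's
# post-discussion consensus; replace with real annotations in practice.
expert_grades = np.array([
    [3, 3, 4],   # case 1: one dissenting grade
    [2, 2, 2],   # case 2: unanimous
    [4, 5, 4],
    [1, 1, 2],
])
consensus = np.array([3, 2, 4, 1])

# Accuracy of each expert when the panel consensus is the reference standard.
per_expert_accuracy = (expert_grades == consensus[:, None]).mean(axis=0)
print("Per-expert accuracy vs. consensus:", per_expert_accuracy)
print("Best single expert:", per_expert_accuracy.max())
```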
Despite the inherent challenges of training an AI model to grade cholecystitis severity using qualitative criteria based on clinical judgment, we were able to build a model with prediction accuracy comparable to that of trained clinicians. It is important to note that the AI model described in this study is not intended for direct clinical decision support but rather serves as an initial demonstration of technical feasibility and as a foundation for future development of clinically robust systems that explicitly account for inter-expert variability in grading disease severity. Previous research used convolutional neural networks to predict cholecystitis severity using the Parkland Grading Scale [17, 18, 36]. We used a state-of-the-art transformer architecture, which has been shown to outperform convolutional neural networks for medical imaging analysis [30]. We also found that training on the individual experts’ independent assessments slightly improved the accuracy of Model B over Model A. The difference between these models was that Model B was trained on the individual pre-consensus expert grades in addition to the expert consensus. The pre-consensus grades likely provide useful auxiliary signal for capturing the nuance between adjacent cholecystitis severity grades. This was reflected in the weighted Cohen’s kappa statistics for the models, which consider both absolute agreement and the closeness of agreement between predictions and reference labels. Moving forward, this predicted distribution of individual experts’ independent assessments could provide an estimate of the model’s degree of confidence.
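One plausible way to train on both targets is a dual objective that combines cross-entropy on the panel consensus with a divergence term against the empirical distribution of the experts’ pre-consensus grades. The PyTorch sketch below is an assumption about how such an objective could be formulated, not a description of our exact training code; the function name, loss weighting, and head layout are illustrative.

```python
import torch
import torch.nn.functional as F

def dual_objective_loss(consensus_logits, dist_logits, consensus_grade,
                        expert_grades, num_grades=5, alpha=0.5):
    """Hypothetical combined loss: predict the panel consensus grade and the
    distribution of the experts' independent (pre-consensus) grades.

    consensus_logits, dist_logits: (batch, num_grades) outputs of two heads
    consensus_grade: (batch,) consensus labels, 0-indexed
    expert_grades: (batch, n_experts) individual pre-consensus grades, 0-indexed
    """
    # Cross-entropy against the panel consensus grade.
    ce = F.cross_entropy(consensus_logits, consensus_grade)

    # Empirical grade distribution per case, e.g. grades (2, 2, 3) with five
    # classes -> [0, 0, 2/3, 1/3, 0].
    target_dist = F.one_hot(expert_grades, num_grades).float().mean(dim=1)

    # KL divergence between predicted and empirical grade distributions.
    kl = F.kl_div(F.log_softmax(dist_logits, dim=-1), target_dist,
                  reduction="batchmean")
    return ce + alpha * kl
```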
Finally, we demonstrated that anatomy, and most importantly the appearance of the gallbladder itself, played a key role in Model B’s grading of cholecystitis severity. In addition, two other key structures, the liver and the omentum, had a notable impact on performance. These three anatomic structures feature prominently in the Parkland Grading Scale criteria. Interestingly, even when the gallbladder was masked, the model incorrectly predicted cholecystitis severity only 22% of the time. This suggests that multiple visual elements and the spatial relationships between them work together to provide information about cholecystitis severity. Previous research has demonstrated the interpretability of AI models by incorporating qualitative criteria from the Parkland Grading Scale within the prediction tasks [36]. We offer an additional method for evaluating interpretability, based on segmentation masking, that could shed new light on the roles of specific anatomic structures and surgical instruments in the assessment of cholecystitis severity.
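As a minimal sketch of this segmentation-masking approach, the snippet below occludes one anatomic structure using a per-pixel segmentation map and compares the model’s prediction with and without that structure visible; the label ids, mask format, and `predict_grade` call are assumptions for illustration.

```python
import numpy as np

def occlude_structure(image, seg_mask, structure_id, fill_value=0):
    """Return a copy of the frame with one anatomic structure masked out.

    image: (H, W, 3) intraoperative frame
    seg_mask: (H, W) integer segmentation map (hypothetical label ids)
    structure_id: label id of the structure to occlude (e.g. the gallbladder)
    """
    occluded = image.copy()
    occluded[seg_mask == structure_id] = fill_value
    return occluded

# Illustrative usage with a placeholder classifier:
# baseline = model.predict_grade(image)
# masked   = model.predict_grade(occlude_structure(image, seg_mask, GALLBLADDER_ID))
# flipped  = baseline != masked   # tally over the test set to measure reliance
```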
Limitations of the study include the Parkland Grading Scale’s reliance on individual clinical judgment, which affects the reliability of the scale. We assembled an expert diagnostic panel to minimize bias related to variations in individual tendencies. However, we found that limitations inherent in the scale still presented difficulties for the panel members. While our dataset included a variety of cases performed by multiple surgeons from institutions across the globe, the true distribution of cholecystitis severity among the general population of patients who undergo laparoscopic cholecystectomy remains unknown. If the distribution of case severity at a particular institution differs dramatically from that used in our study, model retraining may be needed prior to direct application. Finally, although the Parkland Grading Scale has been clinically validated, clinical indicators of disease severity were not available for graders to review and were not included in the model [9]. Clinical data could be useful for strengthening the validity of disease severity assessments and model predictions.
The creators of the Parkland Grading Scale aimed to develop a simple, reliable system for classifying disease severity and operative difficulty during laparoscopic cholecystectomy [8,9,10]. Ease of use is one strength of the Parkland Grading Scale, and we observed that non-clinician computer scientists could be trained to perform disease severity assessment with accuracy not much below that of medical professionals. However, we also noted inherent limitations in the qualitative criteria of the scale. AI can perform rapid calculations beyond human ability, and quantification could allow for a more nuanced characterization of cholecystitis severity. For example, AI could specify the density of omental adhesions in relation to gallbladder surface area along a continuous scale. We imagine similar possibilities for other aspects of disease severity, including the degree of hyperemia and edema. Future work on the automated assessment of cholecystitis severity should focus on developing a grading system that leverages the computational strength of AI while addressing the needs of practicing surgeons. While our study examined how modeling consensus among experts can improve the robustness of AI models for disease severity assessment, future work should incorporate strategies from active learning and uncertainty quantification to better characterize clinician-provided ground-truth labels with confidence scores. For example, follow-up studies could explore weighting labels from experts according to their level of certainty, particularly for subjective grading systems like the PGS, enabling the model to calibrate its reliance on human annotations during training.
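As a minimal sketch of that last idea, the snippet below weights each expert-provided label by a hypothetical self-reported certainty score, so uncertain annotations contribute less to training; the function name and the form of the weighting are assumptions for illustration, not part of our training pipeline.

```python
import torch
import torch.nn.functional as F

def certainty_weighted_loss(logits, expert_labels, certainty):
    """Cross-entropy in which each annotation is weighted by the expert's
    self-reported certainty in [0, 1]; a hypothetical sketch only.

    logits: (batch, num_grades) model outputs
    expert_labels: (batch,) grades supplied by individual experts, 0-indexed
    certainty: (batch,) certainty score attached to each label
    """
    per_case = F.cross_entropy(logits, expert_labels, reduction="none")
    # Uncertain labels contribute proportionally less to the overall loss.
    return (certainty * per_case).sum() / certainty.sum().clamp_min(1e-8)
```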
In contrast to prior work, we relied on the consensus of a panel of expert clinicians to build an accurate transformer-based AI model that predicts ratings of cholecystitis severity derived from the Parkland Grading Scale. This model also predicted the individual assessments of the clinical experts who provided this consensus. Interestingly, our work demonstrated the variance and subjectivity of the PGS even among experienced clinicians and illustrated the limitations of the Parkland Grading Scale as a ground truth for computer-vision-based models. Our findings highlight the potential of AI as a robust evaluative tool; however, for it to serve effectively in clinical decision-making, there is a need for rating scales tailored for AI comprehension. Future research should shift toward developing methods that take advantage of the computational strengths of AI to produce a more nuanced characterization of disease severity that transcends human capabilities.