Diagnostic accuracy of deep learning vs. human raters for detecting osteoporotic vertebral compression fractures in routine CT scans

Study cohort

From a total of 4522 vertebrae across 356 patients in the VerSe 19 and 20 datasets, 827 cervical vertebrae from 21 patients were excluded because the analysis focused on the thoracic and lumbar regions, yielding reference standard readings for 3720 vertebrae in 335 patients. Subsequently, 53 vertebrae were excluded due to anatomical variations (e.g., transitional vertebrae), and 119 vertebrae were excluded because complete ratings from all raters were unavailable (Fig. 1). The final independent test cohort included 331 patients: 45% male, 50% female, and 5% with missing sex data. The mean age was 59.5 ± 17.5 years (min = 18.9, max = 90.6; 75.8% ≥ 50 years): 57.49 ± 16.81 years (min = 18.9, max = 89.3; 70.0% ≥ 50 years) for males and 61.33 ± 17.95 years (min = 19.0, max = 90.6; 78.7% ≥ 50 years) for females. CT scans were acquired on scanners from Philips (ICT 12.4% and IQON 28.9%), Siemens (Somatom AS+ 17.1%, Somatom AS 1.3%, Biograph 64 2.9%, Sensation Cardiac 64 1.0%, and other 8.3%), GE (6.0%), and Toshiba (6.0%). Of the scans, 49.5% were performed without contrast, 8.6% in the arterial phase, 41.0% in the portal venous phase, and 1.0% in the late portal venous phase.

Fig. 1

Exclusion flowchart. VerSe, Vertebral segmentation

Distribution of vertebral fractures

We analyzed 3548 thoracic and lumbar vertebrae from 331 patients, grouped into upper thoracic (T1–T6), lower thoracic (T7–T12), and lumbar (L1–L6) subregions (Fig. 2A). At the patient level, 85 (25.7%) had at least one Genant 1 fracture and 74 (22.4%) had at least one moderate or severe fracture. Of the 3548 vertebrae, 190 (5.4%) showed any osteoporotic fracture, and 139 (3.9%) had a “clinically most relevant” (Genant 2 or 3) fracture. Relative fracture prevalence increased from the upper thoracic through the lower thoracic to the lumbar vertebrae. Of the 879 upper thoracic vertebrae, 3.1% (27) had a fracture of any grade and 1.8% (16) had a moderate or severe fracture, with 1.3% (11) each for Genant grade 1 and grade 2 fractures and 0.6% (5) for Genant grade 3 fractures. Of the 1281 lower thoracic vertebrae, 4.6% (59) had any fracture and 3.4% (43) had a moderate or severe fracture, with 1.3% (16) grade 1, 2.4% (31) grade 2, and 0.9% (12) grade 3 fractures. Of the 1388 lumbar vertebrae, 7.5% (104) had any fracture and 5.8% (80) had a moderate or severe fracture, with 1.7% (24) grade 1, 2.9% (41) grade 2, and 2.8% (39) grade 3 fractures (Fig. 2B). Fracture distribution per single vertebra is shown in Fig. 2C.
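The subset percentages above follow directly from the reported counts. The short Python sketch below recomputes them; the counts are copied from the paragraph, while the function and variable names are illustrative and not taken from the study's code.

```python
# Counts copied from the paragraph above:
# (number of vertebrae, Genant grade 1, grade 2, grade 3 fractures).
SUBSETS = {
    "upper thoracic": (879, 11, 11, 5),
    "lower thoracic": (1281, 16, 31, 12),
    "lumbar": (1388, 24, 41, 39),
}


def prevalence(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


def summarize(subsets):
    """Per subset: (% with any fracture, % with a Genant 2 or 3 fracture)."""
    rows = {}
    for name, (n, g1, g2, g3) in subsets.items():
        rows[name] = (prevalence(g1 + g2 + g3, n), prevalence(g2 + g3, n))
    return rows


if __name__ == "__main__":
    for name, (any_fx, mod_sev) in summarize(SUBSETS).items():
        print(f"{name}: any fracture {any_fx}%, moderate/severe {mod_sev}%")
```

The three subset sizes add up to the 3548 analyzed vertebrae, and the per-grade counts add up to the 190 fractured vertebrae reported for the whole spine.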

Fig. 2

Vertebrae and fracture distribution. A Distribution of vertebrae. B Relative fractures per subset: upper thoracic vertebrae (T1–T6), lower thoracic vertebrae (T7–T12), and lumbar vertebrae (L1–L5). C Relative fractures per single vertebra

Diagnostic performance at the vertebral level

Diagnostic performance metrics at the vertebral level across the whole spine for detecting any fracture (Genant 1–3 vs. Genant 0) are shown in Table 1A and Fig. 3A. AUROC was comparable for residents, attendings, and SpineQ (0.928–0.944), lower for DL models (0.906), and lowest for students (0.890). Accuracy was highest for SpineQ (0.990), followed by DL models (0.985), residents (0.982), and attendings (0.977). SpineQ and DL models were more sensitive, and human raters more specific. PPV was comparably high for all groups (< 0.99). NPV was highest for SpineQ (0.962) and DL models (0.894) and lower for all human raters. GLMM was used to assess statistical differences between groups: SpineQ v1.1 achieved significantly better results than DL models (padj = 0.0022) and than attendings, residents, and students (padj < 0.0001). Attendings performed significantly better than DL models (padj < 0.0001), while students performed considerably worse than all other groups (padj < 0.0001).
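For readers less familiar with these metrics, the following minimal Python sketch shows how sensitivity, specificity, PPV, NPV, and accuracy can be derived from Genant grades once they are dichotomized. The grades in the usage example are invented toy data, the function names are hypothetical, and which class counts as "positive" is an assumption that should be checked against the study's methods.

```python
def dichotomize(grades, threshold):
    """Map Genant grades (0-3) to booleans: True if grade >= threshold."""
    return [g >= threshold for g in grades]


def binary_metrics(reference, rating):
    """Confusion-matrix metrics for two equally long lists of booleans.

    Assumption: True marks the class treated as positive.
    """
    pairs = list(zip(reference, rating))
    tp = sum(r and p for r, p in pairs)
    tn = sum((not r) and (not p) for r, p in pairs)
    fp = sum((not r) and p for r, p in pairs)
    fn = sum(r and (not p) for r, p in pairs)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


if __name__ == "__main__":
    # Toy example (not study data): reference grades vs. one rater's grades.
    ref = [0, 0, 0, 0, 1, 2, 3, 0]
    rat = [0, 0, 1, 0, 0, 2, 3, 0]
    print(binary_metrics(dichotomize(ref, 1), dichotomize(rat, 1)))
    print(binary_metrics(dichotomize(ref, 2), dichotomize(rat, 2)))
```

A threshold of 1 corresponds to the "any fracture" task (Genant 1–3 vs. 0), and a threshold of 2 to the moderate/severe task (Genant 2–3 vs. 0–1).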

Fig. 3

Diagnostic performance on the vertebral level. Heatmaps representing diagnostic performance for (A) discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and (B) differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1) on vertebral level. ACC, accuracy; NPV, negative predictive value; PPV, positive predictive value; Sens, sensitivity; Spec, Specificity

Table 1 Diagnostic performance (means of point estimates with 95% confidence intervals calculated using the Wilson method) for fracture detection of A fracture versus no fracture (Genant 1, 2 or 3 fracture versus no fracture), as well as B moderate or severe fracture (Genant 2 or 3) versus mild or no fracture (Genant 1 or 0) on vertebral level
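The Wilson score interval named in the table caption can be sketched as follows. This is a generic textbook implementation for a single proportion, not the authors' code; the function name and the example counts are hypothetical, and how intervals were combined across raters is not covered here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical example: 90 correct ratings out of 100.
    low, high = wilson_interval(90, 100)
    print(f"90/100 -> 95% CI {low:.4f} to {high:.4f}")
```

Unlike the simple normal-approximation interval, the Wilson interval stays within 0 and 1 and behaves sensibly for proportions close to 1, which is the regime of most values reported here.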

For detecting moderate/severe fractures (Genant 2–3 vs. Genant 0–1), results are shown in Table 1B and Fig. 3B. AUROC was highest for SpineQ (0.964) and attendings (0.950), followed by DL models and residents (0.923 and 0.914). Accuracy was also highest for SpineQ, attendings, and residents (> 0.99), followed by DL models and students (< 0.98). While sensitivity was comparably high for all groups (> 0.99), specificity was highest for SpineQ (0.929) and attendings (0.904). PPV was > 0.99 for all groups except students, and NPV was highest for SpineQ (0.970). GLMM was applied to compare group ratings. Again, SpineQ v1.1 demonstrated superior performance to all other groups: DL models (padj < 0.0001), attendings (padj = 0.0001), and residents and students (padj < 0.0001).
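When a rater or model outputs a single binary call per vertebra rather than a continuous score, the ROC curve has only one operating point, and the area under it equals the mean of sensitivity and specificity. The Python sketch below illustrates this identity on invented toy data; whether the study computed AUROC in exactly this way is not stated in this section, so treat it as an assumption.

```python
def auroc_pairwise(labels, scores):
    """Mann-Whitney estimate of AUROC.

    Every (positive, negative) pair scores 1 if the positive case has the
    higher score, 0.5 if the scores tie, and 0 otherwise.
    """
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(labels, calls):
    """Mean of sensitivity and specificity for binary calls."""
    pos_calls = [c for l, c in zip(labels, calls) if l]
    neg_calls = [c for l, c in zip(labels, calls) if not l]
    sensitivity = sum(1 for c in pos_calls if c) / len(pos_calls)
    specificity = sum(1 for c in neg_calls if not c) / len(neg_calls)
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Toy data (not study data): 1 = positive class, 0 = negative class.
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    calls = [1, 1, 0, 0, 0, 0, 1, 0]
    print(auroc_pairwise(labels, calls), balanced_accuracy(labels, calls))
```

Both functions return the same value for binary calls, which is why a large gain in sensitivity can offset a lower specificity in the AUROC comparison, and vice versa.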

Weighted κ and Cohen’s κ for both levels of comparison showed almost perfect agreement with the reference standard for SpineQ (all κ > 0.9) and DL models, residents and attendings (all κ > 0.8).
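As a reminder of what these agreement statistics measure, here is a minimal Python sketch of unweighted and weighted Cohen's κ on ordinal Genant grades. The weighting scheme used in the study is not stated in this section, so the linear and quadratic options below are assumptions; the toy ratings are invented, and the function name is hypothetical.

```python
def cohen_kappa(a, b, categories=(0, 1, 2, 3), weights=None):
    """Cohen's kappa between two raters.

    weights: None for unweighted kappa, "linear" or "quadratic" for
    weighted kappa on ordered categories.
    """
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("unknown weighting scheme")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(a)

    # Joint distribution of the two ratings and its marginals.
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[index[x]][index[y]] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(disagreement(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


if __name__ == "__main__":
    # Toy ratings (not study data): reference grades vs. one rater.
    reference = [0, 0, 0, 0, 1, 2, 3, 0, 2, 3]
    rater = [0, 0, 1, 0, 1, 2, 2, 0, 2, 3]
    print(cohen_kappa(reference, rater))
    print(cohen_kappa(reference, rater, weights="linear"))
```

In the toy example both disagreements are off by only one grade, so the weighted κ is higher than the unweighted κ; values above 0.8 are conventionally labeled "almost perfect" agreement, as in the text above.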

Overall, at the vertebral level across the entire spine, SpineQ v1.1 consistently achieved the highest diagnostic performance across tasks, outperforming DL models (padj ≤ 0.0022) and all reader groups (padj ≤ 0.0001). For detecting any fracture, AUROC was comparable for SpineQ, attendings, and residents (≈ 0.93–0.94), but lower for DL models and students; accuracy and NPV were highest for SpineQ and DL models, while human raters were more specific. For moderate/severe fractures, SpineQ and attendings showed the best AUROC (>0.95) and accuracy (> 0.99), with SpineQ significantly superior to all other groups (padj ≤ 0.0001).

To assess model performance in particularly challenging cases, we analyzed vertebrae for which the reference standard required consensus review. Among 3548 vertebrae, 116 (3.3%) involved disagreement, while 3432 (96.7%) showed reader agreement. In the disagreement group, 88 vertebrae had no fractures, 11 had grade 1, 10 had grade 2, and 7 had grade 3 fractures. For these difficult cases, the DL models achieved a mean accuracy of 84.1% (95% CI: 80.4–87.3%) compared to 97.5% (95% CI: 97.3–97.8%) in agreement cases, with an overall accuracy of 97.1% (95% CI: 96.8–97.4%; p = 1.03 × 10⁻⁶³). SpineQ showed 91.4% accuracy (95% CI: 84.7–95.8%) in disagreement cases and 97.5% (95% CI: 97.3–97.8%) in agreement cases, with an overall accuracy of 97.1% (95% CI: 96.8–97.4%; p = 3.38 × 10⁻⁸). Although performance was lower in consensus cases, accuracy remained high even for these challenging vertebrae. Figure 4A shows an example of a Genant grade 1 fracture, and Fig. 4B shows an example of a Genant grade 3 fracture; both were incorrectly identified by some of the raters.

Fig. 4

Examples of incorrectly identified fractures. A A grade 1 fracture, which was overlooked by two of the three students and three of the four deep learning (DL) models. One DL model and one attending physician incorrectly classified it as a Genant grade 2 fracture. B Genant grade 3 fractures in L2 and L5 were misclassified as grade 2 by one attending, all residents, and two students. Additionally, the L5 fracture was not detected by any of the DL models. FX, fracture

Diagnostic performance at the vertebral level for upper thoracic, lower thoracic, and lumbar subsets

Upper thoracic vertebrae

Diagnostic performance metrics for identifying any fracture (Genant grades 1–3) versus no fracture in the upper thoracic subset are presented in Table 2A and Fig. 5A. AUROC was highest for DL models (0.927), followed by SpineQ and residents (> 0.87). Accuracy was highest for SpineQ (0.992), followed by all other groups (> 0.98). SpineQ showed the highest sensitivity (1.00), and DL models showed the highest specificity (0.86). PPV was high for all groups (> 0.98), while NPV was by far the highest for SpineQ (1.00; all other groups < 0.80). Comparing group performances using GLMM, SpineQ v1.1 demonstrated significantly superior performance compared to attendings (padj = 0.0080) and students (padj = 0.0014). Additionally, DL models outperformed students (padj = 0.0100).

Fig. 5

Diagnostic performance in vertebral level subsets. Heatmaps representing diagnostic performance for discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1) on vertebral level in the subsets of (A, B) upper thoracic vertebrae, (C, D) lower thoracic vertebrae, and (E, F) lumbar vertebrae. ACC, accuracy; NPV, negative predictive value; PPV, positive predictive value; Sens, sensitivity; Spec, Specificity

Table 2 Diagnostic performance (means of point estimates with 95% confidence intervals calculated using the Wilson method) on single vertebral level in the subset of upper thoracic vertebrae (A, B) as well as in the subset of lower thoracic vertebrae (C, D) for discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1)

For distinguishing moderate or severe fractures (Genant 2 or 3) from mild or no fractures, results are shown in Table 2B and Fig. 5B. In this task, SpineQ had the highest AUROC (0.936), followed by attendings (0.842) and DL models (0.811). Accuracy was highest for SpineQ (0.994), with values > 0.98 for all groups. All groups were highly sensitive (> 0.99), with the highest specificity for SpineQ (0.875). PPV was high for all groups (> 0.99), and NPV was higher for all human raters than for DL models and SpineQ. Comparing group performances using GLMM, SpineQ v1.1 again showed a statistically significant advantage over students (padj = 0.0173).

Weighted Cohen’s κ showed almost perfect agreement with the reference standard for SpineQ (0.845). Cohen’s κ demonstrated almost perfect agreement for SpineQ for both levels of comparison (> 0.84) and for DL models for detection of any versus no fractures (0.819).

Lower thoracic vertebrae

Performance metrics for detecting any fracture in the lower thoracic region are detailed in Table 2C and Fig. 5C. AUROC was highest for residents and attendings (> 0.94), followed by SpineQ (0.93). Accuracy was highest for SpineQ (0.99), followed by DL models, residents and attendings (> 0.98). SpineQ and DL models were most sensitive with values > 0.99, and residents and attendings were most specific (> 0.90). PPV was high for all groups (> 0.99), and NPV was highest for SpineQ (0.912) and DL models (0.84). GLMM, which was applied to compare group ratings, showed that SpineQ v1.1 significantly surpassed both attendings (padj = 0.0236) and students (padj < 0.0001), with students performing markedly worse than all other groups (padj < 0.0001).

For the classification of moderate/severe versus mild or no fractures (Table 2D, Fig. 5D), AUROC was highest for SpineQ (0.966), followed by attendings (0.941) and by residents and DL models (> 0.93). Accuracy was high, with values > 0.99 for SpineQ, residents, and attendings and > 0.98 for DL models and students. Sensitivity and specificity were both highest for SpineQ (1.000 and 0.932, respectively). PPV was highest for SpineQ (0.998), with values > 0.98 for all groups. NPV was highest for SpineQ (1.00). In group comparisons using GLMM, SpineQ v1.1 achieved significantly better results than DL models (p = 0.0002), attendings (padj = 0.0175), and students (padj < 0.0001), and showed a trend toward outperforming residents (padj = 0.0566). Students again underperformed significantly compared to SpineQ v1.1 and residents (padj < 0.0001) and attendings (padj = 0.0032).

Weighted Cohen’s κ (SpineQ > 0.9; DL models, residents, and attendings all κ > 0.8) and Cohen’s κ for both levels of comparison showed almost perfect agreement with the reference standard (SpineQ > 0.9 for classification of moderate/severe versus mild or no fractures; all other κ for SpineQ, DL models, residents, and attendings > 0.8).

Lumbar vertebrae

Table 3A and Fig. 5E provide metrics for detecting any fracture in the lumbar subset. AUROC was highest for attendings (0.967), followed by residents and SpineQ (≈ 0.95). Accuracy was highest for SpineQ (0.991). Sensitivity was highest for SpineQ and DL models (0.999 and 0.994, respectively). Specificity was higher for all human rater groups than for SpineQ and DL models. PPV was highest for attendings (0.998), with high values for all groups (> 0.99, except DL models with 0.983). NPV was highest for SpineQ (0.989), followed by DL models (0.971). In group comparisons with GLMM, SpineQ v1.1 achieved significantly better results than all other groups: attendings (padj < 0.0001), residents (padj = 0.0001), and DL models (padj = 0.0156). DL models also showed a significant advantage over attendings (padj = 0.0047). Students consistently performed the worst among all groups (padj < 0.0001).

Table 3 Diagnostic performance (means of point estimates with 95% confidence intervals calculated using the Wilson method) for (A) discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and (B) differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1) on single vertebral level in the subset of lumbar vertebrae

In the task of identifying moderate or severe fractures (Table 3B and Fig. 5F), attendings showed the highest AUROC with 0.975, followed by SpineQ (0.968). These two also showed the highest accuracy (> 0.99). Sensitivity was high for all groups (> 0.99), and specificity was highest for attendings (0.956) and SpineQ (0.938). PPV was high with > 0.98 for all groups. NPV was highest for SpineQ with 0.974. Using GLMM for group comparison, SpineQ v1.1 again led in performance, significantly surpassing DL models (padj = 0.0004) and residents (padj = 0.0269). Students performed considerably worse than all other groups, including attendings, residents, SpineQ v1.1 (all padj < 0.0001) and DL models (padj = 0.0023).

Weighted Cohen’s κ showed almost perfect agreement with the reference standard for all groups except for students (> 0.8). Cohen’s κ for both levels of comparison demonstrated almost perfect agreement for SpineQ (> 0.9), DL models, residents and attendings (all > 0.8, except 0.921 for attendings for differentiation of moderate and severe fractures).

Across upper thoracic, lower thoracic, and lumbar regions, SpineQ v1.1 consistently achieved the highest accuracy and sensitivity, often outperforming DL models and all reader groups in GLMM comparisons (padj ≤ 0.0001 in most cases). AUROC varied by region, with DL models leading in the upper thoracic, residents and attendings in the lower thoracic and attendings in the lumbar subset for detection of any fracture. SpineQ dominated moderate/severe fracture detection across all subsets, in the lumbar subset comparable to attendings. Students consistently underperformed compared to all other groups.

Diagnostic performance at the patient level

Performance metrics for detecting any fracture at the patient level are detailed in Table 4A and Fig. 6A. Attendings, residents, and SpineQ showed the highest AUROC (≈ 0.94–0.95). SpineQ had the highest accuracy (0.952), followed by residents and attendings (≈ 0.94). Sensitivity was highest for SpineQ (0.963), whereas specificity was highest for attendings (0.970). Attendings had the highest PPV (0.990), followed by residents (0.978) and SpineQ (0.971). NPV was highest for SpineQ (0.897). In statistical group comparisons with GLMM, a trend was observed toward DL models performing worse than SpineQ v1.1 (padj = 0.0813). Students performed significantly worse than all other groups (padj < 0.001).
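A patient-level label can be derived from vertebra-level grades by taking the worst grade per patient. The Python sketch below assumes that a patient counts as positive when at least one vertebra reaches the grade threshold, in line with the "at least one fracture" wording used for the patient-level prevalence; the patient identifiers and grades are invented, and the function name is hypothetical.

```python
from collections import defaultdict


def patient_level_labels(vertebra_grades, threshold):
    """Collapse (patient_id, Genant grade) pairs into one boolean per patient.

    Assumption: a patient is positive if any vertebra has grade >= threshold.
    """
    worst = defaultdict(int)
    for patient_id, grade in vertebra_grades:
        worst[patient_id] = max(worst[patient_id], grade)
    return {pid: grade >= threshold for pid, grade in worst.items()}


if __name__ == "__main__":
    # Toy data (not study data).
    ratings = [("p1", 0), ("p1", 2), ("p2", 1), ("p2", 0), ("p3", 0)]
    print(patient_level_labels(ratings, threshold=1))  # any fracture
    print(patient_level_labels(ratings, threshold=2))  # moderate or severe
```

Applying the same aggregation to the reference standard and to each rater yields paired patient-level labels from which the metrics in this section can be computed.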

Fig. 6

Diagnostic performance on the patient level. Heatmaps representing diagnostic performance for (A) discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and (B) differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1) on the patient level. ACC, accuracy; NPV, negative predictive value; PPV, positive predictive value; Sens, sensitivity; Spec, Specificity

Table 4 Diagnostic performance (means of point estimates with 95% confidence intervals calculated using the Wilson method) for (A) discrimination of any fracture (Genant 1–3) from no fracture (Genant 0) and (B) differentiation of a “clinically most relevant” moderate or severe fracture (Genant 2 or 3) from no or mild fracture (Genant 0 or 1) on patient level

For the task of distinguishing moderate or severe fractures (Genant 2 or 3) from mild or no fractures, results are shown in Table 4B and Fig. 6B. AUROC was highest for SpineQ (0.973) and attendings (0.954), and accuracy was highest for SpineQ and residents (≈ 0.97). Residents showed the highest sensitivity (0.988), and SpineQ showed the highest specificity (0.959). PPV was highest for SpineQ (0.988) and attendings (0.984). NPV was highest for residents (0.956) and SpineQ (0.922). GLMM was applied to compare group ratings. DL models performed significantly worse than Bonescreen SpineQ v1.1 (p = 0.004) and residents (padj = 0.037). Students performed significantly worse than attendings (padj = 0.033), residents (padj = 0.001), and SpineQ v1.1 (padj < 0.001).

Cohen’s κ showed almost perfect agreement with the reference standard for SpineQ, DL models, attendings, and residents (all > 0.8, except 0.923 for SpineQ for differentiation of moderate and severe fractures).

In summary, at the patient level, SpineQ v1.1 achieved the highest accuracy (0.952) and sensitivity (0.963) for detecting any fracture, while attendings led in specificity (0.970) and PPV (0.990). For moderate/severe fractures, SpineQ showed the best AUROC (0.973) and specificity (0.959), with residents leading in sensitivity (0.988). GLMM comparisons confirmed SpineQ’s significant advantage over DL models (p ≤ 0.004) and students, who consistently underperformed.
