The experimental study on Multi-Teacher-Based Knowledge Distillation (MTKD) aims to evaluate the proposed method’s performance against state-of-the-art architectures across four datasets: DRIVE, CHASEDB1, CHUAC, and DCA1. The evaluation metrics include accuracy (ACC), sensitivity (SEN), specificity (SPE), F1-score (F1), and intersection over union (IOU). The proposed method, MTKD-UNet (ours), leverages multiple teacher networks, each specialized in learning different vessel characteristics (original, thin, and thick vessels), to enhance the student network’s segmentation performance. These results are reported in Tables 1 and 2.
Table 1 Multi-teacher KD performance comparison of methods on different datasets (DRIVE and CHASEDB1)

The proposed MTKD-UNet achieves competitive results across all metrics on the DRIVE dataset. Specifically, it attains an accuracy of 96.72, sensitivity of 80.22, specificity of 98.32, F1-score of 80.74, and IOU of 67.82. While these results are slightly lower than those of some state-of-the-art methods such as FR-UNet (ACC: 97.05, SEN: 83.56, F1: 83.16, IOU: 71.20) and SGL (SEN: 83.80, F1: 83.16), they are comparable to those of other methods such as U-Net and UNet++. The proposed method demonstrates strong specificity, indicating its ability to correctly identify non-vessel pixels, which is crucial for reducing false positives in medical imaging tasks.
For the CHASEDB1 dataset, MTKD-UNet achieves an accuracy of 97.26, sensitivity of 74.90, specificity of 98.76, F1-score of 77.48, and IOU of 63.28. While the sensitivity is lower than that of methods like FR-UNet (SEN: 87.98) and SGL (SEN: 86.90), the specificity is the second highest, indicating robust performance in identifying non-vessel regions. The F1-score and IOU are also competitive, suggesting that the proposed method effectively balances precision and recall, which is critical for accurate vessel segmentation.
Table 2 Multi-teacher KD performance comparison of methods on different datasets (CHUAC and DCA1)

On the CHUAC dataset, the proposed method achieves the highest accuracy (98.23) and specificity (99.40) among all methods, outperforming even FR-UNet (ACC: 98.03, SPE: 98.68) and matching U-Net's specificity (SPE: 99.40). The sensitivity (73.33) is lower than that of FR-UNet (SEN: 81.71), while the F1-score (76.02) is on par with it (F1: 76.01). The high specificity and accuracy suggest that the proposed method is particularly effective on the CHUAC dataset, which may have characteristics that benefit from the multi-teacher approach.
For the DCA1 dataset, MTKD-UNet (ours) achieves an accuracy of 97.84, sensitivity of 80.19, specificity of 98.84, F1-score of 79.38, and IOU of 66.0. The specificity is the highest among all methods, indicating excellent performance in identifying non-vessel regions. The sensitivity and F1-score are competitive, though slightly lower than FR-UNet (SEN: 82.48, F1: 80.22). The proposed method's ability to achieve high specificity while maintaining reasonable sensitivity and F1-score highlights its robustness in handling diverse datasets. Our method uniquely employs three teacher networks, each specialized in segmenting a different vessel type: original, thin, and thick vessels. This specialization allows the student network to distill rich, complementary knowledge that captures vessel heterogeneity more effectively than single-teacher or non-specialized methods. This targeted distillation enhances the student's ability to segment vessels of varying thicknesses, especially improving the challenging thin vessel segmentation, a known limitation of many prior works. Compared to classical U-Net and its variants (e.g., Attention U-Net), which rely on single-network learning and often struggle with thin vessel detection, our method achieves higher Dice and IoU scores by effectively leveraging multi-scale vessel information through knowledge distillation.
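To make the distillation setup concrete, the following is a minimal PyTorch sketch of a multi-teacher distillation objective under our stated assumptions; the function name, the MSE-based soft-label matching, and the equal weighting across teachers (alpha) are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, target,
                          temperature=3.0, alpha=0.5):
    """Combine a supervised loss with distillation from several frozen teachers.

    student_logits:      (B, 1, H, W) raw student outputs
    teacher_logits_list: list of (B, 1, H, W) raw outputs, one per teacher
                         (trained on original, thick, and thin vessel labels)
    target:              (B, 1, H, W) binary ground-truth mask (float)
    """
    # Supervised term: binary cross-entropy against the ground truth.
    sup_loss = F.binary_cross_entropy_with_logits(student_logits, target)

    # Distillation term: match the temperature-softened teacher probabilities.
    kd_loss = 0.0
    for t_logits in teacher_logits_list:
        soft_teacher = torch.sigmoid(t_logits.detach() / temperature)
        soft_student = torch.sigmoid(student_logits / temperature)
        # T^2 rescaling keeps gradient magnitudes comparable across temperatures.
        kd_loss = kd_loss + F.mse_loss(soft_student, soft_teacher) * temperature ** 2
    kd_loss = kd_loss / len(teacher_logits_list)

    return alpha * sup_loss + (1.0 - alpha) * kd_loss
```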
Multi-teacher based KD with penalization

This experiment applies a penalization technique to the student model's loss function to enhance segmentation performance. The loss obtained from the student network is multiplied by a penalty factor, which emphasizes the student's loss during training and can lead to more accurate segmentation. We systematically explored penalty values of 5, 10, 15, and 20, in increments of 5, to observe the effect on model performance. Performance improvements plateaued and, in some cases, slightly decreased when penalty values exceeded 20 (see Figure 3), which guided our decision to limit the exploration to this range. A penalty of 10 is found to be best for the DRIVE dataset, while a penalty of 5 is used for the other datasets (CHASEDB1, CHUAC, and DCA1).
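As a hedged illustration, the penalization can be realized as a simple scaling of the student's ground-truth loss term before it is combined with the distillation loss; the function and argument names below are hypothetical, and the exact split between terms is an assumption based on the description above.

```python
# Illustrative sketch of loss penalization (assumed form, not the
# paper's exact code): the student's ground-truth loss is multiplied
# by a penalty factor before being added to the distillation loss.
def penalized_total_loss(student_gt_loss, distillation_loss, penalty=10.0):
    # penalty was swept over {5, 10, 15, 20}; 10 worked best on DRIVE,
    # 5 on CHASEDB1, CHUAC, and DCA1.
    return penalty * student_gt_loss + distillation_loss
```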
On the DRIVE dataset, the penalized method, MTKD-UNet (oursLP), achieves an accuracy of 96.93, sensitivity of 79.54, specificity of 98.60, F1-score of 81.66, and IOU of 69.08. Compared to the non-penalized version, MTKD-UNet (ours) (ACC: 96.72, SEN: 80.22, SPE: 98.32, F1: 80.74, IOU: 67.82), the penalized method shows improvements in accuracy (+0.21 points), specificity (+0.28 points), F1-score (+0.92 points), and IOU (+1.26 points). However, sensitivity decreased slightly (\(-\)0.68 points). While the penalized method shows competitive results compared to state-of-the-art methods such as FR-UNet (ACC: 97.05, SEN: 83.56, F1: 83.16, IOU: 71.20) and SGL (SEN: 83.80, F1: 83.16), it does not outperform them. The high specificity indicates that the penalized method effectively reduces false positives.
For the CHASEDB1 dataset, MTKD-UNet (oursLP) achieves an accuracy of 97.54, sensitivity of 78.09, specificity of 98.83, F1-score of 79.95, and IOU of 66.62. Compared to the non-penalized version, MTKD-UNet (ours) (ACC: 97.26, SEN: 74.90, SPE: 98.76, F1: 77.48, IOU: 63.28), the penalized method shows improvements in all metrics: accuracy (+0.28 points), sensitivity (+3.19 points), specificity (+0.07 points), F1-score (+2.47 points), and IOU (+3.34 points). This demonstrates that penalization significantly enhances the model’s ability to segment both thin and thick vessels, particularly improving sensitivity and F1-score. Compared to state-of-the-art methods, MTKD-UNet (oursLP) does not surpass FR-UNet (ACC: 98.48, SEN: 87.98, F1: 81.51, IOU: 68.82) or SGL (SEN: 86.90, F1: 82.71). However, it achieves competitive specificity, outperforming FR-UNet by 0.69 points and SGL by 0.40 points. This indicates that the penalized method is robust in identifying non-vessel regions, even if it lags slightly in sensitivity and F1-score.
On the CHUAC dataset, the penalized method achieves an accuracy of 98.19, sensitivity of 75.34, specificity of 99.10, F1-score of 76.12, and IOU of 61.57. In comparison, our non-penalized version, MTKD-UNet, records an accuracy of 98.23, sensitivity of 73.33, specificity of 99.40, F1-score of 76.02, and IOU of 61.41. The penalized method shows a slight decrease in accuracy (by 0.04 points) and specificity (by 0.30 points), while it achieves improvements in sensitivity (by 2.01 points), F1-score (by 0.10 points), and IOU (by 0.16 points). This indicates that penalization enhances the model's ability to detect thin vessels (sensitivity) and overall segmentation performance (F1 and IOU), even if it slightly reduces specificity. Notably, MTKD-UNet (oursLP) outperforms FR-UNet by 0.11 points in F1-score and by 0.06 points in IOU. Additionally, it surpasses the other state-of-the-art methods in F1-score and IOU, suggesting a good balance between precision and recall.
For the DCA1 dataset, MTKD-UNet (oursLP) achieves an accuracy of 97.79, sensitivity of 84.48, specificity of 98.54, F1-score of 79.87, and IOU of 66.63. In comparison, the non-penalized version, MTKD-UNet (ours), shows an accuracy of 97.84, sensitivity of 80.19, specificity of 98.84, F1-score of 79.38, and IOU of 66.0. The penalized version shows a slight decrease in accuracy (by 0.05 points) and specificity (by 0.30 points), but improvements in sensitivity (by 4.29 points), F1-score (by 0.49 points), and IOU (by 0.63 points). This indicates that penalization enhances the model's ability to detect thin vessels (sensitivity) and overall segmentation performance (F1 and IOU), even if it slightly reduces specificity. Compared to state-of-the-art methods, MTKD-UNet (oursLP) demonstrates competitive results, particularly in sensitivity, where it outperforms FR-UNet (SEN: 82.48) by 2.00 points. While the F1-score and IOU are also competitive, they are slightly lower than those of FR-UNet. The high specificity of the penalized method indicates its effectiveness in reducing false positives, which is crucial for accurate segmentation in medical imaging.
In conclusion, MTKD-UNet (oursLP) demonstrates competitive performance across multiple datasets, particularly in terms of specificity and accuracy. The penalized version further improves sensitivity, making it a strong candidate for applications where detecting thin vessels is critical, and it generally performs better than the non-penalized MTKD-UNet (ours) across all datasets. The penalization technique appears to improve the student model's learning process, resulting in better segmentation outcomes. Overall, the proposed methods surpass the baseline U-Net model in most metrics, although the penalized method does not consistently outperform state-of-the-art methods like FR-UNet or SGL. Applying the LP technique produces distinct performance differences between MTKD-UNet (oursLP) and the non-penalized MTKD-UNet (ours). Consistently across several datasets, the LP variant demonstrates improved sensitivity, which we interpret as an enhanced capability to detect finer, thinner vessels. This improvement likely stems from the increased weight given to the ground-truth loss, compelling the model to minimize errors even on challenging positive examples (thin vessels). However, this heightened focus on capturing positive pixels sometimes leads to a slight decrease in specificity, suggesting an inherent trade-off: by becoming more aggressive at identifying vessel pixels (improving sensitivity), the model may occasionally misclassify background pixels as vessels, slightly increasing the false positive rate. Despite this potential minor decrease in specificity, the F1-score and Intersection over Union (IOU) generally improve with the LP technique, indicating that the benefit of correctly identifying more vessel pixels typically outweighs the cost of misclassifying slightly more background pixels.
Fig. 3 Effect of penalty on model performance for the DRIVE, CHASEDB1, CHUAC, and DCA1 datasets
Ablation studies

The ablation studies conducted in this work aim to systematically evaluate the impact of various components and configurations on the performance of the proposed method for RVS. These experiments are designed to provide insights into the effectiveness of different loss functions and the balance between model complexity and performance. They also examine the individual contributions of the teacher models, the sensitivity of the student model's architecture, and the influence of the temperature parameter on KD. The findings not only validate the design choices of the proposed framework but also offer valuable guidance for future research in KD and RVS.
Selection of the loss function

This ablation study examines how different loss functions influence the performance of the student model during KD. While the distillation loss is fixed, we evaluate the impact of various loss functions applied to the student network on overall knowledge transfer and segmentation performance. The primary objective is to analyze the influence of different loss functions on the student model's learning process and its ability to generalize from the soft labels provided by the teacher models. By varying the loss function, we aim to determine which loss components contribute most significantly to the student's F1 score and Intersection over Union (IoU) performance.
The experiment includes training the student model with eight different loss functions: DiceLoss, CombinedLoss, TverskyLoss, ComboLoss, SoftDiceLoss, DiceBCELoss, FocalLoss, and FocalTverskyLoss. The distillation loss function remains unchanged, ensuring that the comparison focuses solely on the impact of the student’s loss function. The performance metrics are evaluated on four datasets: DRIVE, CHASEDB1, CHUAC, and DCA1, as reported in Table 3.
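For reference, below are standard PyTorch formulations of two of these losses (DiceBCELoss and TverskyLoss); the paper's exact implementations may differ, so treat these as illustrative textbook versions.

```python
import torch
import torch.nn.functional as F

def dice_bce_loss(logits, target, smooth=1.0):
    # Region-level Dice term plus pixel-level binary cross-entropy.
    probs = torch.sigmoid(logits)
    inter = (probs * target).sum()
    dice = 1 - (2 * inter + smooth) / (probs.sum() + target.sum() + smooth)
    bce = F.binary_cross_entropy_with_logits(logits, target)
    return dice + bce  # equal weighting of the two terms (assumed)

def tversky_loss(logits, target, alpha=0.5, beta=0.5, smooth=1.0):
    probs = torch.sigmoid(logits)
    tp = (probs * target).sum()
    fp = (probs * (1 - target)).sum()
    fn = ((1 - probs) * target).sum()
    # alpha and beta trade off false positives against false negatives.
    return 1 - (tp + smooth) / (tp + alpha * fp + beta * fn + smooth)
```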
As shown in Table 3, the choice of loss function significantly impacts the student model's performance. For the DRIVE dataset, the ComboLoss achieves the highest F1 score (81.61) and IoU (69.02), indicating that a combination of Dice and cross-entropy losses provides the best balance between precision and recall. Similarly, for the CHASEDB1 dataset, the CombinedLoss yields the best F1 score (79.95) and IoU (66.62), suggesting that combining multiple loss functions can enhance the student's ability to learn from the teacher models.
Table 3 Performance comparison of different loss functions on the student network

For the CHUAC dataset, the DiceBCELoss achieves the highest F1 score (76.12) and IoU (61.57), demonstrating that a combination of Dice and binary cross-entropy losses is particularly effective for this dataset. The results indicate that a loss function that prioritizes pixel-wise accuracy and class balance improves performance on the CHUAC dataset. For the DCA1 dataset, the DiceBCELoss again yields the best performance, achieving an F1 score of 79.87 and an IoU of 66.63, further confirming its effectiveness across various datasets.
Interestingly, the FocalLoss and FocalTverskyLoss underperform compared to other loss functions, particularly on the CHUAC and DCA1 datasets. This suggests that while focal losses are designed to address class imbalance, they may not be as effective in KD, where the soft labels already provide a balanced representation of the teacher models’ predictions.
The ablation study demonstrates that the choice of loss function for the student network is a critical factor in KD. The ComboLoss and CombinedLoss consistently perform well across multiple datasets, indicating that combining multiple loss components can enhance overall knowledge transfer and segmentation performance. The DiceBCELoss also performs well, particularly on the angiography datasets. These findings suggest that future work could explore adaptive loss functions to further optimize distillation.
Impacts of complexity in student network

This ablation study aims to analyze the trade-off between the size of the student model and its performance in RVS tasks. The objective is to determine whether a smaller, more efficient student model (referred to as the "Lite Student") can achieve comparable performance to the original, larger student model while significantly reducing computational complexity. This study is especially important for applications in resource-constrained environments, such as mobile devices or real-time medical imaging systems, where computational efficiency is critical.
In this study, we compare the performance of the original student model, based on the standard U-Net architecture, with a Lite student model in which the number of feature maps in each layer is halved. The comparison is conducted across four datasets: DRIVE, CHASEDB1, CHUAC, and DCA1. The metrics used for evaluation are SEN, SPE, F1, and IOU (see Table 4). Additionally, we compare the computational efficiency of the two models in terms of the number of parameters (#Params), number of FLOPs (#FLOPs), and parameter size (see Table 5).
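The channel-halving idea can be sketched with a toy configurable U-Net, as below; this is not the paper's architecture, and the absolute parameter counts will not match Table 5, but it illustrates how halving every layer's width shrinks the model (roughly fourfold in this toy example).

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy U-Net; `base` sets the width of the first encoder stage."""
    def __init__(self, base=64, in_ch=3, out_ch=1):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.enc1 = block(in_ch, base)
        self.enc2 = block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec = block(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)                                   # full-resolution features
        e2 = self.enc2(self.pool(e1))                       # downsampled features
        d = self.dec(torch.cat([self.up(e2), e1], dim=1))   # skip connection
        return self.head(d)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

student = TinyUNet(base=64)   # original width
lite = TinyUNet(base=32)      # every feature map halved
print(n_params(student), n_params(lite))  # roughly a 4x reduction here
```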
The results reveal a clear trade-off between model size and performance. On the DRIVE dataset, the Lite student model shows a slight drop in performance, with sensitivity (SEN) decreasing by 0.82 (from 79.54 to 78.72) and specificity (SPE) falling by 0.42 (from 98.60 to 98.18). The F1-score decreases by 2.22 (from 81.66 to 79.44), and the IOU decreases by 3.17 (from 69.08 to 65.91). On the CHASEDB1 dataset, the Lite student model performs slightly better in terms of sensitivity (+2.78, from 78.09 to 80.87) but shows a minor drop in specificity (\(-\)0.38, from 98.83 to 98.45). The F1-score decreases by 0.74 (from 79.95 to 79.21), and the IOU decreases by 0.99 (from 66.62 to 65.63). However, on the CHUAC and DCA1 datasets, the Lite student model struggles, with a significant drop in sensitivity on DCA1 (\(-\)4.69). The F1-score decreases by 2.97 and 3.73, respectively, and the IOU decreases by 3.76 and 4.99, respectively. This suggests that reducing the model size may not always be advantageous, especially for datasets with more complex vessel structures.
Table 4 Performance comparison of Student and Lite Student models on the DRIVE, CHASEDB1, CHUAC, and DCA1 datasets. \(\triangle\) indicates the difference between the Student and Lite Student networks

In terms of computational efficiency, the Lite student model demonstrates significant improvements. For example, on the DRIVE dataset, the number of parameters is reduced from 31.04M to 0.487M, the number of FLOPs from 284.35G to 4.53G, and the parameter size from 124.17MB to 1.95MB. Similar reductions in the number of parameters and parameter size are observed across the other datasets, making the Lite student model highly efficient in terms of computational resources. The number of FLOPs is reduced by factors of roughly 65, 16, and 85 for CHASEDB1, CHUAC, and DCA1, respectively.
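One way such complexity figures can be obtained is sketched below, assuming PyTorch and the thop profiler (the paper does not state which profiling tool was used); it reuses the TinyUNet from the previous sketch, and the DRIVE-like input size is an assumption.

```python
# Hedged sketch of measuring #Params, #FLOPs, and parameter size.
# Note: thop's `profile` counts multiply-accumulates, which are
# commonly reported as FLOPs.
import torch
from thop import profile  # pip install thop

def complexity(model, input_shape=(1, 3, 565, 584)):  # DRIVE-like size (assumed)
    x = torch.randn(*input_shape)
    flops, params = profile(model, inputs=(x,), verbose=False)
    size_mb = sum(p.numel() * p.element_size() for p in model.parameters()) / 2**20
    return params / 1e6, flops / 1e9, size_mb  # M params, GFLOPs, MB

print(complexity(TinyUNet(base=64)))  # original-width toy student
print(complexity(TinyUNet(base=32)))  # Lite toy student
```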
Our analysis also explores how dataset characteristics contribute to performance differences with a simplified student network. We examine factors such as image resolution, vessel tortuosity and thickness, contrast, noise levels, and imaging modality across the DRIVE, CHASEDB1, CHUAC, and DCA1 datasets. For example, datasets with complex, low-contrast, or finer-caliber vessels (like CHUAC and DCA1) tend to degrade the performance of a lightweight model, which struggles to learn detailed features, whereas datasets with more uniform and well-defined vessel representations exhibit a smaller performance drop. We highlight that DRIVE and CHASEDB1 comprise retinal fundus images, while CHUAC and DCA1 come from Coronary Angiography (CA). This difference significantly influences performance, as CA images face challenges such as lower signal-to-noise ratios, dynamic contrast flow, and vessel morphology variations compared to the relatively static and well-defined structures in fundus images. Consequently, the "Lite Student" model, with its halved feature-map capacity, may struggle more significantly with the inherent complexities and greater variability within CA datasets like CHUAC and DCA1. Moreover, we analyze how the Lite model's reduced capacity affects its ability to capture features. Interestingly, its sensitivity can improve in some cases (e.g., CHASEDB1, CHUAC) even as F1 and IoU decrease, possibly indicating a trade-off in which the model learns generalized features that enhance recall but reduce precision on specific datasets.
In conclusion, this ablation study highlights the trade-off between model size and performance. While the Lite student model is computationally efficient, it may not always achieve the same level of performance as the original student model, particularly on more complex datasets. The Lite student model is a viable option for applications where computational efficiency is critical. However, the original student model may be preferable for tasks requiring high accuracy. Future work could focus on further optimizing the Lite student model using advanced compression techniques or architecture search to bridge the performance gap while maintaining efficiency.
Table 5 Comparison of the computational complexities of the Student and Lite Student networks on RVS datasets. The number of parameters is reported in millions, the parameter size in megabytes, and the number of floating-point operations (FLOPs) in giga-operations

Teacher Model Contribution Analysis

This ablation study investigates the individual contributions of each teacher model within the proposed Multi-Teacher Knowledge Distillation (MTKD) method for RVS. The primary objective is to determine whether combining multiple teachers provides complementary knowledge to the student network or whether a single teacher dominates the learning process. This is crucial for understanding the efficiency of the multi-teacher approach and identifying potential redundancies or imbalances in the knowledge transfer process.
The experimental design involves training the student network using only one teacher at a time. Specifically, three separate training runs are conducted, each utilizing a single teacher model: Teacher 1 (T1) is trained with the original ground truth, Teacher 2 (T2) with thick vessels, and Teacher 3 (T3) with thin vessels. The performance of the student network trained with each individual teacher is then compared to that of the student network trained with all three teachers (the proposed MTKD approach). The evaluation metrics used for comparison are SEN, F1, and IOU, reported for the DRIVE and CHUAC datasets in Figure 4.
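In code, this ablation amounts to restricting the teacher set passed to the distillation loss. The sketch below reuses the multi_teacher_kd_loss function from the earlier sketch, with dummy tensors standing in for real network outputs; the configuration names mirror those in Figure 4.

```python
import torch

# Dummy logits standing in for real network outputs (shape: B, 1, H, W).
student_logits = torch.randn(2, 1, 64, 64)
t1, t2, t3 = (torch.randn(2, 1, 64, 64) for _ in range(3))
target = torch.randint(0, 2, (2, 1, 64, 64)).float()

# T1: original ground truth; T2: thick vessels; T3: thin vessels.
configs = {"T1+S": [t1], "T2+S": [t2], "T3+S": [t3],
           "MTKD": [t1, t2, t3]}
for name, teachers in configs.items():
    loss = multi_teacher_kd_loss(student_logits, teachers, target)
    print(name, float(loss))
```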
Fig. 4 Evaluation of individual teacher contributions to student model performance: comparison of single-teacher training (T1+S, T2+S, T3+S) to the multi-teacher setup across evaluation metrics (SEN, F1, IOU) for the DRIVE and CHUAC datasets. T1, T2, and T3 represent teacher models trained with original, thick, and thin ground truths. The text highlighted in red illustrates the differences between the proposed method and other approaches
The results presented in Figure 4 indicate that the proposed multi-teacher approach consistently outperforms single-teacher configurations across all metrics and datasets. For the DRIVE dataset, the proposed method achieves a sensitivity of 79.54, an F1-score of 81.66, and an IOU of 69.08. These values are higher than those obtained when training with individual teachers. Specifically, the sensitivity is improved by 0.7, 2.48, and 1.95 when compared to T1+S, T2+S, and T3+S, respectively. Similarly, the F1-score shows improvements of 0.2, 0.96, and 0.84, and the IOU is enhanced by 0.24, 1.22, and 1.18 for T1+S, T2+S, and T3+S, respectively. The CHUAC dataset also shows superior performance with the proposed method, achieving a sensitivity of 75.34, an F1-score of 76.12, and an IOU of 61.57. These results are improved by 3.31, 2.86, and 7.91 for SEN, 1.1, 0.79, and 3.7 for F1, and 1.39, 1.0, and 4.56 for IOU compared to T1+S, T2+S, and T3+S, respectively.
These findings suggest that the multi-teacher approach provides complementary knowledge to the student network. We interpret this outcome as evidence that each teacher develops unique expertise, having been trained on a specific aspect of the vessel structure (overall, thick, or thin parts), and that together these specialized teachers share their distinct knowledge with the student model. For instance, the teacher trained on thin vessels becomes adept at identifying fine structures, the teacher trained on thick vessels excels at segmenting the larger ones, and the 'original' teacher provides a general overview. This collaboration generates synergy: the student model benefits from a diverse set of perspectives, integrating specialized knowledge from each teacher to attain a more comprehensive and accurate understanding of the entire retinal vasculature. We conclude that no single teacher dominates the learning process and that the combination is essential for optimal performance, underscoring the effectiveness of the proposed MTKD framework in leveraging the diverse expertise of multiple teachers to improve the robustness and accuracy of RVS.
Student Model Architecture Sensitivity

This ablation study examines how varying the student model's architecture impacts the KD process. The architecture of the student model is a critical factor in determining its ability to learn effectively from the teacher models. The study evaluates the performance of five different architectures (ResNet-18, ResNet-50, MobileNet v2, EfficientNet, and DenseNet-121) on the DRIVE and CHUAC datasets, using SEN, F1, and IoU as evaluation metrics. We use U-Net as the base segmentation architecture and change only the encoder/decoder networks.
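One convenient way to realize such encoder swaps is via the segmentation_models_pytorch package, sketched below; the paper does not name its implementation, so this is an illustrative substitute, and the encoder identifiers (e.g., "efficientnet-b0") are assumptions about which variants were used.

```python
import segmentation_models_pytorch as smp

# Build U-Net students that differ only in their encoder backbone.
encoder_names = ["resnet18", "resnet50", "mobilenet_v2",
                 "efficientnet-b0", "densenet121"]
students = {
    name: smp.Unet(encoder_name=name,
                   encoder_weights=None,  # train from scratch
                   in_channels=3, classes=1)
    for name in encoder_names
}
```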
The results presented in Figure 5 indicate that the choice of student model architecture significantly impacts performance. On the DRIVE dataset, the proposed architecture achieves the highest F1 score (81.66) and IoU (69.08), outperforming all other architectures. The differences between the proposed and other architectures can be up to 4.52 points for SEN, 3.62 points for F1, and 5.04 points for IOU. It also demonstrates strong sensitivity (79.54), indicating its ability to accurately detect true positives. Among the alternative architectures, DenseNet-121 and ResNet-50 perform relatively well, with F1 scores of 79.44 and 79.40, IoUs of 65.99 and 65.91, and sensitivities of 83.12 and 82.95, respectively. These results suggest that deeper architectures or those with dense connections are more effective in capturing the complex features required for RVS. In contrast, lightweight architectures such as MobileNet v2 and EfficientNet show lower performance, with F1 scores of 78.23 and 78.04, IoUs of 64.30 and 64.04, and sensitivities of 81.45 and 81.20, respectively. This indicates that while these architectures are computationally efficient, they may not have the capacity to capture the complex details of retinal vessels fully.
Fig. 5 Evaluation of the student architecture sensitivity with different encoder/decoder networks across evaluation metrics (SEN, F1, IOU) for the DRIVE and CHUAC datasets. The text highlighted in red illustrates the differences between the proposed method and others
The proposed model on the CHUAC dataset demonstrates significant improvements over the other architectures. It achieves the highest sensitivity (75.34), outperforming ResNet-18 by 4.36 points and MobileNet v2 by 10.3 points. For the F1 score, the proposed model achieves 76.12, with DenseNet-121 being the closest competitor at 73.35, a difference of 2.77 points. MobileNet v2 again shows the largest gap, with an F1 score 9.38 points lower than the proposed model. The IoU metric follows a similar trend, with the proposed model achieving 61.57, outperforming DenseNet-121 by 3.44 points and MobileNet v2 by 10.99 points. This further highlights the limitations of lightweight architectures in handling complex segmentation tasks.
Architectures such as DenseNet-121 and ResNet-50 also perform relatively well, as their depth and dense connections are more effective in capturing the complex features required for RVS. Conversely, lightweight architectures such as MobileNet v2 and EfficientNet, despite their computational efficiency, may lack the capacity to fully capture the fine details of retinal vessels. The study also highlights the importance of architectural compatibility between teacher and student models: student architectures closer in nature or capacity to the teachers, such as ResNet-18 and ResNet-50, generally perform better than more dissimilar ones, such as MobileNet v2 and EfficientNet, suggesting that architectural similarity enables better knowledge transfer during distillation. The likely explanation is that architectures with sufficient capacity (like ResNet and DenseNet) are better suited to the inherent complexity of the RVS task and can more effectively receive and integrate the knowledge distilled from the similarly complex U-Net-based teacher ensemble.
Temperature Tuning

In this ablation study, we investigate the impact of the temperature parameter (T) on the performance of the student model. The temperature parameter plays a critical role in KD by controlling the smoothness of the soft labels generated by the teacher models: a higher temperature produces softer probability distributions, which can help the student model generalize better, while a lower temperature results in sharper distributions, which may lead to overfitting.
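The smoothing effect is easy to see numerically. The sketch below (assuming sigmoid outputs, as in binary vessel segmentation, with hypothetical logit values) shows how increasing T pushes probabilities toward 0.5, exposing the teachers' relative confidences rather than hard 0/1 decisions.

```python
import torch

def soften(logits: torch.Tensor, T: float) -> torch.Tensor:
    # Divide logits by the temperature before the sigmoid.
    return torch.sigmoid(logits / T)

logits = torch.tensor([4.0, -2.0, 0.5])  # hypothetical teacher logits
for T in (1.0, 2.0, 3.0, 5.0, 7.0):
    print(f"T={T}: {soften(logits, T).tolist()}")
```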
The primary objective of this ablation study is to analyze how varying the temperature parameter affects the student model’s performance across different datasets. By tuning the temperature, we aim to identify an optimal value that maximizes the transfer of knowledge from the teacher models to the student model. This improvement is expected to boost sensitivity, specificity, F1 score, and Intersection over Union.
The experiment includes training the student model with different temperature values (T = 2.0, 3.0, 4.0, 5.0, 7.0) while keeping other hyperparameters fixed. We have selected these temperature values based on insights from Cho and Hariharan [33] and recent studies on knowledge distillation [35, 9], and systematically evaluated temperature values ranging from 2.0 to 7.0. The performance metrics (Sensitivity, Specificity, F1 score, and IoU) are evaluated on the DRIVE and CHUAC datasets, as reported in Table 6. The goal is to observe how the temperature parameter influences the student model’s ability to learn from the soft labels provided by the teacher models.
Table 6 Performance of Multi-Teacher knowledge distillation with varying temperature values. The bold values indicate the best results for each metric and dataset

The results presented in Table 6 highlight the significant impact of the temperature parameter on the performance of the student model. Furthermore, Figure 6 illustrates how the F1 score and IoU change with different temperature settings across our datasets. For the DRIVE dataset, the best performance is achieved at \(T = 3.0\), with an F1 score of 81.61 and an IoU of 69.02. This indicates that a moderate temperature value provides an optimal balance between soft-label transfer and the student's learning capability. In contrast, higher temperatures (\(T = 4.0\), \(T = 5.0\), and \(T = 7.0\)) lead to decreased performance, suggesting that excessively soft labels can impede the student's ability to learn distinctive features.
For the CHUAC dataset, the optimal temperature is also \(T = 3.0\), yielding the highest F1 score (76.12) and IoU (61.57). Interestingly, at \(T = 4.0\), the sensitivity peaks at 84.83; however, this comes at the cost of a lower F1 score and IoU. This suggests that while the model detects more true positives, it also produces more false positives, reducing its overall effectiveness.
The ablation study demonstrates that the temperature parameter is a critical factor in KD. A temperature of T = 3.0 consistently provides the best balance between soft label transfer and the student model’s learning efficiency across both datasets. This finding aligns with the theoretical understanding that moderate temperatures improve generalization by smoothing probability distributions without significantly reducing discriminative information. Future work could explore adaptive temperature scheduling or dataset-specific tuning to further optimize the distillation process.
Fig. 6 Effect of temperature on model performance for the DRIVE and CHUAC datasets