Evaluating Deep Learning Sepsis Prediction Models in ICUs Under Distribution Shift: A Multi-Centre Retrospective Cohort Study

ABSTRACT

Sepsis remains a leading cause of mortality in intensive care units (ICUs) worldwide, underscoring the urgent need for early detection to improve patient outcomes. While artificial intelligence (AI) models trained on ICU data show promise for sepsis prediction, their clinical utility is frequently hampered by poor generalization under external validation, largely attributable to distribution shifts arising from heterogeneity in data. Prior studies have focused on direct model deployment or conventional transfer learning methods (e.g., fine-tuning), yet systematic exploration of alternative strategies and of the root causes of performance degradation remains limited. In this study, we quantify those distribution shifts across three harmonized adult ICU cohorts: the high-resolution HiRID database (Bern University Hospital, Switzerland; 29 698 stays, 2008–2019; 6.3 % sepsis), MIMIC-IV (Beth Israel Deaconess Medical Center, USA; 63 425 stays, 2008–2019; 5.2 % sepsis), and eICU (208 US hospitals; 123 413 stays, 2014–2015; 4.6 % sepsis), for a total of 216 536 stays and 10 846 sepsis cases. We then evaluate five deployment strategies across three model architectures (CNN, InceptionTime, LSTM) under four target-data regimes: none, small (< 8000 stays), medium (8000–32000), and large (> 32000). The strategies are direct generalization, standard transfer learning (fine-tuning / retraining), target training, supervised domain adaptation (DA) with maximum mean discrepancy (MMD) or correlation alignment (CORAL), and fusion training (merged datasets). Key results demonstrate that fine-tuning consistently underperforms across all data sizes (adjusted p < 0.05 vs. DA, fusion, and retraining), even though it has been the go-to method in many prior studies. Retraining and fusion training excel in small and large target domains, while supervised DA methods dominate in medium-sized datasets.
For example, DA with maximum mean discrepancy (DA MMD) achieves superior performance in both area under the receiver operating characteristic curve (AUROC = 0.720) and normalized area under the precision-recall curve (nAUPRC = 2.352) compared to fusion training (AUROC = 0.712, nAUPRC = 2.215; p = 0.02, adjusted p = 0.07). Retraining remains competitive (AUROC = 0.719, nAUPRC = 2.326; p > 0.05 vs. DA MMD) but lags in nAUPRC. Overall, our results call for moving beyond routine fine-tuning: retraining or fusion training is preferable when target data are scarce or abundant, whereas domain adaptation offers the most stable and substantial gains when moderate target data are available.
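The abstract names maximum mean discrepancy as one of the alignment penalties used for domain adaptation but does not give its form. As a rough illustration only, the sketch below computes the standard biased estimate of squared MMD with a Gaussian kernel between two samples of feature vectors. The function names, the kernel bandwidth, and the synthetic "site" data are assumptions made for this example; they are not taken from the study or from the YAIB code.

```python
import math
import random


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(source, source, bandwidth)
        + mean_kernel(target, target, bandwidth)
        - 2.0 * mean_kernel(source, target, bandwidth)
    )


def sample(n, dim, shift, rng):
    """Hypothetical stand-in for learned embeddings of ICU stays."""
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
site_a = sample(100, 4, 0.0, rng)        # synthetic source-site embeddings
site_a_again = sample(100, 4, 0.0, rng)  # same distribution, fresh draw
site_b = sample(100, 4, 1.5, rng)        # synthetic shifted target-site embeddings

mmd_same = mmd_squared(site_a, site_a_again, bandwidth=2.0)
mmd_shifted = mmd_squared(site_a, site_b, bandwidth=2.0)
print(f"same distribution:    {mmd_same:.4f}")
print(f"shifted distribution: {mmd_shifted:.4f}")
```

In a domain-adaptation setup of this kind, a term like `mmd_squared` would typically be computed on intermediate network representations of source and target batches and added, with a weighting factor, to the supervised prediction loss. How the study weights or schedules that term is not stated in the abstract.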

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This project was supported by grant 902 of the Strategic Focus Area Personalized Health and Related Technologies (PHRT) of the ETH Domain and by a Young Investigator Grant of the Novartis Foundation for Medical-Biological Research. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

This study was conducted using three publicly available, de-identified critical care datasets: MIMIC-IV, HiRID, and eICU. The MIMIC-IV database was approved by the institutional review boards (IRBs) of the Massachusetts Institute of Technology (MIT) and Beth Israel Deaconess Medical Center (BIDMC), and is made available under a data use agreement. All users are required to complete the CITI Data or Specimens Only Research course before accessing the data. The dataset is de-identified in accordance with the Health Insurance Portability and Accountability Act (HIPAA). The HiRID dataset was released with approval from the Ethics Commission of the Canton of Bern (KEK), Switzerland. The data are fully de-identified and were collected retrospectively from ICU patients at the Bern University Hospital. Informed consent was waived by the ethics committee due to the anonymized nature of the data. The eICU Collaborative Research Database is a multi-center critical care database developed by Philips Healthcare in collaboration with the MIT Laboratory for Computational Physiology. It is also de-identified in compliance with HIPAA and available under a data use agreement, requiring completion of the same CITI training course for access. Since all datasets used in this study are de-identified and publicly available, and the original data collection was approved by the respective ethics committees, no additional ethics approval was required for this secondary analysis.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The three publicly available, de-identified datasets (HiRID, MIMIC-IV, and eICU) are available from PhysioNet after successful completion of the CITI “Data or Specimens Only Research” course. Harmonized datasets can be generated using the publicly available code provided in the YAIB repository.

medRxiv - Intensive Care and Critical Care Medicine
