Importance Artificial intelligence (AI) and statistical models designed to predict same-admission outcomes for hospitalized patients, such as inpatient mortality, often rely on International Classification of Diseases (ICD) diagnostic codes, even though these codes are not finalized until after hospital discharge.
Objective Investigate the extent to which the inclusion of ICD codes as features in predictive models inflates performance metrics via "label leakage" (e.g., including the ICD code for cardiac arrest in an inpatient mortality prediction model) and assess the prevalence and implications of this practice in the existing literature.
Design Observational study of the MIMIC-IV deidentified inpatient electronic health record database and literature review.
Setting Beth Israel Deaconess Medical Center.
Participants Patients admitted to the hospital with either an emergency department or an ICU stay between 2008 and 2019.
Main outcome and measures Using a standard training-validation-test split, we developed multiple multivariable AI prediction models for inpatient mortality (logistic regression, random forest, and XGBoost) using only patient age, sex, and ICD codes as features. We evaluated these models on the test set using the area under the receiver operating characteristic curve (AUROC) and examined variable importance. Next, through a systematic literature review, we determined the percentage of published multivariable prediction models using MIMIC that included ICD codes as features.
Results The study cohort consisted of 180,640 patients (mean age 58.7 years, range 18-103; 53.0% female), of whom 8,573 (4.7%) died during the inpatient admission. The multivariable prediction models using ICD codes predicted in-hospital mortality with high performance in the test dataset (AUROCs: 0.97-0.98) across logistic regression, random forest, and XGBoost. The most important ICD codes were 'Brain death', 'Cardiac arrest', 'Encounter for palliative care', and 'Do not resuscitate status'. The literature review found that 40.2% of studies using MIMIC to predict same-admission outcomes included ICD codes as features, even though both the MIMIC publications and documentation clearly state that ICD codes are derived after discharge.
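The performance inflation reported above can be reproduced on synthetic data. The sketch below is not the authors' pipeline; the cohort size, the 60% code-assignment rate, and all variable names are illustrative assumptions. It trains one logistic regression on pre-admission features only (age, sex) and one that additionally includes a "leaky" binary feature assigned only after the outcome, analogous to a post-discharge ICD code such as cardiac arrest, and compares test-set AUROCs.

```python
# Synthetic demonstration of label leakage (illustrative only, NOT the
# authors' pipeline): a feature recorded after the outcome inflates AUROC.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 20_000
age = rng.normal(60, 15, n)            # pre-admission feature
sex = rng.integers(0, 2, n)            # pre-admission feature

# True mortality depends weakly on age; base rate ~5%
p = 1 / (1 + np.exp(-(0.03 * (age - 60) - 3)))
died = rng.random(n) < p

# "Leaky" code: assigned after the outcome is known, present in 60% of deaths
# and never in survivors (assumed rate, for illustration).
leaky_code = (died & (rng.random(n) < 0.6)).astype(float)

X_clean = np.column_stack([age, sex])
X_leaky = np.column_stack([age, sex, leaky_code])

aucs = {}
for name, X in [("clean", X_clean), ("leaky", X_leaky)]:
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, died, test_size=0.3, random_state=0
    )
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: AUROC = {aucs[name]:.2f}")
```

The model with the leaky feature scores markedly higher, yet that advantage is unusable at prediction time because the code does not exist until after discharge, which is the core methodological flaw the study documents.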
Conclusions and relevance Using ICD codes as features in same-admission prediction models is a severe methodological flaw that inflates performance metrics and renders the model incapable of making clinically useful predictions in real-time. Our literature review demonstrates that the practice is unfortunately common. Addressing this challenge is essential for advancing trustworthy AI in healthcare.
Question Do International Classification of Diseases (ICD) diagnostic codes, which are only finalized after hospital discharge, artificially inflate the performance of AI healthcare prediction models?
Findings In a systematic literature review, 40.2% of published models trained to predict same-admission outcomes on the benchmark MIMIC dataset used ICD codes as features, despite both MIMIC papers clearly stating that these codes become available only after discharge. Prediction models for inpatient mortality trained on ICD codes alone in the MIMIC-IV dataset predicted in-hospital mortality with high accuracy (AUROCs: 0.97-0.98). The most important codes are not available in time for any clinically useful mortality prediction (e.g., "Brain death" and "Encounter for palliative care").
Meaning ICD codes are frequently used in inpatient AI prediction models for outcomes during the same admission, rendering their output clinically useless. To ensure AI models are both reliable and clinically deployable, greater diligence is needed in identifying and preventing label leakage.
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was funded in part by the National Institutes of Health, specifically grant number R00NS114850 to BKB. Additionally, we would like to thank the University of Chicago Center for Research Informatics (CRI) High-Performance Computing team. The CRI is funded by the Biological Sciences Division at the University of Chicago with additional funding provided by the Institute for Translational Medicine, CTSA grant number UL1 TR000430 from the National Institutes of Health.
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
This study performed secondary analyses of the MIMIC database available via Physionet (https://physionet.org/) and a systematic review of published literature.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
Data used in this study are available in the supplemental materials (literature review) and via Physionet (https://physionet.org/).