Inflammatory biomarkers and physiological reserve: an explainable machine learning model for predicting postoperative pulmonary complications in elderly laparoscopic surgery

Abstract

Introduction:

Postoperative pulmonary complications (PPCs) significantly impact the prognosis of elderly patients undergoing laparoscopic surgery, yet reliable tools for early risk stratification are lacking. This study aimed to develop and externally validate a machine learning (ML) model to predict PPCs using preoperative and intraoperative data available at the point of surgical closure.

Methods:

A multicenter retrospective cohort study was conducted involving 1,415 elderly patients (age >60 years) from two tertiary hospitals in China. The primary outcome was clinically significant PPCs (Clavien-Dindo Grade ≥ II) within 7 days postoperatively. Nine ML algorithms were trained and optimized using a nested 5-fold cross-validation framework. The Synthetic Minority Over-sampling Technique (SMOTE) and Boruta algorithm were employed to address class imbalance and feature selection, respectively. The model’s performance was evaluated in an internal development cohort and an independent external validation cohort (n=102).

Results:

Among the evaluated algorithms, the Gradient Boosting Machine (GBM) demonstrated superior performance, achieving an Area Under the Curve (AUC) of 0.691 (95% CI: 0.617–0.762; sensitivity 65.2%, specificity 83.4%) in the internal cohort. Notably, the model performed even better in the external validation cohort, with an AUC of 0.755 (95% CI: 0.652–0.849), indicating good generalizability without overfitting. Decision Curve Analysis (DCA) confirmed the model’s clinical utility, showing a higher net benefit than the default treat-all and treat-none strategies across a threshold probability range of 30%–90%. SHAP (SHapley Additive exPlanations) analysis identified surgery duration, preoperative albumin, and inflammatory markers (CRP, WBC) as the top predictors, reflecting the interplay between surgical stress and physiological reserve.

Conclusion:

The GBM-based dynamic model offers a robust, interpretable, and generalizable tool for the early prediction of PPCs in elderly laparoscopic surgery patients. By enabling risk assessment immediately upon surgical completion, this tool facilitates the shift from reactive treatment to proactive prevention and personalized perioperative management.

1 Introduction

Postoperative pulmonary complications (PPCs) are significant contributors to perioperative mortality and morbidity in noncardiac surgical patients (Fernandez-Bustamante et al., 2017; Miskovic and Lumb, 2017; Qaseem et al., 2006). Research indicates that the predictive value of PPCs for long-term mortality in noncardiac patients surpasses that of cardiac complications (Mills, 2018; NIHR Global Health Research Unit on Global Surgery and STARSurg Collaborative, 2024). Despite the perception that laparoscopic surgery minimizes the risk of PPCs because of its minimally invasive nature, elderly patients remain susceptible owing to factors such as comorbidities, pneumoperitoneum, specific patient positioning, and postoperative pain, with a reported incidence ranging from 20% to 40% (Pensier et al., 2025), particularly within the first postoperative week (Elefterion et al., 2024). PPCs lead to prolonged hospitalization, increased healthcare costs, and a substantial decline in quality of life and long-term survival (Alfirevic et al., 2023; Fernandez-Bustamante et al., 2017; Li et al., 2024; Mills, 2018; Moore et al., 2017; Pensier et al., 2025; Qaseem et al., 2006; Sun et al., 2023). While various risk scores exist, most rely on subjective preoperative assessments and fail to capture the physiological impact of surgical stress. Furthermore, although some studies suggest using postoperative laboratory markers for prediction, these often delay critical decision-making during the immediate transition from the operating room to the post-anesthesia care unit (PACU) or intensive care unit (ICU).

Current assessments of PPCs risk primarily utilize traditional tools like ARISCAT scores or logistic regression models. While valuable, these methods have notable limitations: they incorporate a limited range of risk factors and struggle to integrate the complex, multidimensional clinical data generated perioperatively, such as comorbidities in the elderly and intraoperative variables. Furthermore, traditional models are inadequate at managing complex nonlinear relationships and high-dimensional feature interactions, resulting in predictive power (AUC) typically between 0.70 and 0.80, which falls short of clinical risk stratification needs (Peng et al., 2022; Zorrilla-Vaca et al., 2023). In contrast, machine learning models, such as deep neural networks and CNNs, excel in automatic feature extraction and complex pattern recognition for high-dimensional, heterogeneous data. They effectively identify potential risk signals and capture complex nonlinear interactions, demonstrating significant advantages in clinical multi-domain risk prediction (Chau et al., 2025; Kim et al., 2023; Li et al., 2024; Liu et al., 2023; Sagar et al., 2022).

There is a shortage of high-quality research on the utilization of machine learning (ML) technology for predicting specific risks in elderly patients, a group particularly vulnerable to PPCs, notably in the context of laparoscopic surgery involving general anesthesia and endotracheal intubation (Liu et al., 2023; Yoon et al., 2024). Despite the minimally invasive nature of such surgeries, challenges to the respiratory system persist due to factors like pneumoperitoneum, unique patient positioning, and anesthesia management. Therefore, further comprehensive investigations are warranted to validate the efficacy and clinical significance of ML in this setting. A clinically useful model must provide timely results to guide early intervention. Integrating preoperative baseline characteristics with intraoperative surgical data offers a ‘real-time’ snapshot of a patient’s risk profile at the exact moment of surgical closure. This immediate risk stratification is particularly crucial for elderly patients, whose physiological reserve is limited and who require proactive respiratory management to prevent the onset of PPCs. Therefore, this study aimed to develop and validate a robust machine learning-based model for predicting PPCs in elderly patients undergoing laparoscopic surgery, utilizing only data available up to the point of surgical completion. By focusing on preoperative baseline factors and intraoperative physiological variables, we sought to provide an immediate and objective risk assessment tool. Our objective was to empower clinicians to identify high-risk individuals at the earliest possible stage, enabling the timely implementation of personalized lung-protective strategies and optimized postoperative disposition.

2 Methods

2.1 Study design and population

A retrospective cohort study was carried out on 1,415 elderly patients who underwent laparoscopic surgery with general anesthesia and endotracheal intubation at two tertiary (Class III) hospitals. Inclusion criteria comprised age over 60 years, ASA classification I-III, and various laparoscopic procedures. Exclusion criteria encompassed pre-existing severe pulmonary conditions (severe respiratory failure requiring mechanical ventilation or uncontrolled acute pulmonary infection), conversion to laparotomy, and substantial missing data. The study adhered to the Declaration of Helsinki and the ethical standards of the National Health Commission of China, and received approval from the Ethics Committee of Neijiang City First People’s Hospital (No. 2024-lun-17) and the Ethics Committee of the First Affiliated Hospital of Chongqing Medical University (No. 20195801). Because the data were anonymized, the requirement for patient consent was waived.

2.2 Data collection and grouping

Data were obtained from the Hospital Electronic Medical Records System (HIS) and the Anesthesia Clinical Information System (AIMS). The collected preoperative variables encompassed demographic characteristics (age, gender, BMI), lifestyle factors (smoking history), and ASA physical status classification. Preoperative laboratory indicators were recorded, including albumin, white blood cell count (WBC), C-reactive protein (CRP), neutrophil ratio, and lymphocyte percentage. Clinical baseline data included pre-existing lung conditions and other preoperative comorbidities. Intraoperative factors, which reflect the immediate impact of surgical stress, included the duration of anesthesia, the duration of surgery, and the specific analgesia method employed. To eliminate potential methodological bias and to ensure the model serves as an early-warning tool at the point of surgical closure, all data collected after surgery were excluded from the final analysis.

2.3 Outcome definitions and grouping

The primary outcome was postoperative pulmonary complications (PPCs) within 7 days post-surgery, defined using the 2018 BJA criteria (Abbott et al., 2018): pneumonia, respiratory failure, atelectasis, pleural effusion, bronchospasm, aspiration pneumonia, and acute respiratory distress syndrome (ARDS). Discrepancies in outcome adjudication were resolved by a third senior clinician. Furthermore, PPCs were stratified by severity according to the Clavien-Dindo classification, with clinically significant PPCs defined as Grade ≥ II. Patients were categorized into PPCs (n=310, 21.9%) and non-PPCs (n=1,105, 78.1%) groups based on the occurrence of PPCs.

Missing data were handled according to a rigorous protocol: variables with >40% missingness (e.g., respiratory mechanics) were excluded, and for variables with <10% missingness, Multiple Imputation by Chained Equations (MICE) was employed to minimize bias. We generated five imputed datasets (m=5) using predictive mean matching (PMM) for continuous variables and logistic regression for binary variables. The imputation model included all potential predictors listed in Table 1, along with the outcome variable (PPCs), to preserve the correlation structure. The final analysis was performed on each imputed dataset, and the results were pooled according to Rubin’s rules.
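The multiple-imputation workflow above can be sketched in Python. This is an illustrative sketch only: scikit-learn’s IterativeImputer is a MICE-style imputer but does not implement predictive mean matching, so the study’s actual pipeline (e.g., the R mice package) would differ in detail, and the toy data below are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy matrix: rows = patients, columns = (albumin, CRP, WBC);
# NaN marks a missing value.
X = np.array([
    [39.0, 10.0, 6.5],
    [np.nan, 8.0, 7.1],
    [41.2, np.nan, 6.9],
    [38.5, 12.3, np.nan],
    [40.1, 9.4, 6.8],
])

# m = 5 imputed datasets, one per random seed; sample_posterior=True draws
# from the posterior so imputations vary, mimicking multiple imputation.
imputed_sets = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]

# In the full analysis, the model would be fit on each imputed dataset and
# the estimates pooled via Rubin's rules; averaging here is only illustrative.
X_pooled = np.mean(imputed_sets, axis=0)
```

In the study itself, the imputation model also included the outcome variable and all Table 1 predictors to preserve the correlation structure.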

| Characteristic | Training set (n=990) | Internal validation (n=425) | External validation (n=102) | P-value | SMD (Train vs Int) | SMD (Train vs Ext) |
| --- | --- | --- | --- | --- | --- | --- |
| Age (years) | 72.73 ± 7.50 | 72.87 ± 7.98 | 71.25 ± 5.70 | 0.136 | 0.018 | 0.222 |
| BMI (kg/m²) | 22.35 ± 3.16 | 22.33 ± 3.17 | 22.36 ± 2.00 | 0.991 | 0.007 | 0.004 |
| Anesthesia duration (min) | 203.97 ± 91.68 | 212.72 ± 98.71 | 170.20 ± 28.29 | <0.001 | 0.092 | 0.498 |
| Surgery duration (min) | 165.90 ± 87.02 | 174.30 ± 92.63 | 143.33 ± 25.45 | 0.004 | 0.093 | 0.352 |
| Preop albumin (g/L) | 39.05 ± 6.19 | 39.11 ± 6.63 | 38.79 ± 4.28 | 0.899 | 0.009 | 0.049 |
| Preop CRP (mg/L) | 10.18 ± 29.39 | 8.01 ± 18.58 | 12.90 ± 12.86 | 0.159 | 0.088 | 0.12 |
| Preop hemoglobin (g/L) | 120.98 ± 22.04 | 121.09 ± 22.92 | 127.12 ± 14.88 | 0.025 | 0.005 | 0.327 |
| Preop WBC (10⁹/L) | 6.72 ± 3.43 | 6.50 ± 2.82 | 7.64 ± 1.96 | 0.005 | 0.07 | 0.33 |
| Preop neutrophil % | 67.56 ± 29.39 | 80.36 ± 292.48 | 72.58 ± 7.38 | 0.37 | 0.062 | 0.234 |
| Preop lymphocyte % | 22.62 ± 10.74 | 23.06 ± 9.87 | 20.91 ± 6.22 | 0.164 | 0.043 | 0.195 |
| Gender | | | | 0.281 | 0.09 | 0.058 |
| — Female | 364 (36.8%) | 175 (41.2%) | 40 (39.6%) | | | |
| — Male | 626 (63.2%) | 250 (58.8%) | 61 (60.4%) | | | |
| Smoking history | | | | 0.582 | 0.06 | 0.021 |
| — No | 804 (81.2%) | 335 (78.8%) | 82 (80.4%) | | | |
| — Yes | 186 (18.8%) | 90 (21.2%) | 20 (19.6%) | | | |
| Surgery type | | | | 0.427 | 0.033 | 0.105 |
| — 0 | 853 (86.2%) | 371 (87.3%) | 84 (82.4%) | | | |
| — 1 | 137 (13.8%) | 54 (12.7%) | 18 (17.6%) | | | |
| Surgical approach | | | | <0.001 | 0.037 | 1.508 |
| — 0 | 770 (77.8%) | 337 (79.3%) | 18 (17.6%) | | | |
| — 1 | 220 (22.2%) | 88 (20.7%) | 84 (82.4%) | | | |
| Surgical site | | | | <0.001 | – | – |
| — Gastrointestinal | 886 (89.5%) | 381 (89.6%) | 70 (68.6%) | | | |
| — Hepatobiliary | 47 (4.7%) | 22 (5.2%) | 6 (5.9%) | | | |
| — Urologic | 36 (3.6%) | 16 (3.8%) | 15 (14.7%) | | | |
| — Gynecologic | 21 (2.1%) | 6 (1.4%) | 11 (10.8%) | | | |
| Preop lung condition | | | | 0.003 | 0.014 | 0.332 |
| — Normal | 786 (79.4%) | 335 (78.8%) | 79 (77.5%) | | | |
| — Abnormal | 204 (20.6%) | 90 (21.2%) | 23 (22.5%) | | | |
| Comorbidities (excl. lung) | | | | 0.458 | 0.072 | 0.037 |
| — No | 744 (75.2%) | 306 (72.0%) | 75 (73.5%) | | | |
| — Yes | 246 (24.8%) | 119 (28.0%) | 27 (26.5%) | | | |
| Analgesia method | | | | <0.001 | – | – |
| — PCIA | 735 (74.2%) | 330 (77.6%) | 51 (50.0%) | | | |
| — PCEA | 208 (21.0%) | 74 (17.4%) | 32 (31.4%) | | | |
| — TAP | 20 (2.0%) | 11 (2.6%) | 19 (18.6%) | | | |
| — PCIA+TAP | 9 (0.9%) | 3 (0.7%) | 0 (0.0%) | | | |
| — NO | 18 (1.8%) | 7 (1.6%) | 0 (0.0%) | | | |
| PPCs outcome | | | | 0.702 | <0.001 | 0.084 |
| — Normal | 773 (78.1%) | 332 (78.1%) | 76 (74.5%) | | | |
| — Abnormal | 217 (21.9%) | 93 (21.9%) | 26 (25.5%) | | | |

Table 1. Baseline demographic and clinical characteristics of patients in the training, internal validation, and external validation cohorts.

Data are presented as mean ± standard deviation (SD) for continuous variables and number (percentage) for categorical variables. BMI, body mass index; CRP, C-reactive protein; WBC, white blood cell count; PPCs, postoperative pulmonary complications; SMD, standardized mean difference; PCIA, patient-controlled intravenous analgesia; PCEA, patient-controlled epidural analgesia; TAP, transversus abdominis plane block; NO, no analgesia. P-values indicate comparisons across the cohorts. An SMD < 0.1 typically indicates a negligible difference between groups (good balance).

2.4 Data processing and model construction

Categorical variables were coded based on preset rules. Data cleaning involved handling missing values and outliers. Continuous variables were normalized, and categorical variables were encoded. The dataset was split using stratified random sampling. The training set was oversampled using the SMOTE algorithm to balance classes. Feature selection was performed using the Boruta algorithm. To prevent data leakage, SMOTE was implemented only within the training folds during the nested 5-fold cross-validation process, ensuring the validation and test sets remained untouched and representative of the real-world incidence.
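The leakage-safe resampling pattern described above can be illustrated as follows. For simplicity, this sketch substitutes plain random oversampling of the minority class for SMOTE (imbalanced-learn’s SMOTE.fit_resample would slot into the same position), using hypothetical data with a class balance similar to the observed PPC incidence.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 80 + [1] * 20)  # ~20% minority class, like the PPC rate

def oversample(X_tr, y_tr, rng):
    """Duplicate minority-class rows until the classes are balanced.
    A simple stand-in for SMOTE, applied to the training fold only."""
    minority = np.flatnonzero(y_tr == 1)
    n_extra = int((y_tr == 0).sum() - minority.size)
    picks = rng.choice(minority, size=n_extra, replace=True)
    return (np.vstack([X_tr, X_tr[picks]]),
            np.concatenate([y_tr, np.ones(n_extra, dtype=int)]))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    # Resample *inside* the training fold only: no synthetic or duplicated
    # rows ever reach the held-out fold.
    X_res, y_res = oversample(X[train_idx], y[train_idx], rng)
    assert y_res.mean() == 0.5        # training fold is now balanced
    assert y[test_idx].mean() == 0.2  # held-out fold keeps the true incidence
```

The key design point is that the resampler is fit and applied after the fold split, so the held-out data retain the real-world class distribution.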

2.5 Model building and evaluation

We systematically evaluated nine supervised machine learning algorithms, including Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), Gradient Boosting Machine (GBM), and AdaBoost. Hyperparameters were fine-tuned through grid search and random search in conjunction with 5-fold cross-validation. Performance on the test set was assessed using multidimensional indicators: the Area Under the Curve (AUC) for discrimination, sensitivity for identifying positive cases (PPCs), and specificity for excluding non-PPCs. Comprehensive metrics such as accuracy and the F1 score, which balances precision and recall, were also utilized.
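As a concrete illustration of the tuning loop, the sketch below runs a small grid search for the GBM with 5-fold cross-validation on synthetic data. The grid values and dataset are hypothetical, as the study’s actual search space is not reported.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the perioperative feature matrix (~22% positives).
X, y = make_classification(n_samples=400, n_features=10, weights=[0.78],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Hypothetical grid; AUC is the selection criterion, as in the study.
param_grid = {
    "n_estimators": [100, 200],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=42),
                      param_grid, scoring="roc_auc", cv=5)
search.fit(X_tr, y_tr)

# score() applies the same ROC-AUC scorer to the refit best model.
test_auc = search.score(X_te, y_te)
```

In a nested cross-validation setup, this entire search would itself sit inside an outer loop so that model selection never sees the outer test folds.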

The optimal model was further analyzed to elucidate the relationship between each characteristic parameter and PPCs using SHAP (SHapley Additive exPlanations) interpretability analysis. This involved techniques such as beeswarm plots and heat maps to determine the importance and ranking of parameters. The model’s practical clinical utility was evaluated through decision curve analysis and clinical impact curve assessment.

All machine learning procedures, including MICE imputation, Boruta feature selection, and SMOTE-enhanced training, were performed using Python (v3.9) with the Scikit-learn and BorutaPy libraries to ensure reproducibility.

2.6 Statistical analysis

Statistical analyses of baseline characteristics were performed using SPSS software (version 29.0, IBM Corp., Armonk, NY, USA). Continuous variables were first tested for normality using the Kolmogorov-Smirnov test; normally distributed data were expressed as mean ± standard deviation and compared using the independent sample t-test, while non-normally distributed data were presented as median with interquartile range and analyzed using the Mann-Whitney U test. Categorical variables were reported as frequencies and percentages and compared using the Chi-square test or Fisher’s exact test, with a two-sided P-value < 0.05 considered statistically significant.
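The normality-driven choice between parametric and non-parametric tests described above can be sketched with SciPy; the group sizes and distributions below are hypothetical stand-ins for the baseline data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ppc_age = rng.normal(75, 8, size=40)       # hypothetical PPC group
no_ppc_age = rng.normal(72, 8, size=120)   # hypothetical non-PPC group

def compare_groups(a, b, alpha=0.05):
    """Kolmogorov-Smirnov normality check on each (standardized) group,
    then independent t-test if both pass, Mann-Whitney U otherwise."""
    normal = all(stats.kstest(stats.zscore(g), "norm").pvalue > alpha
                 for g in (a, b))
    if normal:
        return "t-test", stats.ttest_ind(a, b).pvalue
    return "mann-whitney", stats.mannwhitneyu(a, b).pvalue

test_used, p_value = compare_groups(ppc_age, no_ppc_age)
```

The same branching logic applies per variable; in practice SPSS performs the equivalent tests, and this sketch only mirrors the decision rule.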

Model performance evaluation and statistical plotting were implemented using Python (version 3.9). The discriminative ability of the models was assessed using the Area Under the Receiver Operating Characteristic Curve (AUC), sensitivity, specificity, accuracy, and F1-score, while calibration was evaluated using calibration curves and the Brier score. Clinical utility was quantified via Decision Curve Analysis (DCA) and Clinical Impact Curve (CIC), and model interpretability was achieved through SHAP values and a dynamic nomogram. Sample size adequacy was verified based on the “Events Per Variable” principle (EPV > 10).
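The discrimination and calibration metrics named above reduce to a few scikit-learn calls; the labels and probabilities below are hypothetical stand-ins for model output.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, confusion_matrix, roc_auc_score

# Hypothetical labels (1 = PPC) and predicted probabilities.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.2, 0.4, 0.7, 0.6, 0.8, 0.6, 0.3])

auc = roc_auc_score(y_true, y_prob)        # discrimination
brier = brier_score_loss(y_true, y_prob)   # calibration: mean squared error
                                           # of the predicted probabilities

y_pred = (y_prob >= 0.5).astype(int)       # dichotomize at a 50% threshold
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)               # recall for PPC cases
specificity = tn / (tn + fp)               # correct exclusion of non-PPCs
```

Accuracy and F1 follow from the same confusion-matrix counts, so a single set of predictions yields the full metric panel.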

3 Results

3.1 Baseline characteristics (patient demographics and clinical features)

A total of 1,415 eligible patients were included, divided into a training set (n=990), an internal validation set (n=425), and an independent external validation set (n=102) (Table 1). The training and internal validation sets were well balanced, with standardized mean differences (SMD) < 0.1 for most baseline variables, including age, BMI, and comorbidities. Notably, the incidence of PPCs was identical in both internal cohorts (21.9%; SMD < 0.001). In contrast, the external validation cohort exhibited significant heterogeneity compared with the training set, characterized by shorter anesthesia durations (SMD = 0.498) and substantial differences in surgical approach distributions (SMD = 1.508), reflecting the inherent procedural heterogeneity between surgical centers. Despite these clinical disparities, the incidence of PPCs in the external cohort (25.5%) did not differ significantly from that in the training set (P = 0.702). In terms of severity based on the Clavien-Dindo classification (Table 2), the majority of PPCs were Grade II (requiring pharmacological treatment), accounting for approximately 18%–20% of the total population. Given that Grade III-IV complications accounted for less than 5% of cases, the model’s predictive performance is primarily driven by Grade II complications, indicating its clinical utility in identifying patients who may benefit from early medical intervention.

| PPC category / outcome | Training set (n=990) | Internal test set (n=425) | External validation (n=102) |
| --- | --- | --- | --- |
| Total PPCs, n (%) | 217 (21.9%) | 93 (21.9%) | 26 (25.5%) |
| PPC subtypes (BJA criteria) | | | |
| — Pneumonia | 110 (11.1%) | 45 (10.6%) | 11 (10.8%) |
| — Atelectasis | 66 (6.7%) | 29 (6.8%) | 8 (7.8%) |
| — Respiratory failure | 22 (2.2%) | 10 (2.4%) | 4 (3.9%) |
| — Pleural effusion | 12 (1.2%) | 5 (1.2%) | 2 (2.0%) |
| — Bronchospasm | 4 (0.4%) | 3 (0.7%) | 1 (1.0%) |
| — ARDS/other | 3 (0.3%) | 1 (0.2%) | 0 (0.0%) |
| Severity (Clavien-Dindo grade) | | | |
| — Grade II (pharmacological treatment) | 180 (18.2%) | 77 (18.1%) | 20 (19.6%) |
| — Grade III (intervention needed) | 12 (1.2%) | 5 (1.2%) | 2 (2.0%) |
| — Grade IV (ICU/life-threatening) | 25 (2.5%) | 11 (2.6%) | 4 (3.9%) |

Table 2. Incidence and spectrum of postoperative pulmonary complications (PPCs) in the training, internal test, and external validation cohorts.

Data are presented as number (percentage). PPCs, postoperative pulmonary complications; BJA, British Journal of Anaesthesia; ARDS, acute respiratory distress syndrome; ICU, intensive care unit. Definition of severity: Grade II, complications requiring pharmacological treatment with drugs other than those allowed for Grade I; Grade III, complications requiring surgical, endoscopic, or radiological intervention; Grade IV, life-threatening complications (including CNS complications) requiring IC/ICU management.

3.2 Comparison of prediction performance of machine learning models

We evaluated the performance of nine machine learning algorithms in the development cohort. Figures 1A, B illustrate the Receiver Operating Characteristic (ROC) curves for the training and internal validation sets. Among all models, the Gradient Boosting Machine (GBM) demonstrated superior discriminative ability, achieving the highest AUC of 0.691 (95% CI: 0.617–0.762) in the internal validation set, with a sensitivity of 0.652 and a specificity of 0.834, followed by Random Forest (AUC = 0.669). The radar charts (Figures 1C, D) provide a comprehensive comparison of multiple metrics, including sensitivity, specificity, accuracy, and F1-score. The GBM model exhibited the most balanced performance profile, with a specificity of 0.834, outperforming the other classifiers. Based on these comprehensive evaluations, GBM was selected as the optimal model for further analysis.


Performance comparison of nine machine learning algorithms. (A, B) Receiver Operating Characteristic (ROC) curves in the training (A) and internal validation (B) sets. The evaluated algorithms include Logistic Regression (LR), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Decision Tree (DT), Naive Bayes (NB), Gradient Boosting Machine (GBM), and AdaBoost. The GBM model demonstrated superior discriminative ability. (C, D) Radar charts illustrating comprehensive performance metrics (AUC, sensitivity, specificity, accuracy, and F1-score) in the training (C) and internal validation (D) sets. ROC, Receiver Operating Characteristic; AUC, Area Under the Curve.

3.3 Clinical utility and calibration of the GBM model

To assess clinical applicability, Decision Curve Analysis (DCA) and Clinical Impact Curves (CIC) were performed (Figure 2). The DCA results (Figures 2A, B) indicate that the GBM model provides a higher net benefit than “treat-all” or “treat-none” strategies across a broad threshold range (approximately 30% to 90%), suggesting it can optimize intervention decisions without increasing unnecessary treatments. Furthermore, the CIC (Figures 2C, D) confirms the model’s diagnostic value: within high-probability thresholds, the number of predicted high-risk patients closely aligns with actual positive events, demonstrating high specificity in identifying those most likely to benefit from medical resources.
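For reference, the net-benefit quantity plotted in a decision curve is straightforward to compute. The sketch below uses hypothetical predictions with a 25% event rate to show why a treat-all strategy loses to a discriminating model at a 30% threshold.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Standard decision-curve net benefit of treating everyone whose
    predicted risk >= threshold: TP/N - FP/N * pt / (1 - pt)."""
    y_true = np.asarray(y_true)
    treat = np.asarray(y_prob) >= threshold
    n = y_true.size
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - (fp / n) * threshold / (1 - threshold)

y = np.array([1] * 25 + [0] * 75)  # hypothetical cohort, 25% event rate

# "Treat all": every patient exceeds the threshold, so all 75 non-events
# count as weighted false positives and the net benefit goes negative.
nb_treat_all = net_benefit(y, np.ones(100), 0.30)

# A model that ranks all events above the threshold keeps the full benefit.
probs = np.concatenate([np.full(25, 0.8), np.full(75, 0.1)])
nb_model = net_benefit(y, probs, 0.30)
```

The weight pt/(1 − pt) encodes how many false positives a clinician would accept per true positive at a given threshold, which is why net benefit depends on the chosen threshold range.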


Assessment of clinical utility in the training and validation sets. (A, B) Decision Curve Analysis (DCA) for the training and validation sets. The GBM model (colored lines) shows a higher net benefit than the “treat-all” (gray dashed) and “treat-none” (black solid) strategies across a wide threshold range. (C, D) Clinical Impact Curve (CIC). The red curve indicates the number of patients classified as high risk, and the blue dashed curve indicates the number of true positives per 1000 patients.

Regarding calibration and detailed classification accuracy (Figure 3), the model exhibited robust performance. Calibration curves (Figure 3A) showed good agreement between predicted and observed probabilities, with Brier scores of 0.091 in the training set and 0.178 in the validation set, indicating no severe overfitting. The confusion matrix for the internal validation cohort (Figure 3B, n=425) further revealed a balanced classification profile: the model correctly identified 42 PPC cases (true positives) and excluded 277 non-cases (true negatives). With 51 false negatives and 55 false positives, the model achieved a reasonable trade-off between sensitivity and specificity, supporting its reliability as a screening tool at the point of surgical closure.


Calibration and classification performance of the GBM model. (A) Calibration curves. The diagonal gray dashed line represents perfect calibration. The GBM model showed good calibration with Brier scores of 0.091 in the training set (blue squares) and 0.178 in the validation set (red circles). (B) Confusion matrix for the internal validation cohort (n = 425), displaying the counts of True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP). PPCS, Postoperative Pulmonary Complications; GBM, Gradient Boosting Machine.

3.4 External validation and generalizability

To verify the robustness and generalizability of the model, we evaluated its performance on an independent external validation cohort. Figure 4A displays the ROC curves for the external dataset, where the GBM model maintained good discriminative performance, with an AUC of 0.755 (95% CI: 0.652–0.849). The shorter anesthesia duration in the external set (170.20 ± 28.29 min vs. 203.97 ± 91.68 min) reflects variations in intraoperative management across centers, which should be considered when comparing performance between cohorts. The corresponding radar chart (Figure 4B
