Although health care institutions generate vast volumes of health and clinical tabular data on a daily basis, these valuable resources frequently encounter significant bottlenecks when being translated into effective predictive models [,]. In practice, suboptimal model performance is often attributable not to the limitations of the algorithms themselves but to the lack of consistency and strategic planning in the data preprocessing pipeline [,]. Prior studies have demonstrated that model performance is highly sensitive to preprocessing techniques such as data imputation, normalization, and feature selection [,], which may even introduce bias or lead to overfitting []. Consequently, a critical yet insufficiently addressed question arises: Can we establish a generalizable and reproducible data processing framework that assists users in selecting appropriate machine learning (ML) predictive models?
In the domain of medical informatics, ML models have been extensively used for predictive analyses of health and clinical data. However, extant literature has predominantly focused on comparing algorithmic accuracy and model performance [,], while offering limited systematic investigation into critical data processing steps such as data cleaning, feature encoding, and handling of class imbalance [,]. This oversight is particularly consequential given the inherent characteristics of medical datasets, which frequently exhibit high rates of missing values, numerous categorical variables, and severely imbalanced outcome distributions—factors that significantly influence predictive outcomes depending on the preprocessing choices made [,].
Despite the enormous volume of clinical and health data generated daily by health care institutions, transforming these data into effective predictive models remains challenging due to pervasive issues such as high missingness, class imbalance, and the need for robust variable encoding. Although prior studies have examined individual procedures such as comparing one-hot versus target encoding [,] and evaluating synthetic data augmentation techniques like the Synthetic Minority Over-sampling Technique (SMOTE) [], systematic evaluations of the interactive effects among diverse preprocessing workflows are exceedingly scarce, particularly across heterogeneous medical datasets. Furthermore, there is a notable gap in comparative research on encoding strategies (eg, one-hot, frequency, and target encoding) and imbalance correction methods (eg, SMOTE and Random Over Sampling Example [ROSE]), which hampers the establishment of best practices in clinical ML.
This study proposes a comprehensive, reproducible framework specifically for medical tabular data. The framework is built upon three core components: (1) the Data Processing Strategy Layer, which systematically evaluates essential preprocessing techniques, including missing value imputation, variable encoding, and class imbalance correction; (2) the Model Selection and Optimization Layer, which ensures compatibility with a diverse range of supervised learning algorithms; and (3) Cross-Dataset Validation, which tests the framework’s transferability and consistency on 2 highly heterogeneous real-world clinical datasets. This design not only streamlines the preprocessing pipeline but also minimizes overlap with subsequent methodological details.
The primary contribution of this work lies in developing and empirically validating an end-to-end data processing framework that transcends the limitations of single-model, single-dataset analyses. By shifting the focus from solely model-centric performance metrics to a holistic methodological architecture, our approach provides both data scientists and clinical researchers with a modular and standardized workflow. This framework is expected to bridge the gap between algorithm development and clinical application, offering robust empirical evidence and actionable guidance for advancing predictive modeling in health care.
Initially, we enrolled 542 adult patients with ESRD undergoing hemodialysis at the hemodialysis unit of a medical center between October 1, 2018, and December 31, 2021. Patients who had received hemodialysis for less than 3 months or who were transferred to other clinics during the study period were excluded. After these exclusions, 412 adult patients with ESRD undergoing chronic hemodialysis without transfer remained eligible for analysis. Among them, 242 patients had no occurrence of major adverse cardiovascular events (MACEs), while 170 experienced at least one MACE. The primary outcome of this study was the occurrence of MACEs in this population. A flowchart of the study participants is presented in Figure 1. The detailed baseline demographic and clinical characteristics of the cohort are summarized in Table 1.
Figure 1. Flowchart of study population selection and major adverse cardiovascular event (MACE) grouping. ESRD: end-stage renal disease; MACE: major adverse cardiovascular event.

Table 1. Initial demographic and clinical profiles of the research cohort. Columns: Variables; Overall (n=412); MACE (never occurred vs occurred); P value.

a MACE: major adverse cardiovascular event.
b AV calc: aortic valve calcification.
c AR: aortic regurgitation.
d AS: aortic stenosis.
e LVH: left ventricular hypertrophy.
f DM: diabetes mellitus.
g PAOD: peripheral arterial occlusive disease.
h CXR_AoAC: chest X-ray for aortic arch calcification.
This study specifically targeted the unique clinical needs and complexities of patients with ESRD by collecting 84 variables, meticulously selected based on their significant impact on clinical outcomes in this population. Demographic data, such as age and gender, were included, along with dialysis vintage, which reflects the duration and history of each patient's dialysis treatment. Additionally, we detailed the anatomical and functional characteristics of the aortic and mitral valves, which are crucial for understanding cardiovascular complications in patients with ESRD. Specifically, we included parameters such as types of arteriovenous access (AVA), mitral valve calcification (MV calc), aortic regurgitation (AR), aortic stenosis (AS), mitral regurgitation (MR), and mitral stenosis (MS).
In terms of cardiovascular health, this study placed particular emphasis on the grading and types of left ventricular hypertrophy (LVH) [] and the ejection fraction (EF) of the heart, which are critical indicators of cardiovascular health in patients with ESRD. Given the high prevalence and impact of comorbidities such as diabetes mellitus, hypertension, dyslipidemia, coronary artery disease, heart failure, chronic obstructive pulmonary disease, liver cirrhosis, malignancy, arrhythmia, and a history of amputation among patients with ESRD, we included these comorbidities in our analysis.
To better meet the clinical needs of patients with ESRD, we expanded the range of biochemical laboratory data. This included comprehensive assessments of total protein, albumin, liver enzymes (aspartate aminotransferase and alanine aminotransferase), alkaline phosphatase [], total bilirubin, lipid profiles, glucose levels, complete blood count, iron studies, aluminum levels, postdialysis weight, uric acid, and key electrolytes. To address the specific needs of patients with ESRD, we further measured calcium and phosphate metabolism indicators, such as calcium and phosphate levels, urea kinetics (Kt/V), parathyroid hormone levels, and the calcium-phosphate product, to more effectively manage mineral and bone disorders in these patients.
The medication history was also thoroughly documented, particularly focusing on drugs commonly used in the management of ESRD, such as phosphate binders, calcitriol, and other treatments relevant to the patients’ condition.
This study included 412 patients. The mean age was 69.19 years, with older patients more likely to experience MACE (70.94 years vs 67.96 years, P=.01). Females comprised 46.6% of the cohort, with a lower proportion in the MACE group, although this difference was not statistically significant (P=.08). Aortic valve calcification was more prevalent in the MACE group (72.6% vs 56.1%, P=.005), and AS as well as certain types of LVH were also associated with the occurrence of MACE. Diabetes mellitus (DM) and peripheral arterial occlusive disease (PAOD) were more common among patients who experienced MACE (DM: 58.2% vs 40.9%; P=.001; PAOD: 34.1% vs 21.9%; P=.008). In terms of medication use, a higher proportion of patients in the MACE group were on insulin (P<.001) and antiplatelet drugs (P<.001). These results provide an overview of the baseline characteristics of the patients and offer important references for improving the accuracy of predictive models.
Data Preparation

The data preprocessing methodology in this study is organized into 2 primary components: variable encoding and data balancing. The overall analytical workflow and preprocessing framework are illustrated in Figure 2. This integrated approach is designed to enhance the quality and representativeness of clinical datasets, thereby improving the robustness and generalizability of subsequent predictive models.
Figure 2. Research analysis process framework diagram. Preprocessing components are fit on training folds only and then applied to validation folds; ROSE or SMOTE are applied to training folds only to prevent leakage.

Data Imputation

Missing data are a common challenge in clinical datasets and can significantly compromise the validity of statistical inferences if not appropriately addressed. In this study, we applied a nonparametric multiple imputation strategy implemented in the missForest package in R (R Core Team) []. This method uses random forest models to iteratively predict missing values based on observed data, effectively capturing complex nonlinear associations and interactions between variables. Unlike simpler imputation techniques such as mean substitution or k-nearest neighbors, missForest has been shown to yield more accurate and less biased estimates for both continuous and categorical variables, particularly in mixed-type medical data [,]. This approach offers a robust and flexible foundation for downstream ML analysis while preserving the integrity of the original dataset.
Compared to SMOTE, which generates new samples through linear interpolation between neighboring minority class points, ROSE applies a smoothed bootstrap technique. It estimates a kernel density around each minority instance and samples new points from this local distribution. This approach allows ROSE to preserve the original variance and nonlinear structure of the data, making it particularly effective for clinical datasets where preserving subtle distributional patterns is important.
Imputation models were fit on the training fold only and then applied to the corresponding validation fold within each split. No statistics from the validation fold were used to fit the imputation model. This fold-wise procedure was repeated across all folds to prevent information leakage.
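This fit-on-train, apply-to-validation discipline can be made concrete with a minimal sketch (in Python for illustration; the study itself used missForest in R, and simple per-column mean imputation stands in here for the random forest imputer, with toy data of our own):

```python
# Fold-wise imputation: learn statistics on the training fold only,
# then apply them to the validation fold, so no validation information
# leaks into the imputation model.

def fit_imputer(train_rows):
    """Learn per-column means from the training fold only."""
    n_cols = len(train_rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in train_rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return means

def apply_imputer(rows, means):
    """Fill missing entries with statistics learned on the training fold."""
    return [[means[j] if v is None else v for j, v in enumerate(row)]
            for row in rows]

train_fold = [[1.0, 2.0], [3.0, None], [5.0, 6.0]]
valid_fold = [[None, 8.0]]

means = fit_imputer(train_fold)                   # fit on training fold
valid_imputed = apply_imputer(valid_fold, means)  # apply to validation fold
```

Repeating this for every fold reproduces the leakage-free procedure described above, independent of which imputation model is used.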
Variable Encoding and Expansion

In clinical datasets, categorical variables are abundant and require conversion into numerical formats to be effectively used in statistical and ML models. Three encoding methods were implemented to address this challenge, each offering a unique balance between preserving information and managing computational complexity. Although all 3 approaches aim to transform qualitative data into quantitative representations, they differ in their operational mechanisms and associated trade-offs.
One-hot encoding converts each categorical variable into a series of binary indicators, where each category is represented by an individual binary feature. This method maintains the inherent nonordinal nature of the original variable but can lead to a substantial increase in dimensionality, particularly when the variable in question has many unique categories. In contrast, target encoding substitutes each category with a statistical summary such as the mean, weighted mean, or smoothed mean of the target variable computed from the training dataset. This not only reduces dimensionality but also encapsulates the predictive relationship between the categorical feature and the outcome variable, although it necessitates careful handling to avoid target leakage []. A third approach, frequency encoding, assigns to each category a numerical value based on its relative frequency within the dataset. This method is highly efficient in reducing computational burden and memory usage, as it compresses categorical information into a single continuous variable without imposing any artificial order [].
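To illustrate how the three encoders differ mechanically, the following sketch (Python for illustration; the study used R, and the toy column and outcome are ours) derives all three representations from a single categorical column:

```python
from collections import Counter

categories = ["A", "B", "A", "C", "B", "A"]   # toy categorical column
target = [1, 0, 1, 0, 1, 0]                   # toy binary outcome

# One-hot encoding: one binary indicator column per category level.
levels = sorted(set(categories))
one_hot = [[int(c == lvl) for lvl in levels] for c in categories]

# Frequency encoding: each category -> its relative frequency.
counts = Counter(categories)
freq = [counts[c] / len(categories) for c in categories]

# Target encoding: each category -> mean outcome among its rows
# (in practice computed on training data only, to avoid target leakage).
means = {lvl: sum(t for c, t in zip(categories, target) if c == lvl) / counts[lvl]
         for lvl in levels}
target_enc = [means[c] for c in categories]
```

Note how one-hot expands one column into three, while frequency and target encoding each compress the same information into a single continuous column.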
To combine the complementary strengths of these methods, a unified encoding strategy was adopted. Specifically, the one_hot function from the mltools package in R was applied to perform one-hot encoding, which expanded the original set of 83 variables to 113. This expansion increased the granularity of the dataset, facilitating a more detailed representation of the clinical phenomena under study. The inclusion of target and frequency encoding provided an additional layer of comparison, enabling an evaluation of their relative performance under conditions of significant missingness and class imbalance. Prior work has demonstrated that, particularly in datasets with high proportions of missing data and imbalanced classes, preprocessing methods based on one-hot encoding can significantly enhance both accuracy and robustness in classification tasks [].
All categorical encoders were fit on the training fold only and then applied to its validation fold. Target encoding used an out-of-fold smoothing scheme: for each fold, category means and smoothing weights were computed from the training fold and then mapped to the validation fold. No target information from the validation fold was used to compute encodings.
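A minimal sketch of such out-of-fold smoothed target encoding follows (Python for illustration; the additive smoothing weight m and function name are our assumptions, and the study's exact smoothing formula may differ):

```python
def smoothed_target_encoding(train_cats, train_y, valid_cats, m=10.0):
    """Fit smoothed category means on the training fold, then map them
    onto the validation fold; no validation targets are used."""
    global_mean = sum(train_y) / len(train_y)
    sums, counts = {}, {}
    for c, y in zip(train_cats, train_y):
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    # Each category mean is shrunk toward the global mean by weight m.
    encode = {c: (sums[c] + m * global_mean) / (counts[c] + m) for c in counts}
    # Categories unseen in the training fold fall back to the global mean.
    return [encode.get(c, global_mean) for c in valid_cats]

valid_enc = smoothed_target_encoding(["A", "A", "B"], [1, 0, 1], ["A", "C"], m=1.0)
```

The shrinkage term keeps rare categories from being encoded with noisy means, and the fallback handles categories that appear only in the validation fold.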
Data Imbalance

Clinical datasets often exhibit imbalanced class distributions, where the minority class, despite its clinical significance, is underrepresented. Such imbalance can lead to biased models that disproportionately favor the majority class. To counteract this, 2 complementary oversampling strategies were incorporated and evaluated: the SMOTE and the ROSE method [].
The ROSE method uses a bootstrap resampling framework augmented by kernel density estimation to generate synthetic samples for the minority class. This approach avoids the pitfalls associated with simply duplicating minority samples, offering a more nuanced correction of the class distribution. Menardi and Torelli [] provide an extensive discussion of these resampling techniques, emphasizing their utility in balancing datasets for binary classification tasks. On the other hand, SMOTE, as introduced by Chawla et al [], synthesizes new minority class instances by interpolating between existing samples. This technique enriches the minority class by generating additional, diverse examples, thereby improving the model’s ability to capture the characteristics of rare events. Empirical studies have shown that SMOTE can lead to marked improvements in performance metrics, such as the area under the ROC curve (AUC), though it may overgeneralize when faced with extreme imbalance.
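The mechanical difference between the two samplers can be shown in a one-dimensional toy example (Python for illustration; the function names and Gaussian kernel bandwidth are our assumptions, not the ROSE package's defaults):

```python
import random

def smote_sample(minority, k=2, rng=random.Random(0)):
    """SMOTE: pick a minority point and interpolate linearly toward one
    of its k nearest minority neighbors (1-D toy version)."""
    x = rng.choice(minority)
    neighbors = sorted(minority, key=lambda v: abs(v - x))[1:k + 1]
    n = rng.choice(neighbors)
    return x + rng.random() * (n - x)   # point on the segment from x to n

def rose_sample(minority, bandwidth=0.5, rng=random.Random(0)):
    """ROSE: smoothed bootstrap - resample a minority point, then draw
    from a Gaussian kernel centered on it."""
    x = rng.choice(minority)
    return rng.gauss(x, bandwidth)

minority = [1.0, 1.2, 1.5, 2.0]
s = smote_sample(minority)   # always lies between existing minority points
r = rose_sample(minority)    # may fall slightly outside their range
```

SMOTE's samples are confined to segments between existing minority points, whereas ROSE's kernel sampling can place new points anywhere near the observed distribution, which is why it better preserves the original variance.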
Comparative evaluations in the literature further underscore the respective merits of these techniques. For instance, Kamalov et al [] found that ROSE, despite its relative simplicity, remains a stable and computationally efficient solution, especially in multi-label contexts. Similarly, investigations by Gnip et al [] and Nguyen et al [] have confirmed that both ROSE and SMOTE are effective in mitigating the adverse effects of class imbalance, with the optimal choice being contingent upon the specific characteristics of the dataset and the available computational resources.
Integration of Preprocessing Components

The overall preprocessing workflow was designed to integrate the 2 components, variable encoding and data balancing, into a coherent sequence. Initially, missing values were imputed using the random forest approach, ensuring that the dataset was complete and reliable. This was followed by the transformation of categorical variables via the combined encoding strategy, which not only translated qualitative data into numerical features but also expanded the feature set to enhance data granularity. Finally, class imbalance was addressed through the application of ROSE and SMOTE, thereby ensuring that the resulting dataset was well-suited for the development of predictive models.
Each preprocessing step was carefully implemented so that subsequent operations built upon an increasingly refined version of the dataset. The robust imputation stage preserved the original data’s distributional properties, while the encoding procedures facilitated the construction of a rich, multidimensional feature space. The balancing techniques further adjusted the dataset to prevent bias toward the majority class, enabling the models to more effectively capture the subtleties of clinically significant, albeit rare, events.
Methodological Rationale and Literature Justification

The methodological choices articulated above are firmly grounded in the existing body of literature. The use of a random forest–based imputation method is well-supported by studies demonstrating its efficacy in preserving data structure and ensuring the validity of statistical inferences [,]. Similarly, the comparative evaluation of encoding methods draws on prior research that highlights the trade-offs between dimensionality, computational efficiency, and the risk of target leakage [,]. Moreover, the advantages of one-hot encoding in contexts marked by high missingness and class imbalance have been substantiated by empirical investigations []. The incorporation of data balancing strategies, including ROSE and SMOTE, is equally well-documented, with seminal works establishing their effectiveness in correcting class imbalances [,], and more recent studies further validating these methods [-].
The data preprocessing methodology presented herein represents a rigorous and systematic approach to overcoming the multifaceted challenges inherent in clinical datasets. By sequentially addressing missing data, encoding categorical variables into a more informative numerical format, and correcting class imbalances, the workflow transforms raw clinical data into a format that is both analytically robust and amenable to predictive modeling. This integrated preprocessing pipeline is instrumental in bridging the gap between the complexities of clinical data and the demands of advanced statistical and ML techniques, ultimately contributing to the development of models that are both reliable and clinically pertinent.
Class rebalancing was applied only to the training portion of each fold. Validation folds preserved the original class distribution. For transparency and reproducibility, we report resampling parameters in Table S4 in , including SMOTE and ROSE sampling ratios and the random seed used.
Data Segmentation

k-fold cross-validation is a commonly used resampling technique for evaluating the performance of ML models. Specifically, the dataset is divided into k subsets (folds), and in each iteration, one subset is used as the validation set while the remaining k-1 subsets are used as the training set. This process is repeated k times, with a different subset used for validation each time.
This method effectively prevents overfitting and provides a more robust evaluation of the model. It is particularly useful for estimating the generalization error in small datasets [].
In this study, we used 5-fold cross-validation to obtain an accurate estimate of model performance through multiple splits and evaluations while reducing bias caused by data partitioning.
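For concreteness, a minimal sketch of how such k-fold index splits can be generated (Python for illustration; contiguous, unshuffled folds, which is a simplification of typical stratified or randomized splitting):

```python
def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k folds; each fold serves once as the
    validation set while the remaining k-1 folds form the training set."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    splits, start = [], 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = [i for i in range(n) if i not in valid]
        splits.append((train, valid))
        start += size
    return splits

splits = k_fold_indices(10, k=5)   # 5 (train, validation) index pairs
```

Every observation appears in exactly one validation fold, so each model is always evaluated on data it was not trained on.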
Robust Model Engineering: Tuning, Processing, and Risk Mitigation

All preprocessing and modeling steps, including cleaning, standardization, encoding, resampling, model fitting, and hyperparameter tuning, were executed separately within each dataset. No features, encoders, parameters, or statistics were transferred across datasets. For each model, hyperparameters were tuned by grid search with 5-fold cross-validation on the training set. A single predefined hyperparameter search space and evaluation criterion, detailed in Table S1 in , was used for both ESRD and BRFSS. Grid search and model selection were rerun independently in each dataset so that optimal hyperparameters were learned within ESRD and BRFSS separately and were never reused across datasets.
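The grid search procedure amounts to an exhaustive loop over parameter combinations with fold-averaged scoring; a skeletal sketch follows (Python for illustration; the fit and score callables here are toy placeholders, not the study's actual training code):

```python
from itertools import product

def grid_search_cv(folds, param_grid, fit, score):
    """Exhaustive grid search with cross-validation: for each parameter
    combination, average the validation score over all folds and keep
    the best-scoring combination."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        cv = [score(fit(train_idx, params), valid_idx)
              for train_idx, valid_idx in folds]
        mean = sum(cv) / len(cv)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score

# Toy stand-ins: `fit` just returns the params, `score` prefers max_depth=3.
fit = lambda train_idx, params: params
score = lambda model, valid_idx: -abs(model["max_depth"] - 3)
folds = [(None, None)] * 5
best, best_mean = grid_search_cv(folds, {"max_depth": [1, 3, 5]}, fit, score)
```

Rerunning this loop independently per dataset, as described above, is what guarantees that hyperparameters are never shared across ESRD and BRFSS.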
Model Building and Validation: Phase 1

In Phase 1, we compared alternative encoding and class-imbalance handling strategies for predicting MACE. Traditional logistic regression and 5 ML models were evaluated: decision trees [], random forests [,], XGBoost [], CatBoost [], and LightGBM []. All models were trained and evaluated using 5-fold cross-validation [] within each dataset, following the generic engineering framework described in the previous subsection, that is, grid search hyperparameter tuning with the shared search space in Table S1 in and dataset-specific model selection.
The workflow, including preprocessing, resampling, and hyperparameter tuning, was then rerun independently on the BRFSS 2015 dataset to assess cross-dataset portability of the pipeline rather than to externally validate a single ESRD-trained model. All training was executed on a workstation equipped with an Intel Core i9 10th-generation CPU (3.3 GHz) and 64 GB RAM, using the CPU only. Average runtimes for each pipeline are reported in and to help readers gauge computational cost. In this phase, multiple performance metrics were used, including accuracy, sensitivity, specificity, precision, F1-score, and AUC. The mean and SD of these metrics were calculated across folds to assess predictive accuracy, robustness, and consistency. The overall workflow is summarized in algorithm S1 in .
Table 2. Evaluation metrics of 5-fold cross-validation using all encoding methods and imbalanced data processing methods.

a AUC: area under the ROC curve.
b SMOTE: Synthetic Minority Over-sampling Technique.
c RF: random forest.
d CAT: CatBoost.
e LightGBM: light gradient boosting machine.
f ROSE: Random Over Sampling Examples.
g This method has the highest model accuracy and a small standard deviation, indicating a relatively good ability to make judgments.
Table 3. Evaluation metrics of remodeling with the top 15 most important variables, using one-hot encoding and SMOTE or ROSE.

ETL and ML method | Accuracy, mean (SD) | Accuracy, 95% CI | F1-score, mean (SD) | F1-score, 95% CI | AUC, mean (SD) | AUC, 95% CI | Runtime (sec)
SMOTE
LGR      | 0.723 (0.045) | 0.668-0.779 | 0.680 (0.050) | 0.618-0.743 | 0.728 (0.044) | 0.673-0.783 | 0.08
DT       | 0.636 (0.035) | 0.592-0.680 | 0.547 (0.073) | 0.457-0.637 | 0.622 (0.044) | 0.567-0.676 | 2.97
RF       | 0.699 (0.027) | 0.665-0.733 | 0.675 (0.027) | 0.642-0.709 | 0.711 (0.020) | 0.687-0.736 | 9.79
XGB      | 0.658 (0.021) | 0.632-0.684 | 0.630 (0.052) | 0.566-0.694 | 0.670 (0.024) | 0.641-0.700 | 2.59
CAT      | 0.699 (0.044) | 0.644-0.754 | 0.703 (0.027) | 0.669-0.737 | 0.737 (0.021) | 0.711-0.763 | 93.8
LightGBM | 0.709 (0.030) | 0.671-0.747 | 0.682 (0.045) | 0.626-0.738 | 0.717 (0.034) | 0.675-0.760 | 14.4
ROSE
LGR      | 0.738 (0.030) | 0.700-0.776 | 0.699 (0.050) | 0.636-0.762 | 0.755 (0.050) | 0.694-0.817 | 0.08
DT       | 0.755 (0.089) | 0.644-0.866 | 0.703 (0.126) | 0.547-0.860 | 0.750 (0.100) | 0.626-0.875 | 2.96
RF       | 0.891 (0.131) | 0.728-1.000 | 0.879 (0.133) | 0.713-1.000 | 0.915 (0.133) | 0.750-1.000 | 9.05
XGB      | 0.913 (0.122) | 0.762-1.000 | 0.899 (0.138) | 0.728-1.000 | 0.906 (0.137) | 0.736-1.000 | 3.28
CAT      | 0.862 (0.073) | 0.771-0.952 | 0.831 (0.092) | 0.717-0.945 | 0.896 (0.092) | 0.782-1.000 | 270
LightGBM | 0.906 (0.115) | 0.763-1.000 | 0.892 (0.119) | 0.745-1.000 | 0.925 (0.112) | 0.786-1.000 | 19.2

a AUC: area under the ROC curve.
b SMOTE: Synthetic Minority Over-sampling Technique.
c LGR: logistic regression.
d DT: decision tree.
e RF: random forest.
f XGB: extreme gradient boosting.
g CAT: CatBoost.
h LightGBM: light gradient boosting machine.
i ROSE: random over sampling example.
j This method's model results have the highest AUC and a small standard deviation, indicating a relatively good ability to make judgments.
Performance Estimation and Statistical Testing

Within each dataset, model performance was summarized from 5-fold cross-validation. For each metric (AUC, accuracy, and F1-score), we calculated 95% CIs from the 5 fold-specific estimates using a 2-sided Student t interval. This procedure corresponds to applying the t.test function in R to the vector of fold-level values. Because these metrics are bounded between 0 and 1, upper limits were truncated at 1.0 when necessary, and models with zero variance across folds yield a point interval at the mean.
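The fold-level t-interval computation, including the truncation and zero-variance rules described above, can be sketched as follows (Python for illustration; the study used R's t.test, and the toy fold values are ours):

```python
import statistics

def t_ci_95(fold_values, t_crit=2.776):
    """Two-sided 95% Student t interval from fold-level metric estimates.
    t_crit = 2.776 is the 0.975 quantile of the t distribution with 4 df
    (5 folds); limits are clipped to [0, 1] for metrics bounded by 1."""
    mean = statistics.mean(fold_values)
    if len(set(fold_values)) == 1:      # zero variance -> point interval
        return mean, mean
    half = t_crit * statistics.stdev(fold_values) / len(fold_values) ** 0.5
    return max(mean - half, 0.0), min(mean + half, 1.0)

auc_folds = [0.90, 0.95, 0.88, 0.93, 0.99]   # toy fold-level AUCs
lo, hi = t_ci_95(auc_folds)
```

With only 5 folds the t critical value is large, which is why the reported intervals in Tables 2 and 3 are wide whenever the fold-to-fold SD is nontrivial.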
For pairwise AUC comparisons within the same dataset, we applied the nonparametric DeLong test to out-of-fold predictions pooled across the 5 validation folds, using the roc.test function from the pROC package. We report 2-sided P values; all tests are conducted within the dataset, and we do not compute pooled AUCs or conduct between-dataset hypothesis tests.
Model Refinement and Variable Selection: Phase 2

In the second phase, the AUC [] was used to evaluate the models from the first phase. For the optimal data processing method and the corresponding best-performing ML model, a detailed scoring and ranking of variables was conducted based on their importance and contribution to the model's predictive performance.
The variables were converted into percentile rankings, with the most important variable assigned a score of 100, while the least important variable was assigned a score of 0. Using the 5-fold cross-validation approach, the scores from 5 iterations were summed and ranked, with the highest possible score being 500. Based on this scoring system, the top 15 highest-ranked variables were selected for remodeling, allowing for a more focused and in-depth analysis of their impact on the outcomes.
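This scoring scheme can be sketched as follows (Python for illustration; toy importances of our own, with 2 folds and 4 variables instead of the study's 5 folds and full variable set):

```python
def percentile_scores(importances):
    """Convert raw importances into 0-100 percentile-style scores:
    the most important variable gets 100, the least important 0
    (ties are ignored in this toy version)."""
    order = sorted(range(len(importances)), key=lambda i: importances[i])
    scores = [0.0] * len(importances)
    for rank, idx in enumerate(order):
        scores[idx] = 100.0 * rank / (len(importances) - 1)
    return scores

# Sum the per-fold scores for each variable (with 5 folds the maximum is
# 500, as in the study), then keep the top 15 variables for remodeling.
fold_importances = [[0.9, 0.1, 0.5, 0.3], [0.8, 0.2, 0.6, 0.4]]
totals = [sum(v) for v in
          zip(*(percentile_scores(f) for f in fold_importances))]
top = sorted(range(len(totals)), key=lambda i: -totals[i])[:15]
```

Summing percentile scores across folds rewards variables that rank consistently high rather than those that spike in a single fold.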
Ethical Considerations

This study was reviewed and approved by the Institutional Review Board of Shin-Kong Wu Ho-Su Memorial Hospital (IRB No. 20231101R; approval date: December 14, 2023). The requirement for informed consent was waived by the ethics committee due to the retrospective nature of the study and the use of deidentified data. All procedures were conducted in accordance with the ethical standards of the responsible institutional and national committees on human experimentation and with the principles of the Declaration of Helsinki. The privacy and confidentiality of all participants were strictly protected throughout the study, and no personally identifiable information was disclosed. No compensation was provided to participants.
In this phase, we systematically evaluated the predictive performance of multiple ML models, including logistic regression, decision trees, random forests, XGBoost, CatBoost, and LightGBM, using 5-fold cross-validation. The primary objective was to determine the most effective combination of encoding methods (one-hot encoding, frequency encoding, and target encoding) and data imbalance handling techniques (ROSE and SMOTE) for predicting MACE.
Models were evaluated by accuracy, F1-score, and AUC with 5-fold cross-validation; Table 2 reports the fold mean, SD, and t-distribution–based 95% CIs for each combination. Among all pipelines, the OneHotE_ROSE–LightGBM model achieved the best overall performance, with a mean accuracy of 0.932 (SD 0.112; 95% CI 0.759‐1.000), F1-score of 0.918 (SD 0.137; 95% CI 0.754‐1.000), and AUC of 0.940 (SD 0.116; 95% CI 0.794‐1.000). Frequency and target encoding under ROSE also performed strongly (AUC 0.913 and 0.900, respectively), but at a slightly lower level than one-hot encoding in the same setting.
To formally assess differences between preprocessing strategies, we performed pairwise DeLong tests on out-of-fold AUCs, with results summarized in Table S2 in . Most ROSE pipelines showed significantly higher AUC than their SMOTE counterparts (all P≤.002), and OneHotE_ROSE in particular was markedly superior to OneHotE_SMOTE. Within the ROSE group, OneHotE_ROSE, FreqE_ROSE, and TargetE_ROSE did not differ significantly from each other (P≥.67), indicating a top-performing cluster. Together with the averaged ROC curves in Figures 3-5, these results support selecting one-hot encoding with ROSE as the primary preprocessing strategy for Phase 2.
Figure 3. Comparison of the average area under the ROC curve (AUC) of different data models. AUC: area under the ROC curve; ROSE: Random Over Sampling Example; SMOTE: Synthetic Minority Over-sampling Technique.

A direct comparison of encoding methods in Figure 4 further confirmed that one-hot encoding was the most effective approach, achieving the highest average AUC (0.78), outperforming both frequency encoding and target encoding (both at 0.67). Meanwhile, Figure 5 illustrates that ROSE significantly outperformed SMOTE, achieving an average AUC of 0.83, compared to 0.59 for SMOTE. This result reinforces the conclusion that ROSE generates more representative synthetic samples, preserves the original data distribution more effectively, and enhances model generalization.
Our findings strongly suggest that one-hot encoding combined with ROSE is the most effective preprocessing strategy for this predictive task. This combination preserves categorical feature integrity while mitigating class imbalance, leading to the most robust and accurate predictive models. In contrast, SMOTE not only struggled to enhance model performance but, in some cases, even contributed to its degradation. These insights guided the selection of the optimal model for further feature importance analysis in Phase 2.
Figure 4. Comparison of the average area under the ROC curve (AUC) of different encoding methods. AUC: area under the ROC curve.
Figure 5. Comparison of the average area under the ROC curve (AUC) of different imbalance methods. AUC: area under the ROC curve; ROSE: Random Over Sampling Example; SMOTE: Synthetic Minority Over-sampling Technique.

Phase 2

Following the identification of the optimal preprocessing strategy in Phase 1, we conducted a detailed analysis of feature importance and remodeled the dataset using the most influential variables. presents the top 15 most important variables ranked based on their cumulative scores from 5-fold cross-validation, with antiplatelet, chest X-ray for aortic arch calcification (CXR.AoAC.0), and insulin emerging as the most influential variables.