A machine learning-based screening model for the early detection of prostate cancer developed using serum microRNA data from a mixed cohort of 8,741 participants

3.1 Research samples

To give the serum miRNA-based PCa screening model good accuracy, generalizability, and clinical utility, we built a large-scale mixed sample cohort. The samples were divided into four cohorts: PCa, OCa (consisting of 10 subtypes), BPD, and HP. During data preprocessing, samples without a specific disease category were removed. The data set was randomly divided at a ratio of 4:1 into a model construction set (n = 6962) and a validation set (n = 1741). The model construction set was then randomly divided into a training set (n = 5569) and a testing set (n = 1393). Candidate miRNA features for constructing the screening model were obtained from the training set data (Table 1). The model was validated using the testing, validation, and external sets (Fig. 1A).
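The reported set sizes are consistent with two successive random 4:1 splits of the 8,703 internal samples (5,569 + 1,393 + 1,741) when the smaller part is rounded up. The following is a minimal sketch of that arithmetic only; the function name `split_4_to_1`, the seeds, and the integer sample IDs are hypothetical, and the source does not state which software performed the split.

```python
import random

def split_4_to_1(items, seed=0):
    """Randomly split items 4:1; the smaller part is rounded up (an assumption)."""
    items = list(items)
    rng = random.Random(seed)
    rng.shuffle(items)
    n_small = (len(items) + 4) // 5  # ceil(n / 5) using integer arithmetic
    return items[n_small:], items[:n_small]

# Hypothetical integer IDs standing in for the internal samples.
internal = range(5569 + 1393 + 1741)  # 8,703 samples
construction, validation = split_4_to_1(internal, seed=1)
training, testing = split_4_to_1(construction, seed=2)

print(len(construction), len(validation))  # 6962 1741
print(len(training), len(testing))         # 5569 1393
```

Rounding the smaller part up is what makes both splits match the reported counts; rounding to the nearest integer would give a testing set of 1,392.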

Table 1 Basic characteristics of the mixed sample cohort used in this study

Fig. 1

Selection of candidate miRNAs for PCa screening. A Workflow for establishing the screening model from the PCa, OCa, BPD, and HP cohorts. B Principal component analysis map using the 18 miRNA features. C Hierarchical clustering heatmap of the 18 miRNA features. PCa, prostate cancer; OCa, other cancers; BPD, benign prostatic disease; HP, healthy participants; NCa, non-cancer; NPCa, non-prostate cancer

3.2 Selection of miRNAs as candidate markers for PCa screening

Based on the training set data, the MRMR, IV, and LASSO algorithms were applied to screen candidate features that separate PCa from NPCa samples. The top 10 miRNA features ranked by each algorithm were combined, yielding 18 candidate miRNA features. In the training set, principal component analysis (Fig. 1B) and unsupervised hierarchical clustering with a heatmap (Fig. 1C) were used, respectively, to project the 18 miRNA features into lower dimensions and to examine differences in their expression levels. The 18 miRNA features separated the PCa and NPCa populations well but did not separate the PCa and OCa populations. The expression levels of the 18 miRNAs differed significantly between PCa and NPCa (Fig. S1).
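Combining the top 10 features from three rankings gives between 10 and 30 distinct features, depending on how much the rankings overlap; 18 features implies substantial overlap. A minimal sketch of this union step follows, assuming each algorithm yields one importance score per feature. The feature names and scores are made up for illustration and are not the study's data.

```python
def top_k(scores, k=10):
    """Return the k feature names with the highest scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:k]]

def union_of_top_k(score_tables, k=10):
    """Union the top-k features of several rankings, keeping first-seen order."""
    selected = []
    for scores in score_tables:
        for name in top_k(scores, k):
            if name not in selected:
                selected.append(name)
    return selected

# Toy scores for 30 hypothetical features (not real miRNA data).
features = [f"feat_{i:02d}" for i in range(30)]
mrmr = {f: 30 - i for i, f in enumerate(features)}
iv = {f: 30 - abs(i - 8) for i, f in enumerate(features)}
lasso = {f: 30 - abs(i - 14) for i, f in enumerate(features)}

candidates = union_of_top_k([mrmr, iv, lasso], k=10)
print(len(candidates), candidates)
```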

3.3 Identify the optimal miRNA combination and optimal machine learning model for early PCa screening

Based on the 18 candidate miRNAs, six machine learning algorithms (KNN, SVC, XGBoost, RFC, LR, and AdaBoost) were combined with cross-validation to build screening models containing 1–6 miRNAs. Figure 2A–F shows the mean AUC, SEN, SPE, ACC, PPV, and NPV of these models. The AUC (Fig. 2A), ACC (Fig. 2D), and PPV (Fig. 2E) of the four-miRNA models were better than those of the three-miRNA models, whereas in the training set the five-miRNA models did not outperform the four-miRNA models (Fig. 2A–F). A PCa screening model built on four miRNA features therefore offers excellent predictive efficiency with a minimal feature set.

With 18 candidate miRNAs, there are 3,060 possible four-miRNA combinations. Models were built for all 3,060 combinations with each of the six machine learning methods, and their ACC, AUC, SEN, SPE, PPV, and NPV were evaluated on the testing set. A comprehensive comparison of all models showed that the combination of miRNA-1290, miRNA-6777-5p, miRNA-1343-3p, and miRNA-6836-3p had the best predictive efficacy. Five-fold cross-validation showed that models built on these four miRNA features with each of the six machine learning algorithms achieved stable and excellent screening performance in the testing set. After comparing the AUC (Fig. 2G), SEN (Fig. 2H), SPE (Fig. 2I), ACC (Fig. 2J), PPV (Fig. 2K), and NPV (Fig. 2L) of all machine learning models, we selected the AdaBoost model using miRNA-1290, miRNA-6777-5p, miRNA-1343-3p, and miRNA-6836-3p as biomarkers as the best screening model for PCa (PCa4miR model).

The AUC values in the training set (Fig. 3A), testing set (Fig. 3B), and validation set (Fig. 3C) were 0.999 (95% CI: 0.999–1.000), 0.972 (95% CI: 0.956–0.985), and 0.981 (95% CI: 0.973–0.987), respectively. The ACC, AUC, SEN, SPE, PPV, and NPV of the PCa4miR model in the training (Fig. 3D), testing (Fig. 3E), and validation sets (Fig. 3F) are visualized using radar charts, and the specific values are shown in Table S1. Heatmap clustering analysis illustrates the expression levels of the four miRNAs in the training set (Fig. 3G), testing set (Fig. 3H), and validation set (Fig. 3I). Scatter plots indicate that miRNA-1290, miRNA-1343-3p, and miRNA-6777-5p are highly expressed in PCa patients, whereas miRNA-6836-3p is expressed at a lower level in PCa patients, in the training set (Fig. 3J), testing set (Fig. 3K), and validation set (Fig. 3L). In the decision curve analysis (DCA), the PCa4miR model showed a higher net benefit than the individual miRNAs miRNA-1290, miRNA-6777-5p, miRNA-1343-3p, and miRNA-6836-3p across a wide range of decision threshold probabilities in the training set (Fig. 3M), testing set (Fig. 3N), and validation set (Fig. 3O). Therefore, the PCa4miR model has excellent screening performance and stable predictions across data sets.
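The figure of 3,060 four-miRNA combinations is the binomial coefficient C(18, 4). The short sketch below checks this count and enumerates the panels using the Python standard library; the placeholder names `miRNA_01` to `miRNA_18` are hypothetical stand-ins for the 18 candidates, which are not all listed in this section.

```python
from itertools import combinations
from math import comb

# Placeholder names for the 18 candidate miRNAs (hypothetical).
candidates = [f"miRNA_{i:02d}" for i in range(1, 19)]

# Number of possible panels for each panel size from 1 to 6.
for k in range(1, 7):
    print(k, comb(len(candidates), k))

# Enumerate every four-miRNA panel; each would be fitted with six algorithms.
panels = list(combinations(candidates, 4))
n_algorithms = 6
print(len(panels), len(panels) * n_algorithms)  # 3060 18360
```

The count grows quickly with panel size (8,568 panels of five, 18,564 of six), so a smaller panel also keeps an exhaustive search tractable.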

Fig. 2

Identification of the best screening model. Histogram of the mean A AUC, B SEN, C SPE, D ACC, E PPV and F NPV of screening models established based on 1–6 miRNAs using six machine learning algorithms. Different colored dots represent different machine learning algorithms. Histogram of the mean G AUC, H SEN, I SPE, J ACC, K PPV and L NPV of the fivefold cross-validation results of the models built by six machine learning algorithms based on the four miRNA features miRNA-1290, miRNA-6777-5p, miRNA-1343-3p and miRNA-6836-3p. Different colored dots represent different cross-validation results. KNN, K-Nearest Neighbor; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting; RFC, Random Forest classifier; LR, Logistic Regression; AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy; PPV, positive predictive value; NPV, negative predictive value
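The threshold-based metrics named in this caption (ACC, SEN, SPE, PPV, NPV) follow from the four cells of a binary confusion matrix. A minimal sketch with the standard definitions is given here for reference; the function name `screening_metrics` and the toy labels are illustrative assumptions, not the study's code or data.

```python
def screening_metrics(y_true, y_pred):
    """Compute ACC, SEN, SPE, PPV and NPV from binary labels (1 = PCa, 0 = NPCa)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "SEN": ratio(tp, tp + fn),
        "SPE": ratio(tn, tn + fp),
        "PPV": ratio(tp, tp + fp),
        "NPV": ratio(tn, tn + fn),
    }

# Toy labels: 4 PCa and 6 NPCa samples (made up).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
metrics = screening_metrics(y_true, y_pred)
print(metrics)
```

PPV and NPV depend on the proportion of PCa samples in the set being evaluated, so they are not directly comparable between sets with different case mixes.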

Fig. 3

Effectiveness of the PCa4miR model. ROC curves of the PCa4miR model in the A training set, B testing set and C validation set. Radar charts summarizing the ability of the PCa4miR model to recognize PCa in the D training set, E testing set and F validation set, as determined by ACC, AUC, SEN, SPE, PPV and NPV. Heatmaps of miRNA-1290, miRNA-6777-5p, miRNA-1343-3p and miRNA-6836-3p in the G training set, H testing set and I validation set. The levels of miRNA-1290, miRNA-6777-5p, miRNA-1343-3p and miRNA-6836-3p in the PCa group and NPCa group in the J training set, K testing set and L validation set. DCA showing the difference in net benefit between the PCa4miR model and the serum biomarkers across a wide range of decision threshold probabilities in the M training set, N testing set and O validation set. AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy; PPV, positive predictive value; NPV, negative predictive value

3.4 Discrimination of PCa and OCa, BPD, HP using the PCa4miR model

To evaluate the PCa4miR model’s capability to distinguish PCa from OCa, BPD, and HP, the discrimination index was calculated for each disease category in the training set (Fig. 4A), testing set (Fig. 4B), and validation set (Fig. 4C). Samples with a discrimination index ≥0.5 were classified as PCa, while those with an index <0.5 were classified as non-PCa. The results demonstrated that the PCa4miR model effectively differentiates PCa from the other 10 tumor types, as well as from BPD and HP. The screening ACC for each group, summarized in radar charts (Fig. 4D–F, Table S2), highlights the model’s strong performance in distinguishing PCa from OCa, BPD, and HP.
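The 0.5 cutoff and the per-group accuracy can be illustrated with a short sketch. It assumes the discrimination index is a model score between 0 and 1, and that a PCa sample counts as correct when called PCa while a sample from any other group counts as correct when called non-PCa. The helper names and the toy records are hypothetical.

```python
from collections import defaultdict

def classify(index, threshold=0.5):
    """Call a sample PCa when its discrimination index reaches the threshold."""
    return "PCa" if index >= threshold else "NPCa"

def accuracy_by_group(records, threshold=0.5):
    """records: iterable of (group, index). Returns accuracy per disease group."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, index in records:
        expected = "PCa" if group == "PCa" else "NPCa"
        totals[group] += 1
        hits[group] += int(classify(index, threshold) == expected)
    return {group: hits[group] / totals[group] for group in totals}

# Made-up discrimination indices for a few groups.
records = [
    ("PCa", 0.91), ("PCa", 0.62), ("PCa", 0.44),
    ("BPD", 0.12), ("BPD", 0.55),
    ("HP", 0.03), ("HP", 0.20),
    ("LCa", 0.31),
]
group_acc = accuracy_by_group(records)
print(group_acc)
```

Under this reading, the accuracy for the PCa group is its sensitivity and the accuracy for every other group is the specificity within that group.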

Fig. 4

Performance of the PCa4miR model in discriminating PCa from OCa, BPD and HP. The screening index was calculated and shown as dot plots for PCa, OCa, BPD and HP in the A training set, B testing set, and C validation set. Radar charts summarizing the ACC of the PCa4miR model in each cohort; the orange polyline represents the ACC of the PCa4miR model in distinguishing each cancer in the D training set, E testing set, and F validation set. The screening index was calculated and shown as dot plots for PCa in different clinical subgroups in the G training set, H testing set and I validation set. Histograms of the ACC of the PCa4miR model in distinguishing each subgroup in the J training set, K testing set and L validation set. ACC, accuracy; PCa, prostate cancer; OCa, other cancers; BPD, benign prostatic disease; HP, healthy participants; LCa, lung cancer; CCa, colorectal cancer; GCa, gastric cancer; BCa, bladder cancer; PC, pancreatic cancer; ECa, esophageal cancer; BTCa, biliary tract cancer; HCC, hepatocellular carcinoma

Furthermore, to assess the model’s accuracy across different clinical subgroups of PCa, the screening index was analyzed in subgroups based on age (≤65 vs. >65 years) and disease stage (I–II vs. III–IV) within the training (Fig. 4G), testing (Fig. 4H), and validation (Fig. 4I) sets. The results confirmed that the PCa4miR model consistently achieved high screening accuracy across these subgroups. Histograms (Fig. 4J–L) further showed that the model maintained accuracy above 80% across the clinical categories, demonstrating its robust performance and generalizability.

3.5 The PCaSS model was developed using the relative ratios of specific miRNAs

We identified four miRNAs (miRNA-1290, miRNA-6777-5p, miRNA-1343-3p, and miRNA-6836-3p) as crucial biomarkers for PCa screening. Among these, miRNA-1290, miRNA-6777-5p, and miRNA-1343-3p were significantly upregulated in PCa patients, whereas miRNA-6836-3p was downregulated. A scoring system was established by calculating the ratios of upregulated to downregulated miRNAs. Using six machine learning algorithms, the PCa screening score (PCaSS) model was developed based on the ratios miR-1290/miRNA-6836-3p, miR-6777-5p/miRNA-6836-3p, and miR-1343-3p/miR-6087. Evaluation of model performance metrics, including AUC (Fig. 5A), SEN (Fig. 5B), SPE (Fig. 5C), ACC (Fig. 5D), PPV (Fig. 5E), and NPV (Fig. 5F), identified the XGBoost algorithm as the optimal approach for constructing the PCaSS model.
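A ratio feature of this kind is one miRNA's expression value divided by another's. The sketch below builds the three ratios for a single sample, using the numerator and denominator pairs exactly as listed in the text. The expression values are made up, and the source does not say whether ratios were taken on raw or log-transformed values; plain division on raw values is assumed here.

```python
def ratio_features(expression, pairs):
    """Return {'numerator/denominator': ratio} for each miRNA pair."""
    features = {}
    for numerator, denominator in pairs:
        if expression[denominator] == 0:
            raise ValueError(f"zero expression for {denominator}")
        features[f"{numerator}/{denominator}"] = (
            expression[numerator] / expression[denominator]
        )
    return features

# Pairs as listed in the text.
PAIRS = [
    ("miR-1290", "miR-6836-3p"),
    ("miR-6777-5p", "miR-6836-3p"),
    ("miR-1343-3p", "miR-6087"),
]

# Made-up expression values for one hypothetical sample.
sample = {
    "miR-1290": 8.0,
    "miR-6777-5p": 6.0,
    "miR-1343-3p": 9.0,
    "miR-6836-3p": 4.0,
    "miR-6087": 12.0,
}
ratios = ratio_features(sample, PAIRS)
print(ratios)
```

Because each feature is a within-sample ratio, a multiplicative scaling applied to all miRNAs in a sample cancels out, which is one common motivation for ratio-based scores.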

Fig. 5

Identification of the best PCaSS model. Histogram of the mean A AUC, B SEN, C SPE, D ACC, E PPV and F NPV of the fivefold cross-validation results of the models built by six machine learning algorithms based on the values of miR-1290/miRNA-6836-3p, miR-6777-5p/miRNA-6836-3p and miR-1343-3p/miR-6087. Different colored dots represent different cross-validation results. KNN, K-Nearest Neighbor; SVM, Support Vector Machine; XGBoost, eXtreme Gradient Boosting; RFC, Random Forest classifier; LR, Logistic Regression; AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy; PPV, positive predictive value; NPV, negative predictive value

The AUC values achieved for the PCaSS model were 0.998 (95% CI: 0.997–1.000) in the training set (Fig. 6A), 0.963 (95% CI: 0.952–0.972) in the testing set (Fig. 6B), and 0.964 (95% CI: 0.952–0.974) in the validation set (Fig. 6C). Radar charts summarizing performance metrics across datasets are shown in Fig. 6D–F, with detailed results in Table S3. Heatmaps illustrate miRNA ratio clustering across the training (Fig. 6G), the testing (Fig. 6H) and the validation (Fig. 6I) sets, while scatter plots (Fig. 6J–L) highlight significant differences in miRNA ratios between PCa and NPCa samples. Decision curve analysis (DCA) demonstrated that the PCaSS model outperformed individual miRNA ratios in net benefit over a wide range of decision thresholds (Fig. 6M–O). Collectively, these findings emphasize the high accuracy and reliability of the PCaSS model.
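In decision curve analysis, the net benefit at a threshold probability pt is commonly defined as TP/n − (FP/n) × pt/(1 − pt), where TP and FP are counted after treating every sample with a predicted probability of at least pt as positive. The following minimal sketch uses that standard definition with made-up labels and scores; it is not the study's implementation, and the function names are illustrative.

```python
def net_benefit(y_true, scores, pt):
    """Net benefit of calling samples with score >= pt positive (0 < pt < 1)."""
    n = len(y_true)
    tp = sum(1 for y, s in zip(y_true, scores) if s >= pt and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= pt and y == 0)
    return tp / n - (fp / n) * pt / (1 - pt)

def net_benefit_treat_all(y_true, pt):
    """Reference strategy: every sample is called positive."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * pt / (1 - pt)

# Toy data: 3 PCa and 7 NPCa samples with made-up predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.4, 0.3]

for pt in (0.1, 0.3, 0.5):
    print(pt, net_benefit(y_true, scores, pt), net_benefit_treat_all(y_true, pt))
```

A model is usually judged useful over the range of thresholds where its curve lies above both the treat-all curve and the treat-none line at zero.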

Fig. 6

Effectiveness of the PCaSS model. ROC curves of the PCaSS model in the A training set, B testing set and C validation set. Radar charts summarizing the ability of the PCaSS model to recognize PCa in the D training set, E testing set and F validation set, as determined by ACC, AUC, SEN, SPE, PPV and NPV. Heatmaps of miR-1290/miRNA-6836-3p, miR-6777-5p/miRNA-6836-3p and miR-1343-3p/miR-6087 in the G training set, H testing set and I validation set. The levels of miR-1290/miRNA-6836-3p, miR-6777-5p/miRNA-6836-3p and miR-1343-3p/miR-6087 in the PCa group and NPCa group in the J training set, K testing set and L validation set. DCA showing the difference in net benefit between the PCaSS model and the serum biomarkers across a wide range of decision threshold probabilities in the M training set, N testing set and O validation set. AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy; PPV, positive predictive value; NPV, negative predictive value

To further validate the PCaSS model’s ability to distinguish PCa from other conditions such as OCa, BPD, and HP, discrimination indices were calculated in the training (Fig. S2A), testing (Fig. S2B), and validation sets (Fig. S2C). A threshold of ≥0.5 classified PCa, while <0.5 indicated NPCa. Results confirmed the PCaSS model’s effectiveness in differentiating PCa from 10 other tumor types, BPD, and HP. Radar charts (Fig. S2D–F, Table S4) further illustrated its strong screening performance in these settings.

Additionally, subgroup analysis evaluated the model’s accuracy in clinical subpopulations stratified by age (≤65 and >65 years) and cancer stage (I–II vs. III–IV) in the training (Fig. S2G), testing (Fig. S2H), and validation sets (Fig. S2I). The PCaSS model demonstrated consistently high accuracy across all subgroups. Histograms (Fig. S2J–L) further highlight the model’s robust performance, showing reliable screening accuracy across diverse clinical categories.

3.6 External validation of PCa4miR and PCaSS models

Both the PCa4miR and PCaSS models were further validated using an external dataset (n = 38). The AUC for the PCa4miR model (Fig. 7A) was 0.811 (95% CI: 0.705–0.898), whereas the AUC for the PCaSS model (Fig. 7D) was 0.898 (95% CI: 0.758–0.997). Radar charts summarizing the ACC, AUC, SEN, SPE, PPV, and NPV for the two models are shown in Fig. 7B and E, with detailed values in Table S5. DCA demonstrated a significant net benefit for both the PCa4miR (Fig. 7C) and PCaSS (Fig. 7F) models across a wide range of decision thresholds. These findings indicate that both models retained good predictive performance on the external dataset.
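With only 38 external samples, the confidence interval around an AUC is necessarily wide. The sketch below computes the AUC as the probability that a randomly chosen PCa sample scores higher than a randomly chosen non-PCa sample, and attaches a percentile bootstrap interval. The bootstrap is an assumption made for illustration, since the source does not state how its intervals were obtained; the class balance and scores are simulated, not the external dataset.

```python
import random

def auc(y_true, scores):
    """AUC as P(positive score > negative score); ties count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (resampling whole samples)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that contain a single class
            stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int(alpha / 2 * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated scores for 38 hypothetical samples (19 per class is an assumption).
rng = random.Random(42)
y_true = [1] * 19 + [0] * 19
scores = [rng.gauss(1.0, 1.0) if y == 1 else rng.gauss(0.0, 1.0) for y in y_true]

point = auc(y_true, scores)
lower, upper = bootstrap_ci(y_true, scores)
print(round(point, 3), round(lower, 3), round(upper, 3))
```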

Fig. 7

External validation of PCa4miR and PCaSS models. A ROC curves of PCa4miR model. B The radar chart summarized the ability of PCa4miR model to recognize PCa, which were determined by ACC, AUC, SEN, SPE, PPV and NPV. C In a wide range of decision threshold probability, the net benefit of PCa4miR model. D ROC curves of PCaSS model. E The radar chart summarized the ability of PCaSS model to recognize PCa. F In a wide range of decision threshold probability, the net benefit of PCaSS model. AUC, area under the curve; SEN, sensitivity; SPE, specificity; ACC, accuracy; PPV, positive predictive value; NPV, negative predictive value


HORMONES & CANCER / DISCOVER ONCOLOGY
