Identification and validation of a novel six-gene signature based on mucinous adenocarcinoma-related gene molecular typing in colorectal cancer

3.1 Analysis of MAC-related genes in CRC

In comparison to the small number of normal samples, more MAC samples for subsequent analysis were found in the TCGA database. We analyzed DEGs between 58 MAC samples and 328 adenocarcinoma samples in the TCGA-COAD dataset. DEGs were analyzed between 14 MAC samples and 118 adenocarcinoma samples in the TCGA-READ dataset. DEGs were presented in volcano plots (Fig. 2A and D). There were 1418 differentially up-regulated genes and 736 differentially down-regulated genes in TCGA-COAD MAC. A total of 606 genes were up-regulated and 372 down-regulated in TCGA-READ MAC. We further performed GO and KEGG analyses on MAC-related DEGs to explore the potential functional mechanism (Fig. 2B, C, E, and F). In CRC, according to the results of GO analysis, MAC-related DEGs were involved in receptor-ligand activity, signaling receptor activator activity, and other functional mechanisms. According to the results of KEGG analysis, MAC-related DEGs were involved in neuroactive ligand-receptor interaction, cytokine-cytokine receptor interaction, and IL-17 signaling pathway. Our results suggested that DEGs were closely related to the malignant biological behavior of colorectal MAC. GSEA of the hallmark gene set was performed on the MAC-related genes, which indicated that the epithelial-mesenchymal transition (EMT) pathway-related gene set was activated (Fig. 2G and H).

Fig. 2figure 2

Analysis of the MAC-related genes. A Volcano plot of MAC-related DEGs in the TCGA-COAD dataset. B GO analysis of MAC-related DEGs in the TCGA-COAD dataset. C KEGG analysis of MAC-related DEGs in the TCGA-COAD dataset. D Volcano plot of MAC-related DEGs in the TCGA-READ dataset. E GO analysis of MAC-related DEGs in the TCGA-READ dataset. F KEGG analysis of MAC-related DEGs in the TCGA-READ dataset. G GSEA of MAC-related genes in the TCGA-COAD dataset. H GSEA of MAC-related genes in the TCGA-READ dataset

3.2 Identification of molecular subtypes in CRC

The intersection of genes related to TCGA-COAD MAC and TCGA-READ MAC showed 264 genes to be up-regulated and 141 genes to be down-regulated simultaneously. A total of 405 genes that may have biological importance for colorectal MAC were identified (Fig. 3A). Survival analysis was performed on the critical genes grouped based on the optimal truncation value of patient survival. Genes with P < 0.05 were selected, and 192 MAC-related key genes were found to have a correlation with the OS of CRC patients in the TCGA dataset. These 192 MAC-related key genes were then used for the molecular typing of CRC by the NMF method. According to phenotype, dispersion, and silhouette, the optimal number of clusters was three (Fig. 3B). Three molecular subtypes were obtained: Cluster 1 (n = 280), Cluster 2 (n = 127), and Cluster 3 (n = 130). The PCA diagram (Fig. 3C) showed an apparent distinction among CRC's molecular subtypes, indicating that the three molecular subtypes were well differentiated. The consistency clustering heatmap can be seen in Fig. 3D. Survival analysis was performed for samples with survival data, and the Kaplan–Meier curves showed significant differences in the survival of patients with different molecular subtypes (Fig. 3E).

Fig. 3figure 3

Molecular subtyping analysis of MAC-related genes in CRC. A Venn diagram of the intersection of TCGA colorectal MAC-related genes. B Line charts of phenotypic correlation, dispersion, and silhouette at rank = 2–10. C PCA graph of different subtypes of TCGA CRC dataset. D Consensus graph of NMF clustering. E Survival analysis of three molecular subtypes. F GSVA heatmap of three molecular subtypes. G ESTIMATE analysis of three molecular subtypes. H ImmuCellAI analysis of three molecular subtypes. I CIBERSORTx analysis of three molecular subtypes. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns: no significance

GSVA were then performed on the tumor-related gene sets “H.all.v7.4.symbols.gmt” of three molecular subtypes, and a heatmap was drawn for the GSVA scores (Fig. 3F). From the heatmap, it can be found that “EPITHELIAL_MESENCHYMAL_TRANSITION”, “INFLAMMATORY_RESPONSE” or “ANGIOGENESIS” got relatively higher scores in Cluster 2. In addition, the differences in immune characteristics among the three subtypes were analyzed. The molecular subtypes' ImmuneScore, StromalScore, and ESTIMATEScore were all calculated (Fig. 3G). Scores in Cluster 2 were relatively high, which indicated more immune and stromal components in the tumor microenvironment. Therefore, the degree of cell invasion in the tumor microenvironment of the three molecular subtypes was further explored. Significant differences in infiltrating cells were observed in ImmuCellAI (Fig. 3H) and CIBERSORTx (Fig. 3I) analyses. The degree of macrophage invasion in the tumor microenvironment of Cluster 2 was higher, and NK-T cell infiltration was considerably lower. These results suggested that the three molecular subtypes of CRC differentiated by MAC-related genes have significant differences in clinical prognosis and potential mechanism pathways, with apparent differentiation.

3.3 Identification of CRC molecular typing-related genes

Significant biological differences were observed in the three molecular subtypes distinguished by the key genes of MAC, from clinical features to mechanism. Therefore, DEGs between Cluster 2 and Cluster 1 and DEGs between Cluster 2 and Cluster 3 were analyzed. 3,733 genes were up-regulated, and 1,009 genes were down-regulated between Cluster 2 and Cluster 1. 2,674 genes were up-regulated, and 833 genes were down-regulated between Cluster 2 and Cluster 3.

After merging the two groups of DEGs, 4,457 were up-regulated, and 1,677 were down-regulated. Due to the large number of genes obtained by MAC-related molecular typing, we analyzed the DEGs of 17 paired CRC samples in the GSE32323 dataset from the GEO database to narrow the gene range. Then, we screened out genes which were closely associated with the occurrence and development of CRC. The prognostic signature could be constructed in the whole range of CRC, and the applicability of the prognostic signature could be expanded. In the GSE32323 dataset, 718 genes were up-regulated, and 719 genes were down-regulated. There were 139 genes in the intersection of up- and down-regulated genes (Fig. 4A). This gene set may play an essential role in CRC.

Fig. 4figure 4

Construction and validation of the signature. A Venn diagram of the intersection of key genes in CRC samples. B Ten-fold cross-validation of lambda selection by LASSO. C The LASSO coefficient spectrum in the TCGA training set. DF Survival analysis showed the difference in the prognoses of patients in the TCGA CRC training set, internal testing set, and entire set. GI The one-, three-, and five-year ROC curves of the TCGA CRC training set, internal testing set, and entire set. JL Kaplan–Meier curves in GSE17538, GSE38832, and GSE39582. MO The one-, three-, and five-year ROC curves in GSE17538, GSE38832, and GSE39582

3.4 Development of the RiskScore signature

The gene set that contained 139 genes was obtained using the aforementioned steps. After removing samples with missing survival data or the survival time of 0, the entire TCGA CRC set (512 samples) was randomly divided into the training set and the internal testing set (7:3). Based on the expression levels of 139 CRC molecular typing-related genes and survival information in the training set, we constructed a CRC prognostic signature. Firstly, P < 0.05 was chosen as the screening threshold, and univariate Cox analysis was performed on the 139 genes in the training set to screen out the genes with prognostic significance. A total of 12 genes met the screening threshold by univariate Cox analysis. Since the number of 12 genes was still relatively large, we further compressed these 12 genes using the LASSO regression (Fig. 4B and C). Ten-fold cross-validation was performed in the LASSO regression. When lambda = 0.0116, 12 genes were compressed to nine genes. Therefore, these nine genes were included in the multivariate Cox analysis, and the stepwise method was used to construct the signature.

The six genes found to be related to CRC molecular typing were ultimately used for constructing a prognostic signature of CRC patients: CALB1, MMP1, HOXC6, ZIC2, SFTA2, and HYAL1. The following formula was used for calculating the RiskScore (the coefficients were rounded to seven decimal places).

$$RiskScore = _\left(t\right)}(0.1187114\times CALB1-0.1208561\times MMP1+0.3187824\times HOXC6+0.0974445\times ZIC2+0.2239488\times SFTA2-0.2384973\times HYAL1)$$

The samples were divided into high- and low-RiskScore groups based on the median RiskScore in the TCGA training set, TCGA internal testing set, and TCGA entire set. Kaplan–Meier analysis suggested that low-RiskScore patients had a significantly better survival outcome. ROC curve results showed the signature constructed in this study had the good predictive ability (five-year AUC = 0.720; three-year AUC = 0.692; one-year AUC = 0.704). In particular, the AUC values of the TCGA internal testing set and the entire set at one, three, and five years were all greater than 0.7, and the AUC value of the internal testing set at three years was 0.8. Therefore, it could be seen that this prognostic signature showed good discriminative and predictive ability (Fig. 4D-I). To determine the independence of the six-gene signature in clinical application, univariate and multivariate Cox regression analyses were conducted on clinical variables in TCGA CRC samples with complete clinical information (shown in Table 1). Univariate Cox regression analysis showed that stage III/IV vs. stage I/II, age, and the RiskScore were significantly correlated with survival. However, gender was not a prognostic factor. In multivariate Cox analysis, stage III/IV vs. stage I/II, age, and the RiskScore had a significant correlation with survival. These results suggested the potential for the six-gene signature to be used as an independent risk factor for predicting the prognosis of CRC patients.

Table 1 Univariate and multivariate Cox analyses3.5 Validation of the six-gene signature in three independent GEO CRC datasets

Three independent GEO datasets (GSE17538, GSE38832, and GSE39582) were used as external testing sets for verifying the prognostic signature. The samples were divided into high- and low-RiskScore groups based on the median RiskScore of each dataset. In the external testing sets, Kaplan–Meier curves showed the median OS of patients in the high-RiskScore group to be significantly shorter than in the low-RiskScore group (Fig. 4J, K, and L). The one-, three-, and five-year AUC values also suggested that the prognostic signature had the good predictive ability in the three independent datasets (Fig. 4M, N, and O).

3.6 Correlation between RiskScores and clinicopathological features

Based on the best cut-off P values, we first analyzed the prognostic value of these six genes in the TCGA CRC dataset. The results indicated that patients with high expression of CALB1, HOXC6, ZIC2, or SFTA2 had poor prognoses, while those with high expression of MMP1 or HYAL1 had good prognoses (Fig. 5A-F). It could be inferred that the six genes had a close association with the CRC prognosis. The relationship between the RiskScores and clinical parameters of CRC was further explored. Regardless of gender, age, tumor locations, or T/N/M stages, patients with low-RiskScores were found to have relatively better prognoses (Fig. 6A-F).

Fig. 5figure 5

The effect of signature genes on the prognosis of patients in the TCGA CRC dataset. A Effect of CALB1 on patient survival. B Effect of HOXC6 on patient survival. C Effect of HYAL1 on patient survival. D Effect of MMP1 on patient survival. E Effect of SFTA2 on patient survival. F Effect of ZIC2 on patient survival

Fig. 6figure 6

Survival analysis between RiskScores and different clinical parameters in the TCGA CRC dataset. A Effect of RiskScores on the prognosis of different gender. B Effect of RiskScores on the prognosis of different ages. C Effect of RiskScores on the prognosis of different primary tumor locations. D Effect of RiskScores on the prognosis of different T stages. E Effect of RiskScores on the prognosis of different N stages. F Effect of RiskScores on the prognosis of different M stages

Subsequently, we further analyzed the relationship between RiskScores and clinical characteristics in the aforementioned TCGA CRC entire set. The IC50 values of Erlotinib, BMS.754807, OSI.906, and SB.216763 in the low-RiskScore group were found to be lower. The corresponding IC50 values of Camptothecin, CI.1040, ABT.263, and A.770041 were all higher. No significance was found in the IC50 values of MS.275 and Cytarabine between the two groups (Fig. 7A). These results suggested that patients in the high-RiskScore group could choose drugs with lower corresponding IC50 values, potentially improving efficacy. The differences in gene mutations between high- and low-RiskScore groups were explored (Fig. 7B). The mutation rates in the high-RiskScore group of the first five were APC (75%), TP53 (63%), TTN (52%), KRAS (38%), and SYNE1 (32%). Mutation rates in the low-RiskScore group of the first five were APC (79%), TP53 (58%), TTN (51%), KRAS (44%), and PIK3CA (30%). Interestingly, we found a significant difference in the distribution of BRAF mutation between the two groups (P < 0.001), with 18% in the high-RiskScore group and 6% in the low-RiskScore group. The relationship between RiskScores and three Clusters formed by MAC-related genes was further analyzed (Fig. 7C). We were also surprised that Cluster 3 with the low RiskScore had the best prognosis, while Cluster 2 with the high RiskScore had the worst prognosis.

Fig. 7figure 7

Analysis of clinical features of the signature. A IC50 values of antitumor drugs in patients with different RiskScores. B Gene mutation waterfall diagram of TCGA CRC samples with different RiskScores. C Relationship between different Clusters and RiskScores. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns: no significance

3.7 Evaluation of the tumor microenvironment in groups stratified by the RiskScore

ImmuCellAI and CIBERSORTx online tools were used for analyzing the immune infiltrating cells of patients in the high- and low-RiskScore groups. A significant difference was observed in some immune infiltrating cells between the high- and low-RiskScore groups. ImmuCellAI analysis (Fig. 8A) showed that the infiltration levels of macrophages and DC were higher in the high-RiskScore group. CIBERSORTx analysis showed that the infiltration levels of plasma cells and activated mast cells were higher in the low-RiskScore group (Fig. 8B). Through the above analyses, it was found that the RiskScore had an impact on the tumor microenvironment of CRC samples.

Fig. 8figure 8

A ImmuCellAI analysis of tumor microenvironment immune infiltrating cells in different RiskScore groups. B CIBERSORTx analysis of tumor microenvironment immune infiltrating cells in different RiskScore groups. ****P < 0.0001; ***P < 0.001; **P < 0.01; *P < 0.05; ns: no significance

3.8 Analysis of the differences in molecular mechanism caused by the RiskScore

Due to the significant differences in clinical characteristics caused by the RiskScore, DEGs in the high- and low-RiskScore groups in the aforementioned TCGA CRC entire dataset were further explored (Fig. 9A). GO analysis was conducted on DEGs (Fig. 9B). It was found that DEGs were mainly involved in the receptor ligand activity, signaling receptor activator activity, and some metabolic processes. KEGG analysis of the potential mechanisms conducted on DEGs (Fig. 9C) showed that DEGs caused by the RiskScore were engaged in tumor occurrence and development-related pathways and metabolic-related pathways. GSEA was then further performed on the two groups with different RiskScores, and the primary mechanism caused by the RiskScore was analyzed using hallmark gene sets (Fig. 9D). It could be seen that HALLMARK_APICAL_JUNCTION, HALLMARK_MYOGENESIS, and HALLMARK_KRAS_SIGNALING_DN were activated, while HALLMARK_INFLAMMATORY_RESPONSE and HALLMARK_IL6_JAK_STAT3_SIGNALING were inhibited.

Fig. 9figure 9

Functional enrichment analyses of the signature. A Volcano plot of DEGs caused by different RiskScores. B GO analysis of DEGs caused by different RiskScores. C KEGG analysis of DEGs caused by different RiskScores. D GSEA of RiskScore-related genes

3.9 Construction of a nomogram combined with the RiskScore and clinical features

A nomogram was constructed based on the prognostic signature related to colorectal MAC-related molecular subtypes. Multivariate Cox regression analysis on potential variables, which included the T/N/M stage, age, gender, and the RiskScore, was used to predict one-, three-, and five-year OS rates in CRC patients (Fig. 10A). The C-index of the nomogram was 0.783. It was observed that a higher RiskScore led to a worse prognosis. For example, a clinician could calculate a 61-year-old female CRC patient with clinical characteristics of T2N0M0 and the RiskScore for a score of the corresponding variables, resulting in a total score of -0.35. The probabilities of survival time longer than one, three, and five years were found to be 0.971, 0.933, and 0.861, respectively.

Fig. 10figure 10

Construction and validation of the nomogram. A OS rates of patients for one, three, and five years were predicted by the nomogram constructed with important clinical parameters. B DCA evaluated the nomogram compared with the T/N/M stage (one-year DCA). C The calibration curves. D C-index histogram

Calibration curves were also constructed for evaluating the consistency of the OS rates and OS rates predicted by the nomogram. The results showed that this method fit well for one-, three-, and five-year OS prediction (Fig. 10C). In comparison to the T/N/M stage, the results of one-year DCA proved the nomogram had reasonable clinical practicability for prognosis prediction (Fig. 10B). The constructed nomogram was compared with the nomograms of CRC previously published [41,42,43,44,45] and it was found that the nomogram used in this study had a higher C-index (Fig. 10D).

Comments (0)

No login
gif