Identifying Common Diagnostic Biomarkers and Therapeutic Targets between COPD and Sepsis: A Bioinformatics and Machine Learning Approach

Introduction

Chronic obstructive pulmonary disease (COPD) is a heterogeneous condition1 characterized by persistent respiratory symptoms and progressive airflow obstruction.2 In 2017, approximately 544.9 million individuals worldwide were diagnosed with chronic respiratory diseases, with COPD accounting for about 55%, making it the third leading cause of death globally.3,4 Data statistics show that in 2020, approximately 384 million people worldwide had COPD, resulting in about 3.2 million deaths, which accounted for 6% of the total global mortality.5,6 Patients with COPD frequently experience a range of complications, including cardiovascular disease, depression, and anxiety,7,8 which significantly threaten their health and quality of life. Therefore, there is an urgent need to explore novel treatment strategies for COPD.

Sepsis is a clinical syndrome resulting from an abnormal immune response to infection9 and is characterized by life-threatening multiple organ dysfunction.10 The disease course can be divided into two stages: (i) the initial stage of systemic inflammatory response syndrome (SIRS), which may lead to multiple organ failure (MOF) and septic shock; and (ii) the subsequent stage of compensatory anti-inflammatory response syndrome (CARS), which results in sepsis-induced immunosuppression, increasing the risk of late infection and mortality.11 Approximately 48.9 million individuals are diagnosed with sepsis worldwide each year, with about 11 million fatalities resulting, accounting for 19.7% of total global deaths.12 Sepsis is heterogeneous and can result in a variety of infections caused by different pathogens, leading to various types of organ damage.13 Therefore, despite the gradual standardization of sepsis treatment methods, significant challenges and opportunities for improvement remain,14 necessitating further research.

Studies have indicated that the prevalence and severity of COPD are elevated in patients with sepsis, resulting in more severe comorbidities, including increased complications, significant inflammation, and metabolic abnormalities.15 In summary, although significant differences exist in the pathogenesis and clinical manifestations of COPD and sepsis, an intrinsic relationship exists between the two, both involving abnormalities in the immune system and inflammatory responses. Understanding the relationship between COPD and sepsis, along with their common influencing factors, will elucidate their pathogenic mechanisms and provide new insights for diagnosis and treatment. In this study, we employed Mendelian randomization analysis, comprehensive bioinformatics analysis, and machine learning to elucidate the pathogenesis of comorbidity between COPD and sepsis. This research aims to enhance our understanding of the pathogenesis of COPD and sepsis while providing new insights and directions for potential therapeutic targets, which holds significant importance and value.

Materials and Methods Research Design

In this study, we investigated the causal relationship between COPD and sepsis through a two-way two-sample Mendelian randomization (MR) and examined the mediating role of immune cells. In our MR study, single nucleotide polymorphisms (SNPs) were designated as instrumental variables (IVs).16 Subsequently, comprehensive bioinformatics analysis methods and two machine learning approaches were employed to identify co-diagnostic genes for COPD and sepsis, facilitating immune infiltration analysis, identification of transcription factors and miRNAs, and evaluation of candidate drugs. Figure 1 presents the workflow of this study.

Figure 1 The work flow chart of this study.

Abbreviations: COPD, chronic obstructive pulmonary disease; GWAS, genome-wide association study; logFC, log fold change; GSE, Gene Expression Omnibus Series; DGEs, differentially expressed genes; WGCNA, weighted gene co-expression network analysis; GO, gene ontology; KEGG, kyoto encyclopedia of genes and genomes; LASSO, Logical Regression of Selection Operators; SVM-RFE, support vector machine recursive feature elimination; TFs, transcription factors.

Data Sources

In the Mendelian randomization (MR) analysis, we utilized summary-level data obtained from the largest publicly accessible genome-wide association study (GWAS) database (https://gwas.mrcieu.ac.uk/) for each trait (Table 1). Exposure (COPD) and outcome (sepsis) data were extracted from the FinnGen Biobank (FREEZE 11; https://r11.finngen.fi/) and the UK Biobank. All participants included in the analysis were of European ancestry to minimize potential race-related influences on the outcomes.

Table 1 Research Details of COPD and Sepsis Data Sets in MR Analysis

GWAS summary statistics for each immune trait are publicly available from the GWAS Catalog (accession numbers GCST90001391 to GCST90002121).17 A total of 731 immunophenotypes were included, encompassing absolute cell (AC) counts (n=118), median fluorescence intensities (MFI) reflecting surface antigen levels (n=389), morphological parameters (MP) (n=32), and relative cell (RC) counts (n=192). Specifically, the MFI, AC, and RC features include B cells, CDCs, mature stages of T cells, monocytes, myeloid cells, TBNK (T cells, B cells, natural killer cells), and Treg panels, while the MP feature comprises CDC and TBNK panels. The original GWAS on immune traits was conducted using data from 3757 European individuals, with no overlapping cohorts. Approximately 22 million single nucleotide polymorphisms (SNPs) genotyped with high-density arrays were imputed using the Sardinian sequence-based reference panel,18 and associations were tested after adjusting for covariates (ie, sex, age, and age²).

The gene expression datasets for COPD and sepsis were obtained from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/).19 The following criteria guided the search: (1) the samples were from Homo sapiens, (2) the experimental data type was microarray, (3) the dataset included both the control and disease groups, and (4) the total number of samples in each cohort was at least 10. Finally, four GEO datasets—GSE5058, GSE95233, GSE8545, and GSE57065—were selected as the datasets for COPD and sepsis. Among them, GSE5058 (COPD) and GSE95233 (sepsis) were designated as the discovery cohorts. GSE8545 and GSE57065 were used to validate the hub genes of COPD and sepsis, respectively. The detailed information regarding the aforementioned datasets is presented in Table 2.

Table 2 The Basic Information of GEO Dataset Used in This Study

Instrumental Variable Selection and Data Reconciliation in MR Analysis

We included genome-wide significant single nucleotide polymorphisms (SNPs) (P < 5 × 10−8). If no genome-wide significant SNPs were available as instrumental variables (IVs), SNPs below the genome-wide significance level (P < 5 × 10−6) were utilized as candidate IVs. If no available SNPs are screened out within these two ranges, then SNPs below the genome-wide significance level (P < 1 × 10−5) were utilized as candidate IVs. SNP selection was based on linkage disequilibrium clustering (window size = 10,000 KB, r² < 0.001). Linkage disequilibrium levels were estimated from the 1000 Genomes Project based on European samples.20 If no specifically exposed SNPs are identified in the resultant dataset, the linkage disequilibrium (LD) marker employs proxy SNPs. The Mendelian randomization analysis excluded palindromic and ambiguous SNPs from the IVs.21 The F statistic is calculated based on the variance explained by the SNP for each exposure, specifically as (N−K−1)K/R2(1−R2)\frac\bigg/\fracK(N−K−1)/(1−R2)R2, where K represents the number of genetic variations and N denotes the sample size. Weak instrumental variables (F statistic < 10) were excluded in this study.22,23 The minor allele frequency (MAF) was set to > 1%.

Main MR Analysis

Figure 2 presents a summary of the analysis. We conducted a two-sample two-way Mendelian randomization (MR) analysis to assess the causal relationship between chronic obstructive pulmonary disease (COPD) and sepsis (Figure 2A), designating it as the total effect. Inverse variance weighting (IVW) was employed to perform a meta-analysis that combines the Wald ratio of each SNP’s causal effect.21,24 Subsequently, the MR-Egger25 and weighted median26 methods were utilized as supplementary analyses to IVW. Different methods are employed to obtain MR estimates based on varying validity assumptions. The application of IVW relies on the premise that all SNPs are valid instrumental variables. Consequently, this method can yield accurate estimation results. The MR-Egger method assesses the directional pleiotropy of instrumental variables, where the intercept can be interpreted as an estimate of the average pleiotropy of genetic variations. In comparison to the MR-Egger analysis, the weighted median method has the advantage of achieving higher accuracy (ie, smaller standard deviation). In the presence of horizontal pleiotropy, even if 50% of the genetic variations are invalid instrumental variables (IVs), the weighted median method can provide consistent estimates.27

Figure 2 The associated charts studied in this study. (A) Bidirectional MR analysis of chronic obstructive pulmonary disease (COPD) and sepsis. c is the total effect of exposure and sepsis as a result of gene-predicted COPD. d is the total effect of genetic prediction of sepsis as a result of exposure and COPD as a result. (B) The total effect was decomposed into: (i) indirect effect using two-step method (a is the effect of COPD on immune cells, b is the effect of immune cells on sepsis) and product method (a×b); (ii) Direct effect (c′ = c−a×b). The proportion mediated by immune cells is the indirect effect divided by the total effect.

Mediation MR Analysis

We additionally conducted a mediation analysis employing a two-step Mendelian randomization (MR) design to investigate whether immune cells mediate the causal pathway from chronic obstructive pulmonary disease (COPD) to sepsis outcomes (Figure 2). The overall effect can be decomposed into an indirect effect (mediated by immune cells) and a direct effect (without mediation).28 The total effect of COPD on sepsis was divided into (1) the direct effect of COPD on sepsis (c′ in Figure 2B) and (2) the indirect effect of COPD mediated by immune cells (a × b in Figure 2B). We calculated the percentage mediated by the mediating effect by dividing the indirect effect by the total effect. Simultaneously, the 95% confidence interval was calculated using the delta method.29

Sensitivity Analysis

The causal direction of each extracted SNP concerning exposure and outcomes was assessed using MR Steiger filtering.30 This method calculates the variance explained by the exposure and instrumental SNPs, testing whether the variance in the outcomes is less than that in the exposure. The “TRUE” MR Steiger results indicate that the causal relationship is in the expected direction, whereas the “FALSE” results suggest that the causal relationship is in the opposite direction. SNPs resulting in “FALSE” were excluded, indicating that these SNPs exerted a primary effect on the outcomes rather than serving as evidence of exposure.

Heterogeneity among SNPs was assessed using Cochran’s Q statistics and funnel plots.31,32 The MR-Egger intercept method25 and the MRPRESSO method33 were employed to detect horizontal pleiotropy. If an outlier is detected, it is removed, and the MR causal estimation is re-evaluated. If heterogeneity remains high after removal, the stability of the results is evaluated using a random effects model, which is less susceptible to weak SNP exposure associations. Finally, omission analysis was conducted to verify the impact of each SNP on the overall causal estimation.

Identification of Differentially Expressed Genes (DEGs)

Preprocessing of all original datasets, including background adjustment and normalization, was conducted using the “affy” package of R software (version 4.3.0). Utilizing the GPL570 platform (Affymetrix Human Genome U133 Plus 2.0 Array), probes were transformed into gene symbols. The average value was chosen as the gene expression level for genes matching multiple probes. Subsequently, differentially expressed genes (DEGs) in the COPD and sepsis datasets were identified using the “Limma” R software package. The threshold criteria for DEGs were set at an adjusted p-value < 0.05 and |log2 fold change (FC)| ≥ 1.0. The “ggplot2” and “pheatmap” R packages were employed to generate the differential gene volcano plot and cluster heatmap. The “Venn” package was utilized to create a Venn diagram to identify the common DEGs between COPD and sepsis.

Weighted Gene Co-Expression Network Analysis (WGCNA) and Module Gene Identification

To explore gene interactions, the systems biology method Weighted Gene Co-expression Network Analysis (WGCNA) was employed to construct a gene co-expression network. First, genes exhibiting more than 25% variation in the samples were integrated, and the integrated dataset was imported into WGCNA. Second, to ensure the reliability of the network construction results, outlier samples were removed. Third, the adjacency degree was calculated from the soft threshold power β derived from the co-expression similarity using the pickSoftThreshold function. Subsequently, the adjacency relationship was transformed into a topological overlap matrix (TOM), and the corresponding dissimilarity (1-TOM) was calculated. Fourth, modules were detected through hierarchical clustering and the dynamic tree cut function. To classify genes with similar expression profiles into gene modules, average linkage hierarchical clustering was performed based on the TOM-based dissimilarity measure. The minimum size of the gene tree was set to 50.34 Fifth, module membership (MM) and gene significance (GS) were calculated for modules related to clinical attributes. Finally, the characteristic gene network was visualized. The differentially expressed genes (DEGs) screened from the integrated dataset intersected with the genes in the significant module to identify common genes (cg).

Gene Ontology (GO) Annotation and Kyoto Encyclopedia of Genes and Genomes (KEGG) Cg Pathway Enrichment Analysis

DAVID (Database for Annotation, Visualization and Integrated Discovery) 6.8 (https://david.ncifcrf.gov/tools.jsp) is a gene function classification tool, which is used to perform gene ontology (GO) by enrichment analysis of common genes (cg).35 The enriched GO terms were divided into biological process (BP), cellular component (CC) and molecular function (MF) ontology. In order to understanding the biological function of common genes (cg), we uploaded the common gene (cg) to the DAVID database (Gene Function Classification Tool, https://david.ncifcrf.gov/tools.jsp).36 The significant enrichment threshold of GO and KEGG analysis was p<0.05, and the count was≥2.

Machine Learning for Identifying Candidate Biomarkers

Based on potential diagnostic factors, two machine learning algorithms were employed to predict the status of COPD and sepsis. LASSO is a regression analysis algorithm that employs regularization techniques to enhance prediction accuracy. The LASSO regression analysis was conducted using the “glmnet” software package in R to identify genes significantly associated with the diagnostic capability for COPD, sepsis, and healthy specimens. Support Vector Machine (SVM) is a widely utilized machine learning technique for classification and regression analysis. To prevent overfitting, the Recursive Feature Elimination (RFE) algorithm is employed to select the optimal genes from the metadata queue. Consequently, SVM Recursive Feature Elimination (SVM-RFE) is employed to identify the gene set with the highest recognition capability.

Expression Analysis and Diagnostic Evaluation of Candidate Biomarkers

The expression levels of candidate biomarkers in the control group and the disease group were visualized using the “ggplot2” package (p<0.05). The area under the curve (AUC) score of the receiver operating characteristic (ROC) curve was calculated, and the diagnostic accuracy of the candidate biomarkers was evaluated using the “pROC” package.

Immune Infiltration Analysis

CIBERSORT is a deconvolution algorithm that utilizes gene expression data to identify 22 different types of immune cells. The “CIBERSORT” software package is employed to analyze immune cell infiltration. The results are visualized using the “ggplot2”, “corrplot”, and “vioplot” packages. Spearman correlation analysis was employed to assess the correlation between immune cells and candidate biomarkers.

Identification of Transcription Factors and miRNAs

Utilizing the ENCODE and TarBase databases, transcription factor-gene and gene-miRNA regulatory networks were constructed using the NetworkAnalyst 3.0 platform (https://www.networkanalyst.ca/). ENCODE is a project that offers comprehensive and integrated data regarding the functional genomic characteristics of human and mouse cells and tissues. TarBase is the primary experimental validation database for miRNA-mRNA interactions. Finally, Cytoscape was employed to visualize these interactions.

Evaluation of Candidate Drugs

The enrichment platform was used to identify the relationship between drug molecules and hub genes, and the data were from the drug feature database (DSigDB, https://tanlab.ucdenver.edu/DSigDB). The database currently has 22,527 gene sets, including 17,389 drugs, covering 19,531 genes.37 The three-dimensional structure of drug target molecules was obtained from PubChem (https://pubchem.ncbi.nlm.nih.gov/)38 and saved in SDF file format. In Protein Data Bank (PDB; the https://www.rcsb.org/) database39 obtains the 3D structure of key gene targets and saves them as files in PDB format. Molecular docking and visual analysis were performed using AutoDock Vina40 and PyMol 1.8.1 (Schrödinger, http://www.pymol.org/) to show the interaction between drug molecules and gene targets.

Statistical Analysis

Statistical analysis was conducted using R software (version 4.3.0), with p<0.05 regarded as statistically significant. Mendelian randomization (MR) analysis was conducted using the “TwoSampleMR” package.41 MR-pleiotropic residuals and outliers (MR-PRESSO) and robustly adjusted profile scores (MR.RAPS) were calculated using the R software packages “MRPRESSO” and “MR.RAPS”, respectively. Additionally, the PhenoScanner search was utilized to evaluate all known phenotypes associated with the genetic tools included in our analysis. In the screening of differentially expressed genes (DEGs), module genes, enrichment analysis, transcription factor (TF)-gene networks, gene-miRNA networks, and immune infiltration analysis, p<0.05 was regarded as significant.

Result Relationship Between COPD and Sepsis

Following the exclusion of palindromic and ambiguous SNPs identified through MR Steiger filtering, as well as non-proxy SNPs and those with incorrect causal directions, the remaining SNPs were utilized as instrumental variables. In the context of COPD, 17 SNPs were identified, while 13 SNPs were identified in the context of sepsis (Supplementary Tables S1 and S2).

The IVW, MR-Egger, and weighted median methods were employed to estimate the causal relationship between genetically predicted COPD and sepsis (Figure 3A). All three MR methods indicated a positive correlation between COPD and sepsis, with the IVW odds ratio (OR) for each standard deviation (SD) increase in COPD being 1.1428 (95% CI, 1.0476–1.2466; P = 0.0026); the MR-Egger OR being 1.3435 (95% CI, 1.1176–1.6151; P = 0.0067); and the weighted median OR being 1.2204 (95% CI, 1.0735–1.3873; P = 0.0023). However, our MR analysis results indicated no evidence of a reverse causal relationship, meaning that genetically predicted sepsis does not exhibit a causal effect on COPD. Specifically, the IVW OR for each SD increase in sepsis was 1.0779 (95% CI, 0.9688–1.1992; P = 0.1682); the MR-Egger OR was 0.9937 (95% CI, 0.7936–1.2443; P = 0.9572); and the weighted median OR was 0.9917 (95% CI, 0.8705–1.1300; P = 0.9009). The results are illustrated in Figure 3B.

Figure 3 Forest plot illustrating the causal relationship of COPD with sepsis. (A) With COPD as the exposure factor and sepsis as the outcome factor, using the IVW, MR Egger, and weighted median methods to study the causal relationship between the two. (B) With sepsis as the exposure factor and COPD as the outcome factor, using the IVW, MR Egger, and weighted median methods to study the causal relationship between the two.

Screening of Immune Cells Mediating the Relationship Between COPD and Sepsis

Initially, we treated COPD as the exposure factor and immune cell characteristics as the outcome factors. After removing palindromic and ambiguous SNPs, non-proxy SNPs, and SNPs identified with incorrect causal directions by MR Steiger filtering, we extracted a total of 19 genome-wide significant SNPs as instrumental variables (Supplementary Table S3). Using the IVW method for batch screening, we identified causal relationships between immune cell characteristics and COPD, resulting in three findings (Supplementary Table S4).

Subsequently, we extracted genome-wide significant SNPs for immune cell characteristics using the same methodology. SNPs identified with COPD as the exposure factor were removed, and the remaining SNPs were utilized as instrumental variables (Supplementary Tables S5–S7). Using sepsis as the outcome factor, we employed the IVW method to screen for immune cell characteristics with causal relationships to sepsis, resulting in three identified findings: CD33bright HLA DR+ CD14- %CD33bright HLA DR+, BAFF-R on IgD- CD38bright, and CD45 on CD33- HLA DR+ (Supplementary Table S8).

Proportion of Association Between COPD and Sepsis Mediated by Immune Cells

We analyzed these three immune cell characteristics as mediators in the pathway from COPD to sepsis. We found that the risk of sepsis was positively correlated with CD33bright HLA DR+ CD14- %CD33bright HLA DR+, which, in turn, was positively correlated with COPD risk; the risk of sepsis was negatively correlated with BAFF-R on IgD- CD38bright and CD45 on CD33- HLA DR+, which were also negatively correlated with COPD risk. As shown in Figure 4A, our study indicates that CD33bright HLA DR+ CD14- %CD33bright HLA DR+ accounts for 6.5% of the increased risk of sepsis associated with COPD (Indirect effect: 0.0651). As depicted in Figure 4B, BAFF-R on IgD- CD38bright accounts for 12.8% of the increased risk of sepsis associated with COPD (Indirect effect: 0.1279). As illustrated in Figure 4C, CD45 on CD33- HLA DR+ accounts for 3.9% of the increased risk of sepsis associated with COPD (Indirect effect: 0.0389). Detailed data are presented in Supplementary Table S9.

Figure 4 Schematic illustration of the mediating role of immune cell characteristics. (A) The immune cell CD33bright HLA DR+ CD14- %CD33bright HLA DR+ play a mediating role between COPD and sepsis and its indirect effect values. (B) The immune cell BAFF-R on IgD- CD38bright play a mediating role between COPD and sepsis and its indirect effect values. (C) The immune cell CD45 on CD33- HLA DR+ play a mediating role between COPD and sepsis and its indirect effect values.

Sensitivity Analysis

Several sensitivity analyses were performed to assess and address the presence of pleiotropy in causal estimates. The Cochran’s Q test and funnel plot revealed no evidence of heterogeneity or asymmetry in the causal relationships among these SNPs (Supplementary Table S10 and Supplementary Figures S1–S8). The impact of each SNP on the overall causal estimate was validated through a leave-one-out analysis (Supplementary Figures S9–S16). After systematically removing each SNP and reanalyzing the remaining SNPs using MR, the results remained consistent, indicating that the inclusion of all SNPs rendered the causal relationship significant.

Identification of Differentially Expressed Genes

In the COPD dataset (GSE5058), a total of 2998 differentially expressed genes (DEGs) were identified, including 1363 upregulated and 1635 downregulated genes (Figure 5A). In the sepsis dataset (GSE95233), a total of 1101 DEGs were identified, with 608 upregulated and 493 downregulated genes (Figure 5B). Cluster heatmaps based on all differentially expressed genes for both COPD and sepsis are presented (Figure 5C and D). Using a Venn diagram, a total of 112 common DEGs were identified between COPD and sepsis (Figure 5E).

Figure 5 Identification of differentially expressed genes (DEGs). (A and B) Volcano plots of all DEGs in COPD (GSE5058) and sepsis (GSE95233), respectively. (C and D) Heatmaps of all DEGs in COPD (GSE5058) and sepsis (GSE95233), respectively. (E) Venn diagram of overlapping DEGs between COPD and sepsis.

Functional Enrichment Analysis of Common Differentially Expressed Genes

Functional enrichment analyses employing Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were conducted to explore the biological characteristics and pathways associated with the 112 common differentially expressed genes (DEGs). The GO analysis identified a total of 47 terms, comprising 29 biological processes (BP), 13 cellular components (CC), and 5 molecular functions (MF). The most significant 15 terms are presented here, encompassing 5 biological processes (BP), 5 cellular components (CC), and 5 molecular functions (MF). The biological processes (BP) were primarily enriched in the innate immune response, positive regulation of natural killer cell-mediated cytotoxicity, immune response, pyroptotic inflammatory response, and inflammatory response. The cellular components (CC) were enriched in tertiary granule membrane, specific granule membrane, tertiary granule lumen, specific granule lumen, and plasma membrane. The molecular functions (MF) were enriched in carbohydrate binding, pattern recognition receptor activity, lipopolysaccharide binding, beta-glucuronidase activity, and G protein-coupled purinergic nucleotide receptor activity. The KEGG pathway enrichment analysis focused on two primary pathways, namely neutrophil extracellular trap formation and transcriptional misregulation in cancer. The results of the GO and KEGG enrichment analyses are presented in Figure 6, with complete findings listed in Supplementary Table S11. These findings indicate a significant association between chronic obstructive pulmonary disease (COPD), sepsis, and inflammatory responses, as well as immune processes.

Figure 6 Functional enrichment analysis of common DEGs between COPD and sepsis. GO analysis results of the common DEGs: displaying the top 5 biological process (BP) terms, top 5 cellular component (CC) terms, and top 5 molecular function (MF) terms. Two KEGG pathways associated with the common DEGs.

Weighted Gene Co-Expression Network Analysis (WGCNA) and Module Gene Identification

We conducted Weighted Gene Co-expression Network Analysis (WGCNA) to identify the most pertinent gene modules in chronic obstructive pulmonary disease (COPD) and sepsis. The optimal soft-thresholding power (β) was set to 5 for the COPD dataset (GSE5058) and 20 for the sepsis dataset (GSE95233) (Figure 7A and B). In the COPD dataset, five modules were identified, with the turquoise module exhibiting the strongest negative correlation with COPD (correlation coefficient = −0.88, p = 2e-13) and comprising 4256 genes (Figure 7C and E). In the sepsis dataset, seven modules were identified, with the blue module demonstrating the strongest negative correlation (correlation coefficient = −0.91, p = 2e-29) and comprising 1009 genes (Figure 7D and F). Scatter plots illustrate the correlation between module membership and gene significance in the turquoise module for COPD and in the blue module for sepsis (Figure 7G and H). Subsequently, a Venn diagram was employed to identify 130 overlapping genes within the modules associated with both COPD and sepsis (Figure 7I).

Figure 7 Construction of WGCNA networks and identification of key modules. (A and B) Selection of soft thresholds for COPD (GSE5058) and sepsis (GSE95233), respectively. (C and D) Hierarchical clustering trees of co-expression gene clusters for COPD (GSE5058) and sepsis (GSE95233), respectively. (E and F) Heatmaps showing the correlation of each module with clinical characteristics in COPD (GSE5058) and sepsis (GSE95233), respectively. (G) Scatter plot of module membership versus gene significance for the Turquoise module in COPD (GSE5058). (H) Scatter plot of module membership versus gene significance for the Blue module in sepsis (GSE95233). (I) Venn diagram of overlapping genes between COPD and sepsis identified through WGCNA analysis.

Functional Enrichment Analysis of Disease-Related Common Genes

Subsequently, we conducted Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses to elucidate the biological functions and pathways associated with disease-related common genes. A total of 48 GO terms were identified, comprising 31 biological processes (BP), 9 cellular components (CC), and 8 molecular functions (MF). The 15 most significant GO terms are presented here, encompassing 5 BP, 5 CC, and 5 MF. Additionally, nine KEGG pathways were identified, with the top five pathways highlighted (Figure 8). Detailed results are available in Supplementary Table S12. In the BP category, most genes were primarily involved in the adaptive immune response, positive regulation of natural killer cell-mediated cytotoxicity, positive regulation of type II interferon production, cell surface receptor signaling pathways, and immune response. Within the CC category, the majority of genes participated in the external side of the plasma membrane, the plasma membrane, the immunological synapse, the cell surface, and membrane rafts. Regarding the MF category, most genes were involved in transmembrane signaling receptor activity, signaling receptor binding, phosphatidylcholine binding, cytokine activity, and non-membrane-spanning protein tyrosine kinase activity. The KEGG pathway enrichment analysis revealed that most genes were significantly enriched in the T cell receptor signaling pathway, Th1 and Th2 cell differentiation, viral protein interactions with cytokines and cytokine receptors, Th17 cell differentiation, and the PD-L1 expression and PD-1 checkpoint pathway in cancer. These findings suggest that COPD and sepsis are primarily associated with inflammatory and immune responses, aligning with the enrichment analysis results for the common differentially expressed genes (DEGs).

Figure 8 Functional enrichment analysis based on disease-related common genes. GO classifications obtained from WGCNA analysis between COPD and sepsis, displaying the top 5 terms for biological process (BP), molecular function (MF), and cellular component (CC). KEGG enrichment analysis results obtained from WGCNA analysis between COPD and sepsis, showing the top 5 pathways.

Identification of Candidate Biomarkers Through Machine Learning

To further investigate the shared pathogenic mechanisms between COPD and sepsis, we identified 33 genes by intersecting the 112 DEGs and the 130 genes detected through WGCNA (Figure 9A). Subsequently, two machine learning algorithms were applied to screen for potential candidate biomarkers among these 33 shared genes. In the COPD dataset (GSE5058), LASSO regression identified five genes (Figure 9B and C), while the SVM-RFE algorithm identified 26 genes with the lowest root mean square error (RMSE) (Figure 9D). Ultimately, five overlapping genes were identified in the COPD group through the intersection of these two algorithms (Figure 9E). Similarly, in the sepsis dataset (GSE95233), LASSO regression identified 11 genes, and the SVM-RFE algorithm selected 21 genes (Figure 9F–H). Ten overlapping genes were subsequently identified in the sepsis group (Figure 9I). Finally, AIM2 and RNF125 were identified as candidate biomarkers through a Venn diagram (Figure 9J).

Figure 9 Machine learning for screening candidate biomarkers. (A) Venn diagram showing the intersection of common genes obtained from WGCNA and DEGs. (B and C) Five genes identified as the most suitable for COPD diagnosis based on the Lasso regression algorithm at the minimum binomial deviance. (D) The top 26 genes with the minimum error and highest accuracy in COPD selected based on SVM-RFE. (E) Venn diagram of intersecting genes obtained from two machine learning algorithms in COPD. (F and G) Eleven genes identified as the most suitable for diagnosing sepsis based on the Lasso regression algorithm at the minimum binomial deviance. (H) The top 21 genes with the minimum error and highest accuracy in sepsis selected based on SVM-RFE. (I) Venn diagram of intersecting genes obtained from two machine learning algorithms in sepsis. (J) Venn diagram showing the overlap of final candidate biomarkers between COPD and sepsis.

Diag Nostic Value and Verification of Candidate Biomarkers

To verify whether AIM2 and RNF125 serve as diagnostic markers for COPD and sepsis, we further analyzed their expression levels and diagnostic efficacy. As shown in Figure 10A and B, for the COPD dataset, the expression levels of these two genes were significantly higher in the disease group compared to the control group. In Figure 10C and D, for the sepsis dataset, the expression level of AIM2 was significantly higher in the disease group compared to the control group, while the expression level of RNF125 was significantly lower in the disease group. The ROC curve analysis indicated that each biomarker had good predictive performance (Figure 10E–H): for the COPD dataset (GSE5058), AIM2 (AUC: 0.967, 95% CI: 0.911–1.000) and RNF125 (AUC: 0.986, 95% CI: 0.950–1.000) (Figure 10E and F); for the sepsis dataset (GSE95233), AIM2 (AUC: 0.983, 95% CI: 0.949–1.000) and RNF125 (AUC: 0.999, 95% CI: 0.995–1.000) (Figure 10G and H).

Figure 10 Assessment of the diagnostic value of candidate biomarkers in the discovery dataset. (A and B) Expression differences of two shared genes in the COPD (GSE5058) discovery dataset. (C and D) Expression differences of two shared genes in the sepsis (GSE95233) discovery dataset. (E and F) ROC curves for the two shared genes in the COPD (GSE5058) discovery dataset. (G and H) ROC curves for the two shared genes in the sepsis (GSE95233) discovery dataset.

To further assess the accuracy of the candidate biomarkers, we validated them in two external datasets (GSE8545 for COPD and GSE57065 for sepsis). Consistent with previous results, the expression of both biomarkers was significantly upregulated in the disease group of the validation COPD dataset (Figure 11A and B), and the ROC curve analysis confirmed that all candidate genes had good diagnostic value (Figure 11E and F). In the validation sepsis dataset, the expression of AIM2 was significantly upregulated, while the expression of RNF125 was significantly downregulated in the disease group (Figure 11C and D), and the ROC curve analysis confirmed that all candidate genes had good diagnostic value (Figure 11G and H). In summary, the above research results indicate that AIM2 and RNF125 can serve as promising diagnostic biomarkers for COPD and sepsis.

Figure 11 Assessment of the diagnostic value of candidate biomarkers in the validation dataset. (A and B) Expression differences of two shared genes in the COPD (GSE8545) validation dataset. (C and D) Expression differences of two shared genes in the sepsis (GSE57065) validation dataset. (E and F) ROC curves for the two shared genes in the COPD (GSE8545) validation dataset. (G and H) ROC curves for the two shared genes in the sepsis (GSE57065) validation dataset.

Analysis of Immune Cell Infiltration

Both COPD and sepsis are infection-related diseases characterized by strong immune responses, which prompted us to use CIBERSORT to analyze the proportions of immune-infiltrating cells. Figure 12A and B show the distribution of 22 immune cell types in COPD and sepsis samples, respectively. Correlation heatmaps of immune cell interactions are presented in Figure 12C and D (Figure 12C for COPD and Figure 12D for sepsis).

Figure 12 Analysis of immune cell infiltration. (A–D) Percentages of 22 immune cells identified by the CIBERSORT algorithm in the COPD dataset (GSE5058) and sepsis dataset (GSE95233). (E and F) Box plots showing the proportions of immune cells in COPD and sepsis patients versus controls. (G and H) Correlation of AIM2 (G) and RNF125 (H) with infiltrating immune cells in COPD and normal samples. (I and J) Correlation of AIM2 (I) and RNF125 (J) with infiltrating immune cells in sepsis and normal samples.

In COPD samples, we observed dysregulation in B cells memory, plasma cells, activated CD4 memory T cells, follicular helper T cells, gamma delta T cells, activated NK cells, M2 macrophages, resting dendritic cells, and eosinophils compared to normal samples (Figure 12E). In sepsis samples, dysregulated levels were observed in naïve B cells, plasma cells, CD8 T cells, naïve CD4 T cells, resting and activated CD4 memory T cells, follicular helper T cells, gamma delta T cells, resting NK cells, monocytes, M0, M1, and M2 macrophages, resting dendritic cells, resting and activated mast cells, eosinophils, and neutrophils compared to normal samples (Figure 12F).

Next, we investigated the correlation between the shared biomarkers and immune cells. In COPD samples, AIM2 was negatively correlated with follicular helper T cells, plasma cells, and M2 macrophages, while it was positively correlated with M0 macrophages, eosinophils, and resting dendritic cells (Figure 12G). RNF125 exhibited a negative correlation with regulatory T cells (Tregs), follicular helper T cells, plasma cells, activated NK cells, and M2 macrophages, and a positive correlation with M0 macrophages, eosinophils, and resting dendritic cells (Figure 12H). In sepsis samples, AIM2 was negatively correlated with CD8 T cells, resting CD4 memory T cells, resting NK cells, and resting dendritic cells, but positively correlated with gamma delta T cells, plasma cells, neutrophils, monocytes, M0 macrophages, activated dendritic cells, and memory B cells (Figure 12I). RNF125 was negatively correlated with gamma delta T cells, follicular helper T cells, naïve CD4 T cells, plasma cells, neutrophils, monocytes, M1 and M0 macrophages, while it was positively correlated with CD8 T cells, resting CD4 memory T cells, activated CD4 memory T cells, resting NK cells, resting mast cells, M2 macrophages, and resting dendritic cells (Figure 12J).

These findings suggest that AIM2 and RNF125 are closely linked to immune cell types, particularly macrophages, dendritic cells, and regulatory T cells, underscoring their potential role in the immune landscape of both COPD and sepsis.

Construction of TF (Transcription Factors)-Gene and Gene-miRNA Regulatory Networks

Transcription factors (TFs) and microRNAs (miRNAs) play crucial roles in understanding disease progression. To identify the key transcriptional and post-transcriptional regulatory elements for the shared biomarkers, we constructed TF-gene and gene-miRNA networks, thereby deepening our insight into the regulatory mechanisms of the disease. The TF-gene network consists of 15 nodes and 17 edges (Figure 13A), while the gene-miRNA network includes 83 nodes and 86 edges (Figure 13B). In the TF-gene network, RELA, NFKB1, PPARG, and FOXC1 were found to interact with both hub genes, suggesting their potential as key transcriptional regulators. Similarly, in the gene-miRNA network, hsa-miR-196a-5p, hsa-miR-34a-5p, hsa-miR-124-3p, hsa-miR-21-5p, and hsa-miR-203a-3p were identified as interacting with the two hub genes, highlighting their possible role as post-transcriptional regulators. These findings suggest that these transcription factors and miRNAs may serve as common regulatory elements for the hub genes, though further validation is required to confirm their functional roles.

Figure 13 Gene-miRNA network of Tfs genes and two shared genes. (A) Regulatory network of Tfs genes for the two shared genes. (B) Gene-miRNA regulatory network for the two shared genes.

Identification of Candidate Drugs

We utilized the DSigDB database on the Enrichr platform to search for potential drugs targeting the hub genes, resulting in the identification of 32 gene-drug associations, with detailed results provided in Supplementary Table S13. We extracted the top 10 drugs sorted by p-value, which include carbamazepine (HL60 DOWN), Gadodiamide hydrate (CTD 00002623), Melittin (CTD 00006261), metoclopramide (HL60 DOWN), benfotiamine (PC3 DOWN), isotretinoin (HL60 UP), MeIQx (CTD 00001739), iohexol (HL60 DOWN), suloctidil (HL60 UP), and Aizen uranine (BOSS) (Table 3).

Table 3 Predicted Top 10 Drug Compounds

Molecular docking was performed between the drug molecular targets identified and the corresponding key genes. Protein structures of the gene targets were downloaded from the PDB database, and molecular ligands of the drugs were retrieved from PubChem for docking analysis. The docking results indicated that the binding energies of the drug ligands with the potential gene targets were all below −4.5 kJ/mol, suggesting that the conformational energies of the drug ligands with the gene targets are low, indicating strong binding activity (Table 4). Notably, the three-dimensional structures of Gadodiamide hydrate (CTD 00002623) and Melittin (CTD 00006261) were not found in PubChem. Additionally, the docking results for suloctidil (HL60 UP) showed that no hydrogen bonds were formed between the drug molecule and the corresponding gene. Therefore, only seven drug targets were validated through molecular docking, as shown in Figure 14.

Table 4 Corresponding IDs and Docking Binding Energies of Each Drug Ligand with Potential Gene Targets

Figure 14 Molecular docking diagrams of seven drugs. (A) The molecular docking results of drug Aizen uranine and gene AIM2. (B) The molecular docking results of drug Isotretinoin and gene AlM2. (C) The molecular docking results of drug Benfotiamine and gene RNF125. (D) The molecular docking results of drug Carbamazepine and gene RNF125. (E) The molecular docking results of drug lohexol and gene RNF125. (F) The molecular docking results of drug Metoclopramide and gene RNF125. (G) The molecular docking results of drug MelOx and gene RNF125.

Discussion

Chronic obstructive pulmonary disease (COPD) is a long-term respiratory condition characterized by persistent respiratory symptoms and airflow limitation, often accompanied by multiple complications.42,43 It is the third leading cause of death globally.44 Sepsis, a severe infection associated with high mortality, is closely linked to septic shock and multi-organ dysfunction.45 In recent years, the incidence of sepsis has risen, placing significant strain on healthcare systems.46 While several systematic reviews and meta-analyses have indicated a bidirectional relationship between COPD and sepsis, the underlying pathophysiology remains complex and not fully understood. Exploring the causal link between COPD and sepsis through Mendelian randomization (MR) analysis can help reduce the confounding effects of potential biases.

Bioinformatics analysis is a valuable tool for comprehensively understanding the genetic-level pathophysiological mechanisms of diseases. Although prior studies have investigated the mechanisms of COPD and sepsis independently, no studies have focused on their common mechanisms. This study integrates MR analysis with bioinformatics and machine learning methods to explore the shared transcriptional features between COPD and sepsis, with a particular focus on the mediating role of immune cells.

Initially, we used MR analysis to assess the role of immune cells as mediators in the increased risk of sepsis among COPD patients. Our goal was to explore the causal relationship between COPD and sepsis using existing genome-wide association study (GWAS) data, and to determine whether this relationship is mediated by immune cells. The results indicated that genetically predicted COPD is associated with a 14.3% increased risk of sepsis for each standard deviation increase in COPD severity. The mediation effects of immune cells were quantified as follows: CD33br HLA DR+ CD14- %CD33br HLA DR+ accounted for 6.5%, BAFF-R on IgD-CD38br for 12.8%, and CD45 on CD33- HLA DR+ for 3.9%. To our knowledge, this is the first study to use MR to investigate the role of immune cells in mediating the increased risk of sepsis in COPD. These findings align with previous observational studies that have shown a strong association between postoperative sepsis and COPD, particularly in surgeries such as craniotomy, left ventricular surgery, and hip fracture procedures.47–49 However, unlike observational studies, which are prone to reverse causality and confounding factors, our MR approach provides stronger causal inference.

Nevertheless, our study has several limitations. First, the analysis was conducted on a European population, which limits its generalizability to other populations. Second, the GWAS dataset for sepsis included a relatively small number of sepsis cases, and future studies would benefit from larger datasets for validation. Third, although we took measures to identify and remove pleiotropic effects, the possibility that pleiotropy influenced our results cannot be entirely ruled out. Fourth, the use of aggregate-level statistics instead of individual-level data precluded subgroup analyses, such as sex-specific causal relationships. Lastly, while we identified genetic prediction rates for immune cell-mediated COPD at 6.5%, 12.8%, and 3.9%, these rates are relatively low, suggesting the need for further research to identify additional mediators.

We identified 112 differentially expressed genes (DEGs) and 130 disease-associated module genes shared between COPD and sepsis. Functional enrichment analysis revealed that these genes were predominantly involved in innate and adaptive immune responses, as well as inflammatory processes. Collectively, these findings suggest that the regulation of immune response and cytokine secretion may be a crucial link between COPD and sepsis. By applying two machine learning algorithms (LASSO and SVM-RFE), we identified two key diagnostic genes, AIM2 and RNF125, for both diseases. The expression of these genes was significantly higher in COPD patients compared to the control group. In sepsis patients, AIM2 expression was elevated, while RNF125 expression was reduced relative to the control group. Furthermore, receiver operating characteristic (ROC) analysis demonstrated their potential as diagnostic biomarkers for COPD and sepsis, with area under the curve (AUC) values exceeding 0.80, underscoring their vital role in the diagnosis and progression of these conditions.

AIM2 (Absent in Melanoma 2), located on the 1q23 region of the human chromosome, consists of an N-terminal pyrin domain (PYD) and a C-terminal hematopoietic interferon-induced nucleoprotein domain (HIN200), which is a cytosolic DNA receptor belonging to the HIN-200 protein family. It is primarily localized in the cytoplasm, where the HIN domain recognizes double-stranded DNA (dsDNA). The PYD domain recruits and binds to apoptosis-associated speck-like proteins (ASC) to assemble inflammasomes.50–53 When AIM2 binds dsDNA, it recruits ASC and pro-Caspase-1, forming inflammatory complexes that lead to the secretion of pro-inflammatory cytokines IL-1β and IL-18, thus triggering pyroptosis.54–56 AIM2 plays a role in cancer,57 autoimmune diseases,58 infectious diseases,59 and cardiovascular conditions.60

View original article

INTERNATIONAL JOURNAL OF CHRONIC OBSTRUCTIVE PULMONARY DISEASE

Share Bookmark

0 0 0 0 0 0 0

More from this channel

Identifying Common Diagnostic Biomarkers and Therapeutic Targets between COPD and Sepsis: A Bioinformatics and Machine Learning Approach

Comments (0)