The Seoul National University Bundang Hospital (SNUBH) cohort comprises 1,216 participants, consisting of 610 patients with GC and 606 non-cancer controls, with ages ranging from 26 to 80 years, who were recruited from the enrolled participants who visited the Digestive Disease Center, SNUBH. In our previous study [12], we conducted quality control (QC) procedures, guided by suggestions from Anderson et al. (Supplementary Fig. S1) [14], with using plink (v1.90b3.44)[15]. Following QC and imputation, we retained 4,962,361 variants from 968 participants (527 GC patients and 441 non-cancer controls) with complete data for cancer status, sex, age, H. pylori status, and smoking status.
Furthermore, from the National Biobank of Korea (NBK) (https://biobank.nih.go.kr/cmm/main/engMainPage.do), we obtained genetic data from 803 patients with GC, including 394 with DGC and 409 with IGC, whose genotypes were sequenced using the Affy 6.0 platform. In addition, within the Korean Genome and Epidemiology Study (KoGES) database, housed in the NBK, the control group comprised 3,693 urban residents from KoGES_health examinee (KoGES_HEXA) and 1,816 rural residents from KoGES_cardiovascular disease association study (KoGES_CAVAS). These databases were utilized to investigate susceptibility loci for chronic and cardiovascular diseases, respectively, with genotyping conducted using the same platform. We referred to the combined database of these three cohorts as GC_HC throughout the text. In GC_HC, we compared the genotypes of 803 patients diagnosed with DGC or IGC to those of the 5,509 participants. QC procedures were conducted separately for each dataset, following the same procedure as the SNUBH cohort, except for variant selection, which included only variants with a minor allele frequency (MAF) > 0.05 (Supplementary Fig. S1). The datasets were then merged. As the sequenced readings were aligned to the hg18 human reference genome, we lifted them over to the hg19 human reference genome using the LiftOver tool [16]. We performed QC procedures on the merged data, using the same cutoff values as in the previous steps. We generated a multi-dimensional scaling (MDS) plot to evaluate genetic heterogeneity among the three datasets, with variants pruned using the ‘–indep-pairwise 50 5 0.2’ option in plink (Supplementary Fig. S2A).
Next, we imputed the variants using the Michigan Imputation Server (v1.1) [17], employing the same parameters as those used for the SNUBH cohort (Supplementary Fig. S1). We removed variants with an imputation quality score < 0.8 and a MAF < 0.05 following imputation. Ultimately, we retained 4,541,617 variants from 6,233 participants (794 patients with GC, including 389 with DGCs and 405 with IGCs, and 5,439 controls.
We sub-grouped the participants based on the Lauren classification (diffuse or intestinal) and sex (female or male), and compared the variants of GC patients with those of controls within each cluster. For instance, in the SNUBH cohort, we compared the variants of 72 female patients with DGC to those of 248 female non-cancer controls, whereas in the GC_HC cohort, we compared the variants of 165 female patients with DGC to those of 2,958 female controls (Supplementary Fig. S1). Furthermore, female participants were stratified into two groups according to age (≥ 50 y or < 50 y), and variants were compared between DGC patients and controls in each group. The number of female patients with DGC and controls aged ≥ 50 y were 40 and 161 in the SNUBH cohort, and 82 and 1,953 in the GC_HC cohort. The number of female patients with DGC and controls aged < 50 y was 32 and 87 in the SNUBH cohort and 83 and 1,005 in the GC_HC cohort.
GWAS was conducted using SAIGE (v0.44.6.4) [18] in R (v4.1.1) environment with its default options, except for increasing the value of ‘maxiterPCG’ option from 500 to 1,000, and adjusting for the effects of sex (only when both sexes of participants were included), age, H. pylori status, smoking status, and the top four principal component (PC) scores in the SNUBH cohort, and sex, age and the top four PC scores in the GC_HC cohort. Summary statistics were utilized for meta-analysis with inverse variance weighting using METAL (version released on 2011–03-25) [19]. The qqman (v0.1.8) R package was used to generate Manhattan and quantile–quantile (QQ) plots [20]. The LocusZoom plot was created at http://locuszoom.org/ with default settings, except for designating the LD population as 1000G Nov 2014 ASN [21].
Validation of significant variants using the SNUBH2 and KoGES_Ansan/Ansung cohortsWe sequenced additional 102 patients with GC using the Korea Biobank Array to validate the results of the meta-analysis; the patients were selected from the same patient pool as the SNUBH cohort (referred to as SNUBH2). Within KoGES, genotypes from 5,493 participants in the KoGES_Ansan/Ansung cohort were sequenced utilizing the Korea Biobank array and served as controls in our study who were recruited to examine the associations between chronic diseases and lifestyle or diet from two cities in Korea. The combined cohorts were referred as SNUBH2_AA throughout the text. We employed the same QC and imputation steps as in the GC_HC cohort, resulting in the retention of 3,668,632 variants in 5,511 participants (101 patients with GC and 5,410 controls) (Supplementary Fig. S1). Among these participants, 25 females were diagnosed with DGC and 2,832 were controls. Before imputation, we generated an MDS plot to evaluate the genetic heterogeneity between participants in the SNUBH2_AA cohort using the same method used in the GC_HC cohort (Supplementary Fig. S2B). For the GWAS analysis, owing to the small case-to-control ratio, we matched patients with GC with controls using their top five PC scores, ensuring that each patient was matched with five controls without replacement. This was accomplished using the Matching R package (v4.9–9) [22]. Then we conducted a conditional logistic analysis using the survival R package (v.3.4-0) [23], adjusting for sex and age.
Identity-by-descent (IBD) among all participants was estimated using plink with pruned common variants across cohorts, selected using the same options as those used for generating MDS plots, and visualized as a heatmap with the pheatmap R package (v1.0.12) [24] (Supplementary Fig. S2C).
Gene-set analysisWe conducted gene-based tests using summary statistics from the GWAS results with two different methods. First, we conducted gene-based tests using MAGMA (v1.10) [25] with default options, as implemented in FUMA (v1.6.0) [26], accessible at https://fuma.ctglab.nl/. Second, we conducted tests using SPrediXcan (v0.7.5) [27] with stomach tissue as the target, modeled by MASHR. The MASHR-based model was constructed using the hg38 human reference genome. Therefore, before executing SPrediXcan, we updated the variant information in the summary statistics from hg19 to hg38 using the LiftOver tool.
Mediation analysisWe performed a mediation analysis to identify the genes that mediated the effects of significant variants on GC risk, adjusting for the effects of sex, age, and the top four PC scores. This analysis was conducted using the mediation R package (v4.5.0) [28] with 10,000 bootstrap iterations. Gene expression levels in stomach tissues were estimated from imputed datasets using PrediXcan (v0.7.5) [29] and the MASHR-based model.
Mendelian randomization studyMendelian randomization (MR) studies have been conducted to investigate the causal relationship between GC risk and the traits associated with the significant variants according to the OpenGWAS (https://gwas.mrcieu.ac.uk/). Summary statistics for laboratory phenotypes, such as red blood cell (RBC), white blood cell (WBC), neutrophil, and platelet counts [30], serum triglyceride [30], albumin [30], non-albumin protein (NAP) [31], and total protein (TP) levels [32], albumin-to-globulin (A/G) ratio [31], and estimated glomerular filtration rate (eGFR) [31] were obtained from the BioBank Japan Project (BBJ) and are accessible at https://pheweb.jp/.
For the MR study, we employed both Generalized Summary-data-based Mendelian Randomization (GSMR) [33], implemented in Genome-wide Complex Trait Analysis (GCTA) (v1.94.1) [34], and Mendelian Randomization Pleiotropy RESidual Sum and Outlier (MR-PRESSO) R package (v1.0) [35], GSMR was conducted using the default settings. In the case of MR-PRESSO, we designated the ‘OUTLIERtest’ and ‘DISTORTIONtest’ options as TRUE and set the value of the ‘NbDistribution’ option to increment from 1,000 to 10,000. This ensured that instrumental variables (IVs) were significantly associated with exposure (P < 5 × 10–8), did not violate the HEterogeneity In Dependent Instruments (HEIDI)-outlier assumption (P value of the HEIDI-outlier filtering analysis > 0.01). Also, no evidence of horizontal pleiotropic variants among the IVs, according to MR-PRESSO, was observed.
In addition, we performed multivariable MR (MVMR) using MVMR R package (v0.3) [36] to estimate the direct effects of both gene expression and phenotype on GC risk. GWAS summary statistics for gene expression were derived from predicted expression values for each cohort (SNUBH and GC_HC) using PrediXcan, and these were combined with inverse variance weighting by METAL. Significant variants were clumped using plink (‘–clump-p1 5e-8 –clump-p2 1 –clump-r2 0.05 –clump-kb 1000’), with variant correlation estimated from GC_HC cohort. BBJ GWAS summary statistics for traits were also used. IVs met the criteria for strong instruments (F statistics > 10) and showed no evidence of horizontal pleiotropy based on P value of Cochran’s Q statistic (PHet > 0.05). The ‘gencov’ option was set to zero due to different samples being used for estimating gene and trait effects on GC risk. The MVMR analysis was conducted using gene expression and only one phenotype at a time, as we could not determine overlapping samples across phenotypes from BBJ.
An MR mediation analysis was conducted to test whether the effect of the gene on GC risk was mediated by a specific phenotype. For this analysis, we estimated the causal effect using the inverse variance weighted (IVW) MR method, as in the MVMR approach, implemented in MendelianRandomization R package (v0.9.0) [37]. To assess the mediation effect, we performed a Sobel test [38] using MR analysis results from the relationships between the gene and each phenotype, as well as between each phenotype and GC risk.
Heritability estimationWe estimated the heritability of each phenotype using the LD score (LDSC; v1.0.1) [39]. For this analysis, only variants present in HapMap3 were included, utilizing pre-computed LD scores of EAS ancestry by LDSC authors. The value was considered as zero if the estimated heritability value was negative.
In addition, the local heritability of GC risk and related phenotypes was estimated using LAVA R package (v0.1.0) [40] to assess whether estrogen metabolism influences these traits. Candidate estrogen-related genes were selected based on Rending et al.[41], including Catechol-O-Methyltransferase (COMT), Cytochrome P450 Family 1 Subfamily A Member 1 (CYP1A1), Cytochrome P450 Family 1 Subfamily B Member 1 (CYP1B1), Cytochrome P450 Family 19 Subfamily A Member 1 (CYP19A1), Estrogen Receptor 1 (ESR1), Glutathione S-Transferase Mu 1 (GSTM1), Glutathione S-Transferase Pi 1 (GSTP1), Glutathione S-Transferase Theta 1 (GSTT1), Hydroxysteroid 17-Beta Dehydrogenase 1 (HSD17B1), Sulfotransferase Family 1A Member 1 (SULT1A1), and UDP Glucuronosyltransferase Family 1 Member A1 (UGT1A1), and genes associated with significant variants identified in our study. Genomic loci were defined as the region of each gene, obtained from the GeneCards database (https://www.genecards.org/), including a 500 kb flanking region in either side. EAS samples from the 1000G were used as references.
Analysis of known genesWe used the GWAS Catalog, accessible at https://www.ebi.ac.uk/gwas/, to identify genes mapped from variants significantly associated with GC (P < 5 × 10–8), excluding those associated with multiple cancers. For the selected genes, we evaluated their significance in relation to DGC risk and explored heterogeneity across sexes as predicted by SPrediXcan in stomach tissue. Cochran’s Q test was used to test for heterogeneity across sexes.
Comments (0)