The baseline assessments were conducted from October 2017 to February 2022, with longitudinal follow-ups planned every 3–5 years. To ensure efficient participant management, we developed a Member Management Information System (MMIS), which integrates electronic questionnaires and participant records on a secure commercial cloud server (Alibaba Cloud). Eligible residents were invited to their nearest community healthcare service center for physical examinations. Upon arrival, participants were registered using their unique identification card number, followed by a face-to-face interview and physical examination conducted by trained healthcare personnel (Fig. 1).
Fig. 1
Baseline recruitment, data collection, biobanking, and dynamic follow-up structure of the Guangdong Biobank Cohort (GDBC)
The GDBC cohort enrolled 35,081 adults, covering approximately 30–40% of local residents aged 40–84 years (based on 2016 Xiaolan Town Public Health Service Center data, Table S1, Fig. S1).
Electronic questionnaire investigation
Participants completed structured face-to-face interviews using the electronic questionnaire integrated into the MMIS, administered by trained healthcare personnel via tablet computers or mobile devices. The questionnaire was designed with reference to the China Kadoorie Biobank to ensure comprehensive coverage of key epidemiological variables while adapting to the unique regional characteristics of Guangdong. It encompassed 205 items across multiple domains (Table 1). Demographic information included age, gender, education level, marital status, nationality, birth data, family address, and rurality. Lifestyle factors covered cigarette smoking, alcohol consumption, physical activity, dietary habits, and Guangdong-specific lifestyle behaviors. Oral hygiene was evaluated through indicators such as frequency of teeth brushing, age at which teeth brushing started, duration of teeth brushing, tooth loss, number of dental caries, age at first tooth loss, and toothache symptoms. Personal medical history documented the participants’ history of NCDs such as hypertension, diabetes, and cancers, as well as results of previous endoscopies, CT scans, X-rays, EBV antibody testing, HPV testing, H. pylori testing, and other related examinations. Medication history recorded the use of antibiotics, calcium supplements, fish oil, vitamins, metformin, NSAIDs, and other commonly used medications. Family disease history focused on common cancers, including lung, breast, and colorectal cancer, as well as other chronic diseases among parents, siblings, and children. For female participants, reproductive history was also collected, covering age at menarche, age at menopause, menstrual cycle characteristics, pregnancy history, hormone therapy use, age at first birth, and breastfeeding history.
Table 1
Baseline data collection overview in the Guangdong biobank cohort study
Each interview lasted approximately 30–40 min, and the system was equipped with a real-time audio recording feature. To ensure data quality, approximately 2–5% of interviews were randomly selected for quality control, where designated personnel reviewed the recordings and cross-checked responses for accuracy and consistency. All audio recordings were permanently stored on a cloud-based server, ensuring data traceability and reliability.
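The random selection of interviews for recording review could be sketched as follows. This is a minimal illustration, not the MMIS implementation: the function name, the interview ID format, and the default fraction of 3% (picked from within the stated 2–5% range) are all assumptions made for the example.

```python
import math
import random


def select_for_audio_review(interview_ids, fraction=0.03, seed=None):
    """Return a random subset of interview IDs for recording review.

    fraction is the share of interviews to audit. The 3% default is a
    hypothetical value inside the 2-5% range described in the text.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ids = list(interview_ids)
    if not ids:
        return []
    # Round up so that small batches still contribute at least one interview.
    n_review = max(1, math.ceil(len(ids) * fraction))
    rng = random.Random(seed)  # a seeded generator makes the draw reproducible
    return sorted(rng.sample(ids, n_review))


if __name__ == "__main__":
    batch = [f"INT{i:05d}" for i in range(1, 1001)]  # hypothetical IDs
    print(select_for_audio_review(batch, seed=1)[:5])
```

Fixing the seed per batch would let a second reviewer reproduce exactly which recordings were drawn, which fits the traceability goal described above.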
Physical examination
A comprehensive physical examination was conducted for all participants during the baseline survey, following standardized protocols. Anthropometric measurements included height, weight, waist circumference, hip circumference, and grip strength. Height and weight were measured for each participant using an integrated digital height-weight scale (DHM-300G, Zhengzhou Dingheng Electronics Technology Co., Ltd.). Participants were required to remove shoes and wear light clothing. Height was measured with participants standing upright, heels and knees together, using a stadiometer, while weight was recorded using a calibrated digital scale. Waist circumference was measured at the midpoint between the superior border of the iliac crest and the lower rib margin, with a flexible measuring tape placed horizontally around the abdomen. Hip circumference was recorded at the widest part of the buttocks, ensuring the tape remained parallel to the floor. Grip strength was assessed using a hand dynamometer (Xiangshan dynamometer, Manufacturer: Guangdong Xiangshan Weighing Instrument Group Co., Ltd.), with each participant performing two trials with each hand, and the average value was recorded. Body composition analysis was performed using a bioelectrical impedance analyzer (PICOOC, Manufacturer: Youpin International Technology [Shenzhen] Co., Ltd.), which measured the percentages of body fat, bone and muscle mass, visceral fat, water, skeletal muscle, subcutaneous fat, and basal metabolic rate. Vital sign assessments included blood pressure, heart rate, and respiratory rate. Blood pressure was measured in both upper arms, aligned with heart level, using an automated blood pressure monitor (Yuwell, Manufacturer: Jiangsu Yuyue Medical Equipment & Supply Co., Ltd.) after participants had rested for at least 10 min. If the initial reading was abnormal, a repeat measurement was conducted after at least 30 min. Heart rate was obtained from the blood pressure monitor, while respiratory rate was manually counted using a stopwatch.
In addition, head and neck examinations were conducted to assess thyroid abnormalities, lymph node enlargement, and other structural anomalies. Vision assessment included visual acuity testing and screening for refractive errors. Dental examinations evaluated oral hygiene, periodontal disease, cavities, missing teeth, and other dental conditions. Gynecological examination was performed for female participants.
Laboratory measurements
Participants underwent a series of laboratory tests to evaluate hematological, metabolic, hepatic, renal, and urinary parameters, as well as tumor biomarkers. All tests followed standardized clinical protocols to ensure accuracy and reproducibility.
Routine blood, biochemical, and urine tests were conducted at the laboratory departments of community health service centers, using automated analyzers. Hematological tests included complete blood counts (CBC), measuring hemoglobin (g/L), white blood cell count (10⁹/L), red blood cell count (10¹²/L), platelet count (10⁹/L), lymphocyte count (10⁹/L), lymphocyte percentage, mean corpuscular volume (fL), mean corpuscular hemoglobin (pg), mean corpuscular hemoglobin concentration (g/L), mean platelet volume (fL), and platelet distribution width (fL). Hematology was measured on an automated analyzer XN-1000 (Sysmex, Kobe, Japan/Shanghai, China) using fluorescence flow cytometry and DC detection with hydrodynamic focusing; hemoglobin was quantified by the SLS method. Metabolic assessments included fasting blood glucose (FBG, mmol/L) and lipid profiles, consisting of total cholesterol (TC, mmol/L), triglycerides (TG, mmol/L), high-density lipoprotein cholesterol (HDL-C, mmol/L), and low-density lipoprotein cholesterol (LDL-C, mmol/L). Hepatic and renal function tests included alanine aminotransferase (ALT, U/L), aspartate aminotransferase (AST, U/L), creatinine (µmol/L), urea (mmol/L), and uric acid (µmol/L). Serum chemistry (fasting glucose, lipid profile, liver/kidney function) was measured on an automated analyzer Polarisc2000 (KHB, Shanghai, China) using photometric end-point/rate methods. Urinalysis was performed to detect urine glucose, urine ketones, urine protein, and urine occult blood with an automated urine chemistry analyzer Mejer-700I (Meiqiao, Shenzhen, China) by reflectance photometry.
In addition, tumor biomarker tests were conducted on a subset of participants in specialized laboratories. Alpha-fetoprotein (AFP) and hepatitis B surface antigen (HBsAg) were assayed by ELISA (Autobio, China), and Epstein-Barr virus viral capsid antigen immunoglobulin A (EBV VCA-IgA) was assayed by ELISA (EUROIMMUN, Germany), all following the manufacturers’ instructions. Carcinoembryonic antigen (CEA), a widely used tumor marker for gastrointestinal and lung cancers, was measured by radioimmunoassay (RIA) at Guangzhou KingMed Diagnostics (CAP-accredited).
To maintain quality control, each laboratory followed internal calibration protocols, conducted regular proficiency testing, and implemented quality assurance procedures to minimize measurement variability. All test results were interpreted by certified laboratory physicians, ensuring clinical validity and data reliability.
Auxiliary examinations
Auxiliary diagnostic assessments were conducted to further evaluate participants’ health status. These examinations were performed by trained specialists following standardized clinical protocols. A standard 12-lead resting electrocardiogram (ECG) was recorded using an ECG-1112 M electrocardiograph (Shenzhen Kaiwo Electronics Co., Ltd., Shenzhen, China). Chest X-rays were obtained with an MXHF-1500DR digital radiography system (Beijing Zhongji Guobei Medical Technology Co., Ltd., Beijing, China). Abdominal ultrasonography was performed using an HS50 ultrasound system (Samsung Medison Co., Ltd., Seoul, South Korea) to evaluate the liver, gallbladder, pancreas, spleen, and kidneys. For female participants, pelvic ultrasonography was conducted with the same system, along with cervical screening including Pap smear and pathological examination where clinically indicated.
Biospecimen collection and biological biobank
To support multi-omics analyses and biomarker discovery, fasting blood and saliva samples were systematically collected from all participants at baseline. Blood samples (10 mL) were drawn after overnight fasting (≥ 8 h) using EDTA anticoagulant tubes. Following centrifugation at 3,500 rpm for 10 min, the plasma, buffy coat, and red blood cell fractions were separated and aliquoted into labeled tubes. Saliva samples (3 mL) were self-collected after fasting (no food or water intake) into collection tubes pre-filled with a stabilization buffer to preserve nucleic acids and microbial composition. For each participant, the blood samples were divided into five aliquots, including three tubes of plasma, one tube of buffy coat (leukocytes), and one tube of red blood cells, while saliva samples were divided into three aliquots. Each sample was assigned a unique participant ID and linked to a QR code, which was scanned into the biobank management system for real-time tracking of sample location, volume, and processing details.
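The per-participant aliquot layout described above can be written down as a small data structure. The sketch below is illustrative only: the class, the function, and especially the label format are hypothetical, since the actual QR code payload and the biobank management system schema are not described here.

```python
from dataclasses import dataclass

# Aliquot layout per participant as described in the text:
# three plasma, one buffy coat, one red blood cell, three saliva.
ALIQUOT_PLAN = (
    ("plasma", 3),
    ("buffy_coat", 1),
    ("red_blood_cells", 1),
    ("saliva", 3),
)


@dataclass(frozen=True)
class Aliquot:
    participant_id: str
    fraction: str
    index: int  # 1-based position within the fraction

    @property
    def label(self) -> str:
        # Hypothetical label format; the real QR payload is not described.
        return f"{self.participant_id}-{self.fraction}-{self.index:02d}"


def plan_aliquots(participant_id: str) -> list[Aliquot]:
    """List the aliquots expected for one participant at baseline."""
    return [
        Aliquot(participant_id, fraction, i)
        for fraction, count in ALIQUOT_PLAN
        for i in range(1, count + 1)
    ]


if __name__ == "__main__":
    for aliquot in plan_aliquots("P0001"):  # hypothetical participant ID
        print(aliquot.label)
```

Comparing the scanned tubes against such an expected list is one simple way to detect a missing or mislabeled aliquot at intake.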
All processed samples were stored in −80 °C freezers. All freezers are equipped with real-time temperature monitoring and alarm systems, with alerts triggered when the temperature exceeds a predefined threshold (typically −70 °C). Freezers are linked to a centralized system, and routine inspections are performed under SOPs. The power supply is configured with a dual-circuit redundant system, ensuring uninterrupted operation during emergency scenarios. To ensure high-quality biospecimens, the entire process, from sample collection to aliquoting and storage at −80 °C, was completed within six hours. Rigorous quality control protocols were implemented, including integrity assessments in randomly selected samples, confirming their suitability for genomics, transcriptomics, epigenomics, proteomics and microbiomics research. The GDBC biobank currently houses over 400,000 aliquots, providing a valuable resource for future studies.
Multi-omics sub-cohort
The first phase of the study prioritized multi-omics profiling in a sub-cohort to investigate the genetic and microbial contributions to disease susceptibility and their interactions with environmental and lifestyle factors.
A subset of 2,530 participants underwent genome-wide genotyping using the Illumina Infinium Global Screening Array-24 Kit. Genomic DNA was extracted from buffy coat fractions of blood samples following standardized protocols. Quality control (QC) procedures were applied at both variant and individual levels. Variants with low call rates (< 95%), minor allele frequency (MAF) < 0.01, or significant deviations from Hardy-Weinberg equilibrium (P < 10⁻⁷ in controls or P < 10⁻¹² in cases) were excluded. Individuals with sex discrepancies, low genotype call rates (< 95%), extreme heterozygosity (> 6 SD), cryptic familial relatedness (PI_HAT > 0.25), or population outlier status (determined via PCA using EIGENSTRAT) were removed. To enhance genotype resolution, imputation analyses were performed using different methods for MHC and non-MHC regions. For non-MHC regions, phasing was conducted with SHAPEIT (v2.12), and imputation was performed using IMPUTE2, with the 1000 Genomes Phase III dataset as the reference panel. For MHC regions, SNP2HLA was used with the Han Chinese reference panel (BGI, n = 10,689). Variants with low imputation quality or abnormal allele frequencies were removed [31].
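The variant-level exclusion criteria can be expressed as a simple filter. The sketch below only illustrates the logic of the thresholds given in the text; it is not the pipeline used in the study, which would normally rely on dedicated genetics software, and the `VariantStats` record and function names are hypothetical.

```python
from dataclasses import dataclass

# Thresholds taken from the text.
MIN_CALL_RATE = 0.95
MIN_MAF = 0.01
HWE_P_CONTROLS = 1e-7
HWE_P_CASES = 1e-12


@dataclass
class VariantStats:
    variant_id: str
    call_rate: float       # fraction of samples with a non-missing genotype
    maf: float             # minor allele frequency
    hwe_p_controls: float  # Hardy-Weinberg P value among controls
    hwe_p_cases: float     # Hardy-Weinberg P value among cases


def failed_filters(v: VariantStats) -> list[str]:
    """Return the names of the QC filters that a variant fails."""
    reasons = []
    if v.call_rate < MIN_CALL_RATE:
        reasons.append("call_rate")
    if v.maf < MIN_MAF:
        reasons.append("maf")
    if v.hwe_p_controls < HWE_P_CONTROLS or v.hwe_p_cases < HWE_P_CASES:
        reasons.append("hwe")
    return reasons


def apply_variant_qc(variants):
    """Keep only the variants that pass every filter."""
    return [v for v in variants if not failed_filters(v)]
```

Returning the list of failed filters, rather than a single pass/fail flag, makes it straightforward to tabulate how many variants each criterion removes.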
To evaluate population structure and genetic homogeneity, we conducted principal component analysis (PCA) on high-quality autosomal SNPs after linkage disequilibrium (LD) pruning with PLINK (v1.9) (window size 200 SNPs, step 50 SNPs, r² = 0.1). Runs of homozygosity (ROH) were estimated using PLINK (v1.9) (minimum length = 1 Mb, ≥ 50 SNPs). Genomic control (λ_GC) analysis was performed by conducting genome-wide association analyses on (i) a randomly simulated phenotype and (ii) sex, with λ_GC calculated from the median χ² statistic of autosomal SNPs.
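For reference, the genomic control factor is conventionally the median observed χ² statistic divided by the median of the χ² distribution with one degree of freedom (about 0.455). The following minimal sketch assumes 1-df association tests, which the text does not state explicitly; the function name is illustrative.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom,
# obtained as the squared 75th percentile of the standard normal (~0.4549).
CHI2_1DF_MEDIAN = statistics.NormalDist().inv_cdf(0.75) ** 2


def genomic_inflation(chi2_stats):
    """Genomic control factor: median observed chi-square / expected median."""
    values = list(chi2_stats)
    if not values:
        raise ValueError("at least one chi-square statistic is required")
    return statistics.median(values) / CHI2_1DF_MEDIAN
```

Values close to 1 indicate little inflation of test statistics, which is the expected outcome for a randomly simulated phenotype in a genetically homogeneous sample.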
To explore the role of the oral microbiome in disease development and its interactions with host genetic and environmental factors, 16S rRNA sequencing was performed on 2,049 participants. Saliva samples were collected in preservative-filled tubes, and microbial DNA was extracted using the PowerSoil DNA Isolation Kit (Qiagen, Germany). The V4 region of the 16S rRNA gene was amplified using the primer pair 515F/806R and sequenced on an Illumina MiSeq platform (2 × 250 bp paired-end reads). Raw sequencing data were demultiplexed based on sample-specific barcodes, and quality control and denoising were conducted using DADA2 to generate amplicon sequence variants (ASVs). Microbial composition and diversity were analyzed using the QIIME2 pipeline, enabling taxonomic classification, alpha and beta diversity analysis, and microbial community profiling [32].
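As an illustration of what an alpha diversity measure computes from an ASV count table, the sketch below calculates the Shannon index for one sample. The text does not state which diversity metrics were used, so the choice of Shannon is an assumption made purely for illustration, and in practice the values would come from the QIIME2 pipeline rather than hand-written code.

```python
import math


def shannon_index(counts):
    """Shannon diversity H' = -sum(p_i * ln p_i) over ASVs with nonzero counts."""
    positive = [c for c in counts if c > 0]
    total = sum(positive)
    if total == 0:
        raise ValueError("sample has no reads")
    return -sum((c / total) * math.log(c / total) for c in positive)


if __name__ == "__main__":
    sample_counts = [120, 30, 30, 20]  # hypothetical ASV read counts
    print(round(shannon_index(sample_counts), 3))
```

The index is zero when a single ASV accounts for all reads and reaches its maximum, ln(k), when k ASVs are equally abundant.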
All multi-omics data underwent rigorous quality control and preprocessing to ensure data integrity and reproducibility. This multi-omics sub-cohort provides a foundation for future large-scale analyses, supporting research into gene-environment interactions, host-microbiome interplay, and complex disease mechanisms.