DRIVE-KG: Enhancing variant-phenotype association discovery in understudied complex diseases using heterogeneous knowledge graphs

Abstract

Multi–omics data are instrumental in obtaining a comprehensive picture of complex biological systems. This is particularly useful for women's health conditions, such as endometriosis which has been historically understudied despite having a high prevalence (around 10% of women of reproductive age). Subsequently, endometriosis has limited genetic characterization: current genome–wide association studies explain only 11% of its 47% total estimated heritability. Graph representations provide an intuitive and meaningful way to relate concepts across diverse data sources and address fundamental sparsity and dimensionality challenges with multi–omics data analysis. Here we present DRIVE-KG (Disease Risk Inference and Variant Exploration–Knowledge Graph), which uses a heterogeneous graph representation to integrate biological data from multi–omics datasets: dbSNP, NCBI Human Gene, Omics Pred, GTEx, and Open Targets. We drew directly from the knowledge captured in these data, using nodes to represent genes, single nucleotide polymorphisms, proteins, and phenotypes, and edges to represent relationships between these concepts. We trained two models using DRIVE-KG: a link prediction model to suggest associations between SNPs and two pilot phenotypes (endometriosis and obesity), and a graph convolutional network (GCN) to classify patient–level endometriosis status. We conducted the patient–level classification using data from 1,441 Penn Medicine BioBank participants with gold standard chart–reviewed endometriosis status. The link prediction model uncovered 66 high–confidence (score ≥ 0.95) previously unreported SNP–endometriosis associations. Many of these variants were linked to obesity/body mass index traits (24.2%), lipid metabolism (6%), and depressive disorders (4.5%), showing agreement with emerging hypotheses about endometriosis etiology. In contrast, 11% of the 149 high confidence, candidate SNP–obesity associations (score ≥ 0.9888) were in LD with known obesity associations. The GCN to classify patient endometriosis status had an AUPRC of 0.738 compared to 0.679 for a genetic risk score. Despite this moderate improvement, we found that the GCN learned meaningful stratification of underlying adenomyosis signal and severe grades of endometriosis. We have demonstrated that heterogeneous integration of multi–omics data is valuable for diverse downstream tasks—including discovery and clinical prediction—particularly for understudied diseases where traditional genomic approaches are insufficient.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The PMBB is approved under IRB protocol #813913 and #852155 and supported by Perelman School of Medicine at University of Pennsylvania, a gift from the Smilow family, and the National Center for Advancing Translational Sciences of the National Institutes of Health under CTSA award number UL1TR001878. The research reported in this publication was supported by the Eunice Kennedy Shriver National Institute of Child Health and Human Development of the National Institutes of Health under award number R01HD110567-01A0, and by the US National Library of Medicine under award number R00LM013646.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board (or Ethics Committee) of University of Pennsylvania (protocol codes: 808346 approved 07/01/2008, 813913 approved 4/3/2013, and 817977 approved 6/6/2013, and 852155 approved on 08/24/2022) for studies involving humans.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present study are available upon reasonable request to the authors. All code used in this work is available at https://github.com/Setia-Verma-Lab/multi_omics_kg.

https://github.com/Setia-Verma-Lab/multi_omics_kg

Comments (0)

No login
gif