We computed drug similarity scores based on: (1) chemical substructures (Chem) of drugs, (2) sequences (Seq) of drug-targeted proteins, (3) biological processes (BP) of drug-related genes, (4) molecular functions (MF) of drug-related genes, and (5) cellular components (CC) of drug-related genes.
For the Chem-based similarity score, we used 881 chemical substructures defined in PubChem [18]. We assigned a binary vector to each drug according to the presence or absence of the corresponding substructures. For a pair of drugs with vectors \(\overrightarrow{a}=\left(a_{1},a_{2},\ldots,a_{881}\right)\) and \(\overrightarrow{b}=\left(b_{1},b_{2},\ldots,b_{881}\right)\) (\(a_{i}\) = 0/1 and \(b_{i}\) = 0/1), we computed the Chem-based similarity score as \(\frac{\#\left(a_{i}=1 \text{ and } b_{i}=1\right)}{\#\left(a_{i}=1 \text{ or } b_{i}=1\right)}\) (i.e., the Jaccard score) [16, 19].
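As a minimal illustration, the Jaccard score over binary substructure fingerprints can be sketched as follows (Python, not the authors' pipeline; the 8-bit vectors are made up, whereas the real fingerprints have 881 PubChem bits):

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard score between two binary substructure fingerprints."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # neither drug has any substructure
    return np.logical_and(a, b).sum() / union

# illustrative 8-bit fingerprints (real vectors have 881 PubChem bits)
drug_a = [1, 1, 0, 1, 0, 0, 1, 0]
drug_b = [1, 0, 0, 1, 1, 0, 1, 0]
print(jaccard_similarity(drug_a, drug_b))  # 3 shared / 5 present = 0.6
```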
For the Seq-based similarity score, we collected drug-targeted proteins from DrugBank [20]. We used the R package Biostrings to compute similarities between drug-targeted proteins with substitutionMatrix = BLOSUM62, gapOpening = 10, and gapExtension = 1 [21,22,23]. Let \(P\) denote a protein, \(\rho_{uv}\) denote the similarity between proteins \(P_{u}\) and \(P_{v}\), \(\mathcal{P}_{a}\) and \(\mathcal{P}_{b}\) denote the sets of drug-targeted proteins for two drugs, and \(1\left[ \bullet \right]\) denote the indicator function. For a pair of drugs with drug-targeted protein sets \(\mathcal{P}_{a}\) and \(\mathcal{P}_{b}\), we computed the Seq-based similarity score as:
$$\frac{\sum_{P_{u}\in \mathcal{P}_{a}}\sum_{P_{v}\in \mathcal{P}_{b}}\left\{1\left[P_{u}=P_{v}\right]+\rho_{uv}\times 1\left[\left(P_{u},P_{v}\right)\notin \left(\mathcal{P}_{a}\cap \mathcal{P}_{b}\right)\right]\right\}}{\sum_{P_{u}\in \mathcal{P}_{a}}\sum_{P_{v}\in \mathcal{P}_{b}}\left\{1\left[P_{u}=P_{v}\right]+1\left[\left(P_{u},P_{v}\right)\notin \left(\mathcal{P}_{a}\cap \mathcal{P}_{b}\right)\right]\right\}}$$
(1)
The similarity score in Eq. (1) was derived from the total number of overlapping proteins (i.e., the term \(1\left[P_{u}=P_{v}\right]\)) and the average similarity between non-overlapping proteins (i.e., the term \(\rho_{uv}\times 1\left[\left(P_{u},P_{v}\right)\notin \left(\mathcal{P}_{a}\cap \mathcal{P}_{b}\right)\right]\)), which ensured that two drugs with completely overlapping protein sets had a similarity score of 1.
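A direct transcription of Eq. (1) might look like the sketch below (Python, not the authors' code). The `rho` dictionary stands in for the Biostrings alignment similarities, and treating a pair as "non-overlapping" when neither protein is shared between the two drugs is our reading of the indicator term:

```python
def seq_similarity(prots_a, prots_b, rho):
    """Eq. (1): combine exact protein overlap with pairwise similarity
    rho[(u, v)] between non-overlapping proteins. Assumed reading: a
    pair is non-overlapping when neither protein is a shared target."""
    shared = prots_a & prots_b
    num = den = 0.0
    for u in prots_a:
        for v in prots_b:
            same = 1.0 if u == v else 0.0
            non_overlap = 1.0 if (u not in shared and v not in shared) else 0.0
            num += same + rho.get((u, v), rho.get((v, u), 0.0)) * non_overlap
            den += same + non_overlap
    return num / den if den else 0.0

# two drugs sharing P1; P2 vs. P3 have an alignment similarity of 0.4
a, b = {"P1", "P2"}, {"P1", "P3"}
rho = {("P2", "P3"): 0.4}
print(seq_similarity(a, b, rho))  # (1 + 0.4) / (1 + 1) = 0.7
```

Note that two drugs with identical protein sets obtain a score of exactly 1, as the text requires.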
For the BP-, MF-, and CC-based similarity scores, we collected drug-targeted genes and their ontologies from Gene Ontology (GO) [24]. For each type of ontology (i.e., BP, MF, or CC), we used the R package GOSemSim to compute similarities between drug-targeted genes with measure = Wang [25, 26]. For a pair of drugs, we computed the BP-, MF-, and CC-based similarity scores using Eq. (1), in which \(P\), \(\mathcal{P}_{a}\), and \(\mathcal{P}_{b}\) refer to genes and gene sets.
2.2 US Food and Drug Administration (FDA)’s Adverse Event Reporting System (FAERS) Data Preparation

We followed Banda et al. and Wang et al. to process FAERS data (2004Q1–2022Q2) [27, 28]. We removed duplicates and normalized drug names to drug ingredient names according to RxNorm and DrugBank [20, 29]. We used MedDRA® Preferred Terms (PT) as ADE names [30]. MedDRA® (the Medical Dictionary for Regulatory Activities terminology) is the international medical terminology developed under the auspices of the International Conference on Harmonization of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH). The MedDRA® trademark is owned by the International Federation of Pharmaceutical Manufacturers and Associations (IFPMA) on behalf of the ICH. We excluded reports with > 10 drug ingredient names due to the high uncertainty in causal relationships involving > 10 ingredients, and included 12 million reports with ≤ 10 drug ingredient names. In our processed data, the Chem-, Seq-, BP-, MF-, and CC-based similarity scores covered all two-drug pairs among 2,659 drugs (Chem), 1,659 drugs (Seq), 692 drugs (BP), 692 drugs (MF), and 692 drugs (CC), respectively.
We included 445 relatively newer drugs with: (1) ≤ 50 total reports between 2004Q1 and 2008Q4 and ≥ 500 in 2022Q2; and (2) available similarity scores with other drugs. We derived matched datasets for the newer drugs to control for confounding. For each newer drug, reports with and without the drug were matched on year/quarter (exact match), number of drugs in the report (exact match), and concomitant drugs in the report (best match). For instance, a report including a newer drug and two concomitant drugs (e.g., new drug + A + B) could be matched with other reports including three drugs (e.g., X + A + B, X + Y + B, X + Y + A, or X + Y + Z), in which “X + A + B” was defined as the best match. As we focused on early ADE detection, we followed newer drugs until their total reports exceeded 5,000. We selected a matching ratio of 1:100 for reports with versus without a newer drug. We computed the percentages of matched concomitant medications.
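The best-match step can be sketched as follows (an illustrative Python fragment, not the authors' implementation; exact matching on year/quarter and on the number of drugs per report is assumed to have already filtered the candidate reports):

```python
def match_score(case_drugs, candidate_drugs, new_drug):
    """Best-match score: how many of the case report's concomitant drugs
    (everything except the new drug) the candidate report also contains."""
    concomitant = case_drugs - {new_drug}
    return len(concomitant & candidate_drugs)

# the paper's example: case report = new drug + A + B; candidates are
# other three-drug reports without the new drug
case = {"new", "A", "B"}
candidates = [{"X", "A", "B"}, {"X", "Y", "B"}, {"X", "Y", "Z"}]

best = max(candidates, key=lambda c: match_score(case, c, "new"))
print(sorted(best))  # ['A', 'B', 'X'] -- shares both concomitant drugs
```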
2.3 Bayesian Drug Similarity-Based Adverse Drug Event (ADE) Detection

We used a Bayesian approach, as the posterior probability of the null hypothesis provided false-positive control in high-throughput mining [31, 32], especially when investigating a large number of drug-ADE pairs [10]. We used the following notation. For a drug and an ADE, let \(N\) and \(X\) be the frequencies of the drug and the drug-ADE pair, respectively. Let \(\theta\) be the probability of the ADE (e.g., \(\theta\) = Prob[ADE | drug]). Let \(\theta^{0}\) be the reference value for testing \(\theta\), where \(\theta^{0}\) could be determined by subject-matter experts or in a data-driven manner (e.g., \(\theta^{0}\) = Prob[ADE | no drug]). Let \(\text{Bin}\left(X,N;\theta \right)\) denote the binomial distribution, \(\text{Beta}\left(\theta ;\alpha ,\beta \right)\) denote the beta distribution, \(\text{IB}\left(x;\alpha ,\beta \right)=\int_{0}^{x}\text{Beta}\left(\theta ;\alpha ,\beta \right)d\theta\) denote the incomplete beta function, and \(\text{BB}\left(X,N;\alpha ,\beta \right)\) denote the beta-binomial distribution. Let subscript \(i\) indicate the \(i\)th drug, subscript \(j\) indicate the \(j\)th ADE, and subscript \((k)\) indicate the \(k\)th time point (e.g., quarters after initial drug reporting in FAERS data).
We assumed \(\theta_{ij}\le \theta_{ij}^{0}\) under the null hypothesis (e.g., \(\text{H}_{0}\): Prob[ADE | drug] ≤ Prob[ADE | no drug]). We assumed \(X_{ij}\) ~ \(\text{Bin}\left(X_{ij},N_{i};\theta_{ij}\right)\) and \(\theta_{ij}\) ~ \(\text{Beta}\left(\theta_{ij};\alpha_{ij},\beta_{ij}\right)\) (i.e., the prior distribution). For a new drug with similarity scores, let \(X_{sj}\) and \(N_{s}\) be the corresponding frequencies for its similar drugs \(s\). The parameters of the prior distribution could be estimated by optimizing the objective function based on beta-binomial distributions (Eq. 2), where box-constrained optimization could be used to control the influence of the prior distribution and prevent potential overfitting in automatic high-throughput mining.
$$\hat{\alpha}_{ij\left(k\right)},\hat{\beta}_{ij\left(k\right)}=\underset{\left[\alpha_{ij\left(k\right)},\beta_{ij\left(k\right)}\right]\in \Omega}{\mathrm{argmax}}\prod_{s}\text{BB}\left(X_{sj\left(k\right)},N_{s\left(k\right)};\alpha_{ij\left(k\right)},\beta_{ij\left(k\right)}\right)$$
(2)
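Eq. (2) amounts to a box-constrained maximum-likelihood fit of a beta-binomial distribution to the similar drugs' counts. A minimal sketch with SciPy (the bounds, starting values, and counts are illustrative assumptions, not values from the paper):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import betabinom

def fit_beta_binomial(X, N, bounds=((0.01, 100.0), (0.01, 100.0))):
    """Eq. (2): box-constrained MLE of (alpha, beta) from the ADE counts X
    and report totals N of the top similar drugs. Maximizing the product of
    beta-binomial likelihoods = minimizing the summed negative log-PMF."""
    X, N = np.asarray(X), np.asarray(N)
    nll = lambda p: -betabinom.logpmf(X, N, p[0], p[1]).sum()
    res = minimize(nll, x0=[1.0, 1.0], method="L-BFGS-B", bounds=bounds)
    return res.x  # (alpha_hat, beta_hat)

# similar drugs' ADE counts and report totals (made-up numbers)
alpha_hat, beta_hat = fit_beta_binomial([2, 5, 1, 4], [100, 250, 80, 150])
print(alpha_hat, beta_hat)
```

The bounds play the role of the box constraint \(\Omega\) that limits the prior's influence.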
We defined the posterior probability of the null hypothesis as:
$$\text{Prob}\left(\theta_{ij}\le \theta_{ij}^{0}\,|\,X_{ij\left(k\right)},N_{i\left(k\right)}\right)=q_{ij\left(k\right)}=\text{IB}\left(\theta_{ij}^{0};\hat{\alpha}_{ij\left(k\right)}+X_{ij\left(k\right)},\hat{\beta}_{ij\left(k\right)}+N_{i\left(k\right)}-X_{ij\left(k\right)}\right)$$
(3)
We defined a signal as \(q_{ij\left(k\right)}\) < 0.05 and \(X_{ij\left(k\right)}\) > 0. The average of \(q_{ij\left(k\right)}\) at the times of detection estimated the false discovery rate (FDR) for mining all newer drugs and ADEs [10].
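Given a fitted prior, Eq. (3) and the signal rule reduce to evaluating a beta CDF (the incomplete beta). A sketch in Python (the Beta(1, 49) prior, the 2% reference rate, and the counts are illustrative assumptions):

```python
from scipy.stats import beta

def posterior_null_prob(x, n, alpha_hat, beta_hat, theta0):
    """Eq. (3): Prob(theta <= theta0 | x, n), i.e. the CDF of the
    posterior Beta(alpha_hat + x, beta_hat + n - x) evaluated at theta0."""
    return beta.cdf(theta0, alpha_hat + x, beta_hat + n - x)

# a drug-ADE pair with 30 reports out of 400, tested against a 2%
# background rate under an illustrative Beta(1, 49) prior
q = posterior_null_prob(30, 400, 1.0, 49.0, 0.02)
is_signal = (q < 0.05) and (30 > 0)  # the paper's signal rule
print(round(q, 4), is_signal)
```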
2.4 Performance Evaluation

We constructed a benchmark including 170,386 drug-labeled drug-ADE pairs according to SIDER and Demner-Fushman et al. [33, 34], where PT terms were available as ADE names. In the real-world data-based performance evaluation, we used benchmark drug-ADE pairs as true positives and non-benchmark drug-ADE pairs as true negatives. We summarize the performance evaluation analyses in the rest of this section and present details (e.g., reference methods, an additional real-world dataset, and simulation procedures) in the Online Supplemental Material, S1.
First, we analyzed the processed FAERS data. The tested method was the drug similarity-based posterior probability of the null hypothesis (Eq. 3) derived from the top-25 similar drugs. The reference methods (Online Supplemental Material, S1.1) included the empirical Bayes geometric mean under the multi-item gamma Poisson shrinker (EBGM(MGPS)) [6], the information component under a Bayesian confidence propagation neural network (IC(BCPNN)) [7], the positive false discovery rate under MGPS (pFDR(MGPS)) [35], and the posterior probability of the null hypothesis under MGPS (PP(MGPS)) [10]. As only the tested method, pFDR(MGPS), and PP(MGPS) provided false-positive control [10, 35], we detected ADE signals with FDR < 0.05 using these three methods and computed their times to detection. We compared the tested method, pFDR(MGPS), and PP(MGPS) on time to detection, and compared all methods on prioritizing the benchmark ADEs (i.e., area under the curve (AUC)).
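The AUC used for benchmark prioritization is the probability that a benchmark (true-positive) drug-ADE pair is ranked above a non-benchmark pair by the signal score. A tie-free Mann-Whitney sketch (scores and labels are illustrative, not results from the paper):

```python
import numpy as np

def auc_from_scores(scores, labels):
    """Mann-Whitney AUC for distinct scores: probability that a benchmark
    pair (label True) outranks a non-benchmark pair (label False)."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, bool)
    ranks = scores.argsort().argsort() + 1  # 1-based ranks, ascending
    n_pos, n_neg = labels.sum(), (~labels).sum()
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# score = 1 - q, so that a higher score means a stronger signal
print(auc_from_scores([0.9, 0.8, 0.4, 0.2], [True, False, True, False]))  # 0.75
```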
Second, we analyzed ~1.5 million potential ADE cases derived from Optum’s de-identified Clinformatics® Data Mart Database (Online Supplemental Material, S1.2) [36]. The data included four types of ADEs (i.e., acute myocardial infarction, acute kidney injury, fall, and hypoglycemia) and 358 relatively newer drugs as of January 2008. The tested method was the drug similarity-based posterior probability of the null hypothesis (Eq. 3) derived from the top-25 similar drugs. The reference methods (Online Supplemental Material, S1.2) included the positive false discovery rate under the precision mixture risk model (pFDR(PMRM)) [35] and the posterior probability of the null hypothesis under PMRM (PP(PMRM)) [36]. We detected ADE signals with FDR < 0.05 under all approaches and computed times to detection. We compared the tested method with the reference methods on time to detection and on prioritizing the benchmark ADEs (i.e., AUC).
Third, we conducted simulation studies. The simulation procedures are described in the Online Supplemental Material, S1.3. In short, we used the prior distributions of ADE probabilities and the marginal reporting frequencies estimated in the FAERS data analysis to simulate ADE probabilities and frequencies of drug-ADE pairs (i.e., \(\theta_{ij}\) ~ \(\text{Beta}\left(\theta_{ij};\hat{\alpha}_{ij},\hat{\beta}_{ij}\right)\) and \(X_{ij}\) ~ \(\text{Bin}\left(X_{ij},N_{i};\theta_{ij}\right)\)). We determined condition positives and negatives by comparing the simulated ADE probabilities with the corresponding ADE reporting rates in the FAERS data (i.e., the null hypothesis). As similarity-based prior risk distributions could be mis-specified in practice, we further assumed that 20% of the prior risk distributions were mis-specified. For the tested method, we used true parameters (i.e., correctly specified prior risk distributions) or randomly selected parameters (i.e., mis-specified prior risk distributions) to compute the drug similarity-based posterior probability of the null hypothesis, depending on randomly simulated binary indicators with a probability of success of 0.8. For the reference methods, we computed pFDR(MGPS) and PP(MGPS). We detected signals with FDR < 0.05 using all methods (i.e., predicted positives). We conducted 10,000 simulations and computed the averages of the FDR, false omission rate, probability of detection, specificity, and AUC.
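One round of the simulation loop can be sketched as follows (all settings, including the Beta(1, 49) prior, 5,000 pairs, 400 reports per pair, and the 2% background rate, are assumptions for illustration; the paper repeats this 10,000 times with parameters estimated from FAERS and also perturbs 20% of the priors):

```python
import numpy as np
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)

# illustrative settings, not the paper's estimates
alpha_hat, beta_hat, n_pairs, n_reports, theta0 = 1.0, 49.0, 5000, 400, 0.02

theta = rng.beta(alpha_hat, beta_hat, size=n_pairs)  # simulated ADE probabilities
x = rng.binomial(n_reports, theta)                   # simulated drug-ADE frequencies
condition_positive = theta > theta0                  # truth vs. the null hypothesis

# posterior probability of the null for every pair, then the signal rule
q = beta_dist.cdf(theta0, alpha_hat + x, beta_hat + n_reports - x)
predicted_positive = (q < 0.05) & (x > 0)

# realized false discovery rate in this round
fdr = (predicted_positive & ~condition_positive).sum() / max(predicted_positive.sum(), 1)
print(predicted_positive.sum(), round(fdr, 3))
```

Averaging such per-round metrics over many rounds yields the reported FDR, false omission rate, probability of detection, specificity, and AUC.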