While AI is not a new concept, having first been conceptualized nearly 70 years ago, its popularity and use in multiple industries, including the pharmaceutical industry, have exploded in recent decades with the increasing availability of digital training data. Artificial intelligence has been successfully deployed to enhance drug discovery and development [30, 31], manufacturing [32], clinical trial design [33], and even marketing [34]. In addition, AI techniques have been used to increase efficiency and consistency in pharmacovigilance, a field built on repetitive tasks and large datasets, i.e., a natural candidate for AI automation. Indeed, AI has reduced case processing times, improved information capture (e.g., from spontaneous reports and social media), automated the extraction of products/AEs to standard dictionaries, enhanced the identification of drug–drug interactions [35], automated causality assessment [36], and more [11].
In the current review of AI applications in signal management, with respect to signal detection, we distinguished between the ‘detection’ of ADRs in written language and the formal process of signal detection, in which qualitative or quantitative methods are applied to identify safety signals in data already imported into a safety database (i.e., data to which statistical alerts have not yet been applied). This distinction is important because many studies leverage NLP or ML techniques to detect ADRs in unstructured data (e.g., medical charts, EHR databases, social media) [9] without clearly separating this case capture/entry process from the “proper” signal detection process described above. Historically, signal detection began with qualitative human review of ICSRs, a technique that still allows safety experts to apply clinical judgment to complex safety topics in low-case-volume scenarios [37]. However, as early as the era of the thalidomide tragedy in the 1960s [38], quantitative techniques to detect SDRs were being conceptualized [39] using a 2 × 2 contingency table format. These frequentist disproportionality analyses (e.g., ROR, PRR, relative reporting ratio, chi-square, Yule’s Q, Poisson probability) [40] became increasingly popular with the advent of digital safety databases. Around the turn of the century, the desire to combine quantitative methodologies with expert knowledge and past data in a feed-forward manner led to the adoption of Bayesian statistical methods, for instance, the Bayesian Confidence Propagation Neural Network [41], the first routinely used method to shift the status quo toward a quantitative-first, qualitative-second review paradigm. Around this time, association rule analysis (a technique closely related to the Bayesian Confidence Propagation Neural Network) [42] and other techniques such as shrinkage (lasso) regression, an early ML technique that could similarly identify patterns in large datasets [42, 43], were being developed and considered for the purpose of pharmacovigilance signal detection [44]. The Bayesian Confidence Propagation Neural Network, association rule mining, and text mining paradigms, among others, would typically fall within the scope of this review as ML technologies; however, we considered them out of scope because of their already extensive discussion in the literature [45, 46], focusing instead on emerging ‘advanced’ ML technologies.
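To make the 2 × 2 format concrete, the following minimal sketch computes two of the frequentist statistics named above (PRR and ROR) from illustrative counts; the function and example values are our own, not drawn from any cited study or database.

```python
# Minimal sketch of frequentist disproportionality statistics from a 2x2
# contingency table; counts and variable names are illustrative only.
import math

def disproportionality(a: int, b: int, c: int, d: int) -> dict:
    """a: target drug & target event; b: target drug, other events;
    c: other drugs, target event; d: other drugs, other events."""
    prr = (a / (a + b)) / (c / (c + d))  # proportional reporting ratio
    ror = (a * d) / (b * c)              # reporting odds ratio
    # 95% CI for the ROR on the log scale (standard Woolf approximation)
    se_log_ror = math.sqrt(1/a + 1/b + 1/c + 1/d)
    ci = (math.exp(math.log(ror) - 1.96 * se_log_ror),
          math.exp(math.log(ror) + 1.96 * se_log_ror))
    return {"PRR": prr, "ROR": ror, "ROR_95CI": ci}

# Hypothetical example: 40 reports of the event with the drug of interest,
# 460 other reports for that drug; 600 reports of the event with all other
# drugs, 98,900 other reports.
print(disproportionality(a=40, b=460, c=600, d=98_900))
```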
Indeed, advanced ML algorithms have recently gained popularity across many industries, including pharmacovigilance. Decision tree-based ML algorithms, including random forest and GBM/gradient-boosted trees, are examples of supervised ML algorithms built on a basic architecture of a cascading set of rules. These models offer the flexibility to handle complex datasets while making no assumptions about existing relationships between variables, but they require a large training corpus. Six of our included articles applied decision tree ML methods [14, 16, 18, 20, 23, 25]. A commonly used tree-based method was random forest, an “ensemble” or “bootstrap aggregating” (also known as “bagging”) method that aggregates results from multiple decision tree models, each of which individually “votes” for an outcome [47]. In the study by Jeong et al. [23], random forest had the highest AUROC, outperforming logistic regression, support vector machine, and a neural network. In the other three studies using random forest, this model was outperformed by another tree-based method, GBM [14, 16, 20]. Gradient boosting machine, like random forest, is an ensemble classification method; in contrast to random forest, however, it builds a series of decision trees in which each new tree corrects the errors of its predecessor until no further improvement can be made, and the final model is the weighted sum of all trees [48]. Gradient boosting machine, or a modification of it (e.g., eXtreme gradient boosting used by Gosselt et al. [18] and Zhu et al. [25] or gradient-boosted trees used by Dauner et al. [16]), was used by five of our included studies [14, 16, 18, 20, 25]. This type of ML model was a consistently high performer across these studies and often outperformed other models, including random forest. Other models used included several non-ensemble methods (e.g., decision tree) and several non-ensemble, non-tree-based methods (e.g., logistic regression and support vector machine), although these rarely outperformed their ensemble counterparts. The fact that these advanced ML algorithms generally outperformed traditional signal detection methodologies highlights the great potential of this technology to improve safety science; however, these studies generally lack the standardization, transparency, and information on the timeliness of signal detection needed to conclude definitively that ML algorithms are truly superior. Additionally, these studies generally do not discuss the logistical challenges of operationalizing ML algorithms across a diverse portfolio of products, a significant barrier to full adoption.
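To illustrate the bagging-versus-boosting contrast described above, the following hedged sketch compares scikit-learn’s random forest and gradient boosting classifiers by AUROC on a simulated, class-imbalanced dataset; the features and labels are synthetic stand-ins for PEC-level inputs, not data from any included study.

```python
# Hedged sketch comparing tree-based ensembles on a synthetic, imbalanced
# binary task (signal vs. no signal); all data here are simulated.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated stand-in for PEC-level features (e.g., disproportionality
# scores, case counts); ~10% of examples are labeled as signals.
X, y = make_classification(n_samples=5_000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

for model in (RandomForestClassifier(n_estimators=300, random_state=0),       # bagging
              GradientBoostingClassifier(n_estimators=300, random_state=0)):  # boosting
    model.fit(X_train, y_train)
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{type(model).__name__}: AUROC = {auroc:.3f}")
```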
When training their ML models, our included studies’ authors used a variety of internal or external positive and negative controls to “teach” the algorithms examples of what constitutes a valid signal and what does not. Most studies developed their own internal standards, although among these, most were developing ML algorithms for specific AEs or specific products, meaning that developing their own standards was likely the most appropriate choice. In these instances, listed ADRs typically served as the positive controls and unlisted AEs as the negative controls [14, 16, 18, 22]. Two of the three studies developing models for any AE and any (or almost any) product [20, 24] leveraged externally available “gold standards,” including “Wahab13” [41], “Harpaz14” [49], and the Exploring and Understanding Adverse Drug Reactions and Observational Medical Outcomes Partnership reference standards. However, there are documented limitations even in these established standards. For example, a 2016 study by Hauben et al. [50] found that an estimated 17% of negative controls in the Observational Medical Outcomes Partnership standard may in fact have been misclassified because of methodological oversights. In the future, where applicable, authors should make decisions about internal versus external standards cautiously and, if choosing external standards, be wary of their quality and potential effects on their models.
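As a minimal sketch of how such an internal reference set can be assembled, the snippet below labels listed ADRs as positive controls and unlisted AEs as negative controls; the label contents and observed events are entirely hypothetical.

```python
# Minimal sketch of an internal reference set: listed ADRs = positive
# controls, unlisted AEs = negative controls. All terms are hypothetical.
listed_adrs = {"nausea", "headache", "rash"}   # from reference safety information
observed_aes = ["nausea", "dizziness", "rash", "insomnia"]  # AEs seen in the database

reference_set = [
    {"adverse_event": ae, "label": 1 if ae in listed_adrs else 0}
    for ae in observed_aes
]
print(reference_set)  # label 1 = positive control (listed), 0 = negative control
```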
One benefit of supervised ML algorithms (vs unsupervised clustering ML methods), among others, is the opportunity to analyze which input features were given the highest weight by the algorithm during training (i.e., feature analysis). Users of ML algorithms would likely hypothesize that the case-level or PEC-level features known by experts to drive an assessment of causality (e.g., disproportionate reporting frequency, reporting by a healthcare provider, positive dechallenge, positive rechallenge) would be identified as top features. Indeed, feature analysis tools such as Shapley Additive Explanations feature importance [51] or lasso regression [52] were used by several of our included studies’ authors to identify some of these expected features, as described above.
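A hedged sketch of such a feature analysis using the shap package is shown below; the model, data, and feature names are illustrative placeholders rather than details from any included study.

```python
# Hedged sketch of post hoc feature analysis with SHAP on a fitted tree
# ensemble; the model, data, and feature names are illustrative.
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=2_000, n_features=10, n_informative=4,
                           random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)   # tree-specific SHAP explainer
shap_values = explainer.shap_values(X)  # per-feature contribution per sample
shap.summary_plot(shap_values, X,
                  feature_names=[f"feature_{i}" for i in range(X.shape[1])])
# In a signal detection model, features such as disproportionality scores or
# positive dechallenge/rechallenge flags would be expected to rank near the top.
```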
In contrast to the eight studies we found applying supervised ML to the task of signal detection, often in comparison to frequentist or Bayesian disproportionality analyses, only one study was identified using a clustering (i.e., unsupervised ML) approach, in this case based on k-means [15]. This is unsurprising, given that the ultimate outcome of signal detection is a simple binary classification: signal or no signal. Nevertheless, clustering is a creative solution to the complex task of signal detection. Large industry safety databases contain many PECs, even when considering only one product. When these PECs are clustered, logically, some clusters will contain PECs constituting a safety signal and others will not. However, safety scientists must then spend time “back-engineering” the clusters to identify why each cluster was formed and to determine whether a signal is present, a clear limitation of this approach.
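The sketch below illustrates the general shape of such an approach using scikit-learn’s k-means on simulated PEC-level features; the feature matrix and the choice of five clusters are illustrative assumptions, not details from the cited study [15].

```python
# Minimal sketch of unsupervised clustering of PEC-level features with
# k-means; the features and cluster count are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
pec_features = rng.normal(size=(1_000, 6))  # e.g., case counts, DPA scores, seriousness rates

X_scaled = StandardScaler().fit_transform(pec_features)  # scale before distance-based clustering
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

# Each PEC is assigned to a cluster; safety scientists must then inspect
# ("back-engineer") the clusters to judge which are enriched for true signals.
print(np.bincount(kmeans.labels_))  # cluster sizes
```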
While many studies focused on signal detection, relatively few focused on the latter stages of signal management: signal validation/confirmation and signal evaluation/assessment. These stages are critical in pharmacovigilance but are generally less dependent on statistical analysis and more dependent on qualitative expert analysis (e.g., considering the biological plausibility of the topic, the presence of confounding factors, and individual causality assessments); the scarcity of ML studies in this area reflects this fact. However, other types of AI, specifically NLP, have clear uses in these areas. The studies reviewed above by Dong et al. [17] and Haerian et al. [19] reflect this concept. Large language models such as GPT, BART, Llama, and others have a conceptual understanding of basic clinical concepts and can identify synonymous terms and match them to Medical Dictionary for Regulatory Activities terms or Anatomical Therapeutic Chemical classes. Further, there is an industry-wide desire for the development of large language models with domain-specific knowledge [53], some of which are already being developed and commercialized.
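As a hedged illustration of this kind of term matching, the sketch below asks a general-purpose LLM to suggest a standardized term for a verbatim AE description via the OpenAI Python client; the model name and prompt are our own assumptions, and any suggested mapping would still require verification by a trained coder against a licensed Medical Dictionary for Regulatory Activities release.

```python
# Hedged sketch of LLM-assisted term standardization; the model choice and
# prompt are assumptions, and output must be human-verified before use.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
verbatim = "patient felt their heart racing after the second dose"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "Suggest the single closest MedDRA Preferred Term for the "
                    "reported adverse event. Reply with the term only."},
        {"role": "user", "content": verbatim},
    ],
)
print(response.choices[0].message.content)  # e.g., "Palpitations" (to be verified)
```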
The current review must be considered in the context of its inherent limitations. First, we employed a rather narrow search and screening strategy, reducing 635 unique results to only 12 final included articles. There are many applications of AI in pharmacovigilance peripheral to signal management, or which were developed for a specific use case. Although some of these studies may theoretically be applicable to signal management in some way, we did not consider them in scope (many articles of this type are reviewed elsewhere). Furthermore, our search strategy relied on articles including specific terms or phrases related to signal management and AI/ML/NLP in their keywords, titles, or abstracts. While these are common terms, it is possible that some studies describing uses of AI in signal management with alternate phraseology were not captured here. Because this is an emerging field, publications continue to accumulate, some of which were written and became available even during the peer-review process for this article. Six such publications were identified pertaining to signal detection via advanced ML algorithms such as deep learning, reinforcement learning, and others, although these studies do not alter the main conclusions of this review [54,55,56,57,58,59]. Next, we considered device surveillance out of scope given that there is little standardization across companies’ safety organizations as to how device surveillance is performed (i.e., under or not under the umbrella of the drug safety function). Our search strategy did identify two device surveillance studies leveraging ML for surveillance of arthroplasty devices and dual-chamber implantable cardioversion devices [60, 61]; the same methodologies, strengths, and limitations nonetheless apply. Last, a lack of consistent use of performance metrics across studies precluded a robust quantitative comparison of methodologies; instead, we adopted a qualitative approach.
Looking forward, the development and use of AI tools in signal management will continue to face technical and regulatory challenges. For instance, in the field of generative AI, synthesis of non-factual information (commonly termed ‘hallucination’), mis-prompting, privacy, and bias remain key challenges [62]. Overcoming these challenges could unlock the potential of AI to draft ad hoc and periodic safety reports for regulators, a service that some vendors are beginning to supply. Similarly, in discriminative AI systems, bias in training data may be perpetuated by ML algorithms and must be investigated proactively, with several leaders in the technology industry (e.g., IBM, Microsoft) making pledges to do so [63]. Despite the rapid evolution of the technology, concrete regulatory frameworks have not yet been established. However, some publications on current thinking reveal various regulators’ ethical positions on AI. For instance, in 2021 the World Health Organization established ethical principles for AI use in healthcare, including how transparency and explainability (among other principles) must be carefully balanced with performance and innovation [64, 65]. Similarly, the European Medicines Agency recently released a draft reflection paper on the use of AI across the medicinal product lifecycle, which briefly opines on the responsibility of marketing authorization holders to “validate, monitor and document model performance and include AI/ML operations in the pharmacovigilance system” as related to signal detection and classification/severity scoring of AEs [66]. By surmounting these challenges and gracefully navigating this dynamic regulatory landscape in a risk-based manner [67], future applications of AI are poised to significantly improve efficiency across signal management, possibly even enabling the early detection of ‘black swan’ events: events that are difficult to predict, have severe and widespread consequences, and are marked by retrospective bias [68].