There are over 350,000 health-related apps on the market, each claiming to improve some aspect of physical or mental health []. A small fraction of these apps is subject to Food and Drug Administration (FDA) regulations. Regulators, health care providers, and patients need to understand how these apps compare with alternatives (eg, pharmaceuticals) that undergo rigorous evaluation. As with pharmaceuticals, the risks and benefits of apps depend on how well people use them. Incorrect assumptions about adherence in clinical trials can lead to incorrect regulatory and treatment decisions. With pharmaceuticals, these risks are reduced by the gold standard practice of intent-to-treat analysis, which estimates effectiveness based on actual, typically imperfect, use. This standard is not the norm in trials of digital health apps, leaving an unknown risk of bias (ROB) in the estimated effects. Here, we provide a systematic review of current practices in trials of FDA-regulated apps, leading to recommendations for reducing the risks of bias that the review reveals.
The FDA focuses on the regulation of software as a medical device (SaMD) therapeutics intended to prevent, diagnose, or treat diseases []. If a predicate therapeutic exists, applicants may use the FDA’s 510k pathway to demonstrate that their therapeutic is substantially equivalent to the predicate (ie, that it has the same intended use, technological characteristics, and benefits and risks as an approved or cleared therapeutic []). In the absence of a predicate, SaMD therapeutics follow the FDA’s De Novo pathway, which requires evidence that the therapeutic is safe and effective. The FDA established the Digital Health Center of Excellence to create innovative ways to regulate SaMDs [], which, for example, are easier to update than pharmaceuticals. One such innovation, the precertification pilot program, conducted excellence appraisals of software companies, testing a streamlined approach to approving and updating therapeutics for companies that have demonstrated quality practices [,]. Other innovations have been applied across all FDA departments, such as allowing clearance, approval, and marketing claims based on “real-world evidence” []. There are also proposals, created outside the FDA, specifying standard processes (eg, performance reporting standards) for clinical trials of low-risk digital health apps not subject to regulatory oversight []. Given the novelty of SaMDs and the associated regulatory environment, the FDA has the need and opportunity to create guidance and requirements for addressing adherence in future trials. We hope to inform that process.
A systematic review by Milne-Ives et al [] found that approximately three-fourths of digital health app trials collected and reported basic adherence information, such as the number of dropouts. These trials reported a variety of app engagement metrics, with only one-third reporting >60% use. Prior systematic reviews of digital health apps reported similarly simple summary statistics (eg, average adherence and dropout rates), with few details on how adherence data were collected and analyzed [-]. This systematic review extends that work by examining, in detail, how adherence and engagement information is collected, analyzed, and reported. It considers how those practices affect estimates of effectiveness, defined as the app’s effect in the entire sample regardless of adherence, and of efficacy, defined as the app’s effect in the adherent subgroup, reflecting the moderating effect of adherence. This review focuses on digital health apps with a reasonably well-defined evidentiary base, namely, those that followed the FDA’s De Novo or 510k pathways.
Criteria for Evaluation

ROB Framework

Imperfect adherence can cause underestimation or overestimation of the safety and efficacy of a SaMD. For example, a therapeutic’s efficacy and side effects may be underestimated if trial participants use it sparingly but consistent use is assumed. Conversely, efficacy may be overestimated if adherence reflects neglected confounding variables (eg, income and lifestyle factors). As a hypothetical example, researchers evaluating an app to reduce the risk of preeclampsia may observe a reduced rate not because of participant adherence but because participants adhering to the app were recipients of commercial health insurance. To evaluate the ROB owing to imperfect adherence, we used the adherence components of the Cochrane ROB Assessment (version 2.0) [], a well-documented tool for systematic reviews and meta-analyses. To determine the ROB from nonadherence, the ROB tool first asks, “Was there nonadherence to the assigned intervention regimen that could have affected participants’ outcomes?” If outcomes could have been affected, the ROB tool then asks, “Was an appropriate analysis used to estimate the effect of adhering to the intervention?” We developed criteria to answer each question based on research regarding adherence metrics and common methods of analyzing efficacy.
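To make the hypothetical preeclampsia example concrete, the following is a minimal simulation sketch of how a confounder correlated with adherence can produce an apparent benefit even when the app itself has no effect. All variable names and parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: baseline access to care (eg, commercial insurance).
insured = rng.random(n) < 0.5
# Adherence is more likely among insured participants.
adherent = rng.random(n) < np.where(insured, 0.8, 0.3)
# Outcome improves with insurance but, in this sketch, NOT with the app.
outcome = 0.5 * insured + rng.normal(0, 1, n)

# Naive adherent-vs-nonadherent comparison looks spuriously positive.
naive = outcome[adherent].mean() - outcome[~adherent].mean()
print(f"Naive adherence 'effect': {naive:.3f}")  # clearly > 0

# Stratifying on the confounder removes the spurious effect.
for grp in (True, False):
    m = insured == grp
    diff = outcome[m & adherent].mean() - outcome[m & ~adherent].mean()
    print(f"Within insured={grp}: {diff:.3f}")  # approximately 0
```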
Adherence and Engagement Metrics

Adherence refers to how well participants use an intervention, as defined by a protocol or recommendation for use. Engagement refers to how participants use an intervention, irrespective of the intended use of the app. Engagement data can be used to measure adherence for a digital health app. As both adherence and engagement can affect the outcomes of a trial, we have reported both. When collecting and reporting adherence and engagement statistics, researchers must consider 3 facets of use []: initiation, when a person starts using an intervention; implementation, how a person uses the intervention between initiation and discontinuation; and persistence, how long a person uses the intervention before discontinuation.
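As an illustration of the 3 facets, the following sketch computes one possible metric for each facet from a hypothetical per-participant log of app use dates. The log contents, prescribed duration, and metric definitions are assumptions for illustration, not metrics mandated by any trial in this review.

```python
from datetime import date

# Hypothetical engagement log: dates on which one participant used the app.
log = [date(2024, 1, d) for d in (2, 3, 5, 8, 9, 15)]
protocol_days = 28  # hypothetical prescribed duration of use

initiation = len(log) > 0                 # did the participant ever start?
first, last = min(log), max(log)
span = (last - first).days + 1            # active window, in days
implementation = len(set(log)) / span     # share of active days with use
persistence = span / protocol_days        # fraction of prescribed period covered

print(initiation, round(implementation, 2), round(persistence, 2))
```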
Which metrics are collected and how they are collected can also affect the ability to conduct efficacy analyses and the analyses’ potential bias. For instance, adherence to recommendations from the therapeutic (eg, using backup contraception when an app detects fertility) could also affect effectiveness estimates. Without collecting this information, researchers would be unable to analyze efficacy in terms of adherence to behavioral recommendations. Therefore, we report adherence and engagement with both the therapeutic and its recommendations. The mechanism for collecting adherence and engagement information can act as a potential confounder if it prompts more engagement with the therapeutic than would occur in real-world use. Reminders used to increase adherence (eg, email messages) can also be confounders if they are not part of the therapeutic’s design. To account for these potential confounders, we recorded whether reminders and mechanisms for measuring adherence and engagement were internal to the app or external (ie, an additional component not found in the marketed app). We found few prior studies or analysis plans that determined the level of adherence or engagement required to have a clinical effect, a level that can vary across therapeutics. Without a study or trial analysis plan defining low adherence, or evidence of the level of adherence needed to produce a clinical effect, we could not conclusively assess whether adherence was low.
Analysis of Efficacy

In evaluating efficacy analyses, we ask how well a trial or study fulfills the assumptions required by its efficacy analysis method. There are 3 commonly used estimates of efficacy: the average treatment effect (ATE), per-protocol effect, and dose-response effect. Table 1 describes each estimate, the common analysis methods for calculating it, and the assumptions required for unbiased estimates; a numerical sketch contrasting these estimates follows the table. Definitions of the following assumptions are provided in [-]: consistency, positivity, ignorability, exclusion restriction, strong monotonicity, and the stable unit treatment value assumption (SUTVA). In addition to the requirements in Table 1, researchers should preregister their analyses of effectiveness and efficacy to reduce the risk of capitalization on chance [].
Table 1. Methods of analysis commonly used to account for imperfect adherence and the assumptions required for unbiased estimates.
Columns: Estimate of efficacy and common analysis methods; Assumptions for unbiased estimates.
ATE^a: estimates the average effect of treatment
^a ATE: average treatment effect.
^b SUTVA: stable unit treatment value assumption.
^c Consistent definition of treatment.
^d ITT: intent-to-treat.
^e Consistent definition of adherence.
^f IV: instrumental variable.
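The following minimal simulation sketch contrasts an ITT estimate of effectiveness with a naive per-protocol estimate on simulated trial data, showing how nonadherence dilutes the former and confounds the latter. The effect size, adherence model, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

assigned = rng.random(n) < 0.5            # randomized assignment to the app
health = rng.normal(0, 1, n)              # latent health status (confounder)
# Healthier participants are more likely to adhere when assigned.
adherent = assigned & (rng.random(n) < 1 / (1 + np.exp(-health)))

true_effect = 0.3                         # effect only when the app is used
outcome = true_effect * adherent + 0.5 * health + rng.normal(0, 1, n)

# ITT: compare by assignment, regardless of actual use.
itt = outcome[assigned].mean() - outcome[~assigned].mean()
# Naive per-protocol: compare adherent users with the unassigned group.
per_protocol = outcome[adherent].mean() - outcome[~assigned].mean()

print(f"ITT (effect of assignment):     {itt:.3f}")           # diluted, < 0.3
print(f"Naive per-protocol estimate:    {per_protocol:.3f}")  # inflated, > 0.3
```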
We applied our framework, which was developed based on the Cochrane ROB, to evaluate how well existing trials and studies meet our standards, with the goal of improving future trials. We examined the completeness of their reporting and the appropriateness of the procedures reported. By focusing on SaMD therapeutics, the most rigorously evaluated digital health apps, we sought to identify improvements for future studies on all digital health apps.
A 2-stage search strategy was used to identify all product codes and registrations for patient-facing SaMDs, with intended repeated use for at least 2 weeks, that the FDA had approved or cleared before March 2022. In the first stage, 2 reviewers independently searched the FDA product code database for product codes related to SaMDs. We searched the device name, definition, physical state, and technical method attributes for the keywords “software,” “mobile,” “digital,” and “application.” In the second stage, we searched the FDA registration database for these product codes. We examined each registration’s supporting documents, De Novo decision summaries, and 510k decision summaries to determine whether the product met our inclusion criteria.
We then searched ClinicalTrials.gov, product websites, and MEDLINE for peer-reviewed publications corresponding to each included product. For the ClinicalTrials.gov search, we used the product and company names as keywords, individually and in combination, to identify clinical trials. We included all publications that evaluated the effectiveness or efficacy of the included products, including both randomized controlled trials (RCTs) and observational studies. We reviewed all publications listed at the end of the ClinicalTrials.gov registration for potential inclusion. For the MEDLINE search, product and company names were used as keywords. For the product website search, publications listed as clinical evidence on company websites were included. Two reviewers independently screened each publication, examining the title and abstract as well as the full text, where appropriate. Reviewer disagreements were reconciled by discussion. We screened and included only those articles published before March 2022. We did not include pilot or feasibility studies.
For example, the first stage of the search identified the PYT product code when the “device name” field was searched for “software.” All registrations coded as PYT (ie, “Device, Fertility Diagnostic, Contraceptive, Software Application”) were then evaluated for inclusion based on corresponding supporting documents, 510k decision summaries, and De Novo decision summaries. One included 510k for this product code was for the Clue app, K193330. In the second stage, we searched ClinicalTrials.gov using the keywords “Clue,” “Clue Birth Control,” “Biowink,” “Dynamic Optimal Timing,” and “Cycle Technologies.” We searched MEDLINE using the keywords “Dynamic Optimal Timing,” “Biowink,” and “NCT02833922.” Finally, we searched the product website [] for clinical trial documents.
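The review does not state how the database queries were executed. As one possible way to script the first-stage keyword search, the following sketch queries the public openFDA device classification endpoint; the endpoint URL, field names, and query syntax are our assumptions about openFDA, not a description of the authors’ procedure.

```python
import requests

# openFDA device classification endpoint (public; no API key needed for light use).
URL = "https://api.fda.gov/device/classification.json"
KEYWORDS = ["software", "mobile", "digital", "application"]
FIELDS = ["device_name", "definition", "physical_state", "technical_method"]

codes = set()
for kw in KEYWORDS:
    for field in FIELDS:
        resp = requests.get(
            URL, params={"search": f"{field}:{kw}", "limit": 100}, timeout=30
        )
        if resp.status_code != 200:
            continue  # openFDA returns an error status when there are no matches
        for rec in resp.json().get("results", []):
            codes.add((rec.get("product_code"), rec.get("device_name")))

# Candidate product codes for manual review of registrations and summaries.
for code, name in sorted(codes):
    print(code, "-", name)
```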
Data Extraction

For each publication, one reviewer extracted data and the other reviewer checked the accuracy of the data. Differences were reconciled by discussion between the reviewers. The Cochrane Data Collection Form for Intervention Reviews [] was completed with clinical trial characteristics, including the design, number of participants, sampling method, interventions, and outcomes.
The remainder of the data extraction form was created using the criteria for reporting adherence metrics described in the Adherence and Engagement Metrics section and the assumptions for the associated efficacy analysis method described in the Analysis of Efficacy section. Given the diversity of the apps and outcomes, we reported each metric that a clinical trial or study reported separately, without averaging across different metrics. When evaluating efficacy analyses, we categorized trials or studies as fulfilling the positivity condition if they had a control group. We categorized trials as fulfilling the consistency condition if they had definitions of treatment and adherence that avoided hidden variations of treatment that might affect participants differently.
Some assumptions, referenced in Table 1 and described above, could not be fully evaluated. One such assumption is SUTVA, which requires no interaction between units of observation that could affect a result. Although it is impossible to prove that this assumption holds, some trial designs afford greater confidence than others. For example, if a trial has no central clinical team and treatment is administered only through an app, it would be difficult for participants to interact with the clinical research staff. By contrast, if clinical research staff interact with both the control and treatment groups, they might treat participants in the 2 groups in ways that affect their independence. We categorized a trial as fulfilling SUTVA if it had no central clinical team or if it had mechanisms for reducing the risk of interaction between participants or between participants and staff.
Similarly, it is impossible to fully evaluate the assumption that there are no unmeasured confounders. Instead, we asked whether the researchers demonstrated awareness of confounders by listing potential confounders explicitly and reporting their rationale for selecting them.
The results in the Adherence Metrics section and Analysis of Efficacy section below summarize practices for the included trials using means or counts as appropriate. Given the heterogeneity of the therapeutics and outcomes, we did not estimate the overall impact of all biases. The protocols and preregistrations referenced in the included articles were used as supporting documents. The protocol for this review was registered on the Open Science Framework [], which includes the data extraction forms and extracted data. Article screening data, extracted data, and summarized extracted data are also available in [-].
The completed PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) diagram shows the flow of study selection. The 2-stage search for SaMD therapeutics identified 5% (15/301) of product codes and 44% (24/54) of registrations as potential SaMDs. These registrations included 18 unique SaMD therapeutics. Our search of ClinicalTrials.gov, company websites, and MEDLINE identified 40, 228, and 148 articles, respectively. After screening and removal of duplicate articles, 24 articles, involving 10 products, met all the inclusion criteria. A total of 8 products were excluded because our literature search found no clinical trials or observational studies evaluating efficacy over at least 2 weeks.
As seen in Tables 2 and 3, the 24 included articles (22 total trials) studied a variety of SaMD therapeutics, including those intended to treat irritable bowel syndrome, insomnia, substance use disorder, and attention-deficit/hyperactivity disorder. All the SaMD therapeutics were mobile apps and will be referred to as apps for the remainder of the article. Table 3 shows an even mix of continual use and module-based apps. Most trials (18/22, 82%) specified a recommended dose for their app, such as the frequency of use or the number of modules to complete. Overall, 11 (50%) trials or studies evaluated apps with a module-based design and a recommended dose [,-,-], whereas 7 (32%) evaluated apps with a continual use design and a recommended dose [,,-]. Apps without a recommended dose used only the continual use design (4/22, 18%) [-,].
Table 2. Included articles and associated products.
Columns: Product and condition treated; Study, year; Title.
Apple Irregular Arrhythmia Notification
^a ADHD: attention-deficit/hyperactivity disorder.
^b CBT: cognitive behavioral therapy.
^c IBS: irritable bowel syndrome.
^d Everitt et al [] and Everitt et al [] were based on the same trial.
^e SUD: substance use disorder.
^f OUD: opioid use disorder.
^g Christensen et al [] and Maricich et al [] were based on the same trial.
Table 3. Summary of devices and trials included in the study (n=22).
Columns: Characteristics; Values.
Therapeutic indication for use, n (%)
^a ADHD: attention-deficit/hyperactivity disorder.
^b IBS: irritable bowel syndrome.
^c RCT: randomized controlled trial.
Most trials (14/22, 64%) were observational, with the remainder being RCTs (8/22, 36%). On average, the RCTs recruited 290 (SD 120) participants and lasted 300 (SD 270) days. On average, the observational trials recruited 5100 (SD 7000) participants and lasted 230 (SD 140) days.
Adherence Metrics

Table 4 summarizes how the articles measured and reported each of the 3 facets of adherence. As each article could report different adherence metrics and separate analyses for the same trial or study, duplicate trials and studies were counted twice. Of the 24 articles, 23 (96%) collected information about app engagement. All apps that provided recommendations (8/8, 100%) also collected information about adherence to their recommendations [,,,-]. Of the 23 articles that collected adherence information, 2 (9%) reported that adherence information was collected externally from the marketed app [,]. Three articles reported that researchers attempted to increase adherence by notifying inactive patients [-]: 1 reported using in-app notifications, and 2 reported using email notifications.
Table 4. Summary of adherence metrics (N=24)^a.
Adherence metrics | Values, n (%) | Each reported metric (%), mean (SD)
Trial collected information about app engagement | 23 (96) | N/A^b
Trial collected information about adherence to recommendations (n=8 articles for apps that gave recommendations) | 8 (100) | N/A
Adherence information collected outside of the marketed app (n=23 articles for apps that collected adherence information) | 2 (9) | N/A
Adherence notification sent outside of app (n=3 articles reported sending adherence notifications) | 2 (67) | N/A
Engagement metrics (metric is not measuring prescribed use)
^a The left-hand columns report what percentage of articles reported adherence or engagement information and what metrics were used by each article. The right-hand columns report the mean and SD for all the articles that reported that metric.
^b N/A: not applicable for summary of facets of adherence.
^c SD values are not applicable as only 1 article was included.
A total of 4 articles studied a product without prescribing how often to use the app. Engagement was reported in 3 articles on these products. Of the 24 articles, engagement was reported for 2 (8%) in terms of initiation, 2 (8%) in terms of implementation, and 1 (4%) in terms of persistence. Two continual use therapeutics prescribed app use in terms of initiation and implementation but not persistence. As such, 25% (6/24) of the articles studying these apps reported engagement persistence metrics.
Of the 24 articles, 15 (63%) reported initiation in 4 different ways (eg, the number of users who finished the first app module and the number of users who entered 20 data points into the app). Seven articles excluded participants who did not initiate app use, inflating their adherence metrics. Of the 24 articles, 16 (67%) reported implementation, with 9 different definitions (eg, the proportion of days between starting and stopping app use on which users logged their temperature and the number of perfect use cycles reported by women [ie, abstaining or using contraception on all high-risk days]). Of the 24 articles, 4 (17%) reported persistence, with 2 different definitions (participants using the app over the prescribed period and participants completing the prescribed number of modules). Table 4 reports the percentage of studies that used each metric and the average adherence across the trials and studies that used it. Of the 20 articles that prescribed use of the app, only 9 (45%) reported all prescribed facets of adherence [,-,,].
ROB: “Nonadherence to the Assigned Intervention Regimen”

Of the 24 articles, 4 (17%) reported only engagement information, as there was no prescribed amount of app use. We found that the outcomes of the remaining articles could have been affected by nonadherence. Of the 83% (20/24) of articles for apps with prescribed use, 25% (5/20) reported adherence at or below their definition of low adherence for at least 1 facet of adherence. Of the remaining 15 articles, 12 (80%) reported some nonadherence to the app for at least 1 prescribed facet of adherence or to the app’s behavior recommendations but did not provide a definition of low adherence. These articles provided insufficient information to determine whether adherence was sufficient for each app. The remaining 3 articles did not report enough information about each prescribed facet of adherence to judge adherence.
Analysis of Efficacy

Table 5 summarizes the effectiveness and efficacy estimates from each article. Of the 24 articles, 20 (83%) estimated the app’s effectiveness as the ATE for all participants. Of these 20 articles, 11 (55%) preregistered their analysis of effectiveness. A higher percentage of RCTs preregistered their effectiveness analysis (7/9, 78%) compared with observational studies (4/11, 36%). Of the 24 articles, 15 (63%) estimated efficacy in terms of the ATE, per-protocol effect, or dose-response effect. Of these 15 articles, only 5 (33%) preregistered an efficacy analysis. Preregistration was more common for RCTs (3/6, 50%) than for observational trials (2/9, 22%).
Table 5. Summary of efficacy estimates (N=24).
Efficacy estimates | Values, n (%) | References
Effectiveness estimate | 20 (83) | —^a
^a References not listed for summary rows.
^b RCT: randomized controlled trial.
Table 6 characterizes the articles in terms of how well they meet the assumptions for their method of analysis. Of the 24 articles, 2 (8%) estimated efficacy in terms of the ATE [,]. One used an intent-to-treat analysis and met the relevant reporting requirement []; the other calculated the ATE for an observational trial []. The latter met the criteria for SUTVA and had a clear definition of the treatment condition but did not meet the positivity condition because it lacked a control condition. The study adjusted for 1 confounder without explaining how it was chosen.
Table 6. Fulfillment of required assumptions for efficacy analyses (n=14).
Columns: Estimate category, analysis method, and article; SUTVA^a, n (%); Positivity, n (%); Consistency, n (%); Exclusion restriction, n (%); Strong monotonicity, n (%); Assignment mechanism ignorability.