Extensive numerical studies were carried out to establish the operating characteristics of the proposed adaptive design, which uses the Bayesian PP to adaptively increase the sample size and/or the minimal follow-up time. Several scenarios listed in Fig. 1 are investigated, including the null and the design alternative scenarios under the PH assumption as well as scenarios where the PH assumption is violated. To understand the benefits of the proposed design under non-proportionality of hazards, it is important to compare its operating characteristics to those of a traditional (frequentist) adaptive design with sample size re-estimation using CP. A description of this comparator frequentist design is provided in the Appendix. The power of a fixed design, i.e. one without any interim adaptation, is also included in the simulation results in the section below.
First we illustrate the computation of Bayesian PP for the delayed treatment effect scenario in Fig. 1 using one simulated dataset. Readers interested in mathematical details of the PP computation can find more information in the Appendix.
Illustrating Predictive Power Computation, Interim Decision, and Final Analysis for the Delayed Effect Scenario
Using one simulated dataset under the delayed treatment effect scenario (HR is 0.9 for the first 12 months and 0.65 thereafter, scenario 3 in Fig. 1), we illustrate the computation of the Bayesian PP and how an interim decision to increase the sample size and/or the minimal follow-up time can be aided by the computed predictive powers and a grid search ranging from the planned to the maximum sample size and minimal follow-up. Note that the dataset chosen here for illustration is such that the frequentist CP described in the Appendix is small (below the promising zone) while the PP is in the promising zone. This situation is not an unlikely one, as can be seen from the scatter plot of CP vs. PP for 100 simulated datasets under the delayed treatment effect scenario in Figure A2 (Appendix).
Predictive Power
In Sect. 2.1 we introduced the Bayesian PP as an average conditional power. To be more precise the PP is the expected conditional power where the expectation is taken over the posterior distribution of the parameters of interest based on the interim data. In our case, the parameters of interest are the piecewise hazard rates for the two arms corresponding to the piecewise exponential model. In general terms, the CP is defined as the probability of showing statistical significance at the final analysis given the interim data. The CP in the frequentist framework has a closed form mathematical expression in terms of the interim log-rank z-statistic, the interim hazard-ratio estimate and the final planned number of events of interest while the Bayesian PP needs to be computed using Monte-Carlo methods.
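As a rough illustration of the closed-form CP mentioned above, the sketch below uses the standard Brownian-motion (B-value) approximation for a one-sided log-rank test with 1:1 allocation. This is an assumption made for illustration only; the expression actually used for the comparator design is the one given in the Appendix and may differ. The function name, its arguments, the sign convention (positive z favours treatment) and the one-sided level of 0.025 are hypothetical choices.

```python
from math import log, sqrt
from statistics import NormalDist


def conditional_power(z_interim, d_interim, d_final, hr_assumed=None, alpha=0.025):
    """Approximate CP of a one-sided log-rank test at the final analysis.

    z_interim  : interim log-rank z-statistic (positive = favours treatment)
    d_interim  : number of events observed at the interim
    d_final    : planned number of events at the final analysis
    hr_assumed : hazard ratio assumed for the remainder of the trial;
                 if None, the "current trend" implied by z_interim is used
    """
    nd = NormalDist()
    t = d_interim / d_final            # information fraction
    b_value = z_interim * sqrt(t)      # B(t) = Z(t) * sqrt(t)
    if hr_assumed is None:
        drift = z_interim / sqrt(t)    # current-trend estimate of E[Z(1)]
    else:
        # E[Z(1)] under 1:1 allocation (assumed): -log(HR) * sqrt(D) / 2
        drift = -log(hr_assumed) * sqrt(d_final) / 2.0
    z_crit = nd.inv_cdf(1.0 - alpha)
    # B(1) - B(t) ~ N(drift * (1 - t), 1 - t), independent of B(t)
    return 1.0 - nd.cdf((z_crit - b_value - drift * (1.0 - t)) / sqrt(1.0 - t))
```

Under this approximation an interim HR estimate close to 1 gives a drift close to zero, so the CP is small regardless of any late separation of the survival curves.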
For the illustrative dataset of the delayed treatment effect scenario, the interim analysis is carried out at 21 months calendar time with 1077 randomized patients (86% of the planned 1252) and with 200 primary endpoint events, 35 early withdrawals and the remaining 842 patients administratively censored with incomplete follow-up. The interim estimate for the hazard ratio using a Cox PH model is 0.96 (p-value = 0.817). The Kaplan-Meier plots obtained at the interim analysis are shown in Fig. 2 using dashed lines.
Fig. 2 Posterior bands for survival functions. Green: Treatment, Red: Control. Solid dark lines are posterior medians till 21 months, dotted lines are extrapolations for the median survival curve beyond 21 months using the posterior distribution of piecewise hazards, while dashed lines are Kaplan-Meier plots till 21 months.
The PP computation can be understood using the survival bands in Fig. 2, obtained using the Bayesian PWE model described above. See Appendix A1 on how these survival bands are computed from the posterior distribution of the piecewise hazard rates. The essence of the calculation is to use these survival bands to predict new events that complete the incomplete data available at the interim, along with predictions for patients who are yet to be enrolled. For computing the PP, one survival function is randomly drawn for each arm from the green and red bands in Fig. 2. Using the drawn survival curves, the incomplete data are completed conditional on the treatment assignment and the observed follow-up times till the interim. For example, if a patient is administratively censored at the time of the interim with an incomplete follow-up of 6 months, then the time of their event is predicted based on the drawn survival curve between 6 months and the end of the study. For new patients yet to be enrolled, the entire drawn survival curve starting at zero is used, while the recruitment rate is estimated from the interim data. Similarly, a Bayesian PWE model is used to model the observed censoring times till the interim and to predict censoring times for administratively censored and new patients. Once a set of predictions is made, a complete dataset of the required sample size is obtained. The final analysis is carried out on this complete dataset using the frequentist log-rank test for the treatment effect in terms of the hazard ratio, and a binary indicator of trial success is noted. This step of completing the incomplete data available at the interim is repeated several (\(R\ge 1000\)) times, each time noting the binary indicator of trial success. The proportion of trial successes in \(R\) replications provides an estimate of the Bayesian conditional power for the particular pair (treatment, control) of survival curves drawn.
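The completion step for an administratively censored patient can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: it assumes that a drawn survival curve is represented by strictly positive piecewise-constant hazards, and it samples an event time conditional on survival up to the observed follow-up by spending a unit-exponential amount of cumulative hazard. The function name, the break point and the hazard values are hypothetical.

```python
import numpy as np


def sample_event_time_beyond(c, cuts, hazards, rng):
    """Draw T given T > c under a piecewise-constant hazard.

    c       : follow-up already observed without an event (0 for a new patient)
    cuts    : interior break points of the hazard, ascending (may be empty)
    hazards : len(cuts) + 1 strictly positive hazard rates, one per interval
    """
    edges = np.concatenate(([0.0], cuts, [np.inf]))
    hazards = np.asarray(hazards, dtype=float)
    # Remaining cumulative hazard until the event is Exp(1) distributed.
    remaining = rng.exponential(1.0)
    t = float(c)
    j = int(np.searchsorted(edges, t, side="right")) - 1  # interval containing t
    while True:
        capacity = hazards[j] * (edges[j + 1] - t)  # hazard available in interval
        if remaining <= capacity:
            return t + remaining / hazards[j]
        remaining -= capacity
        t = edges[j + 1]
        j += 1


rng = np.random.default_rng(2024)
# Hypothetical monthly hazards with a change point at 12 months.
predicted = sample_event_time_beyond(6.0, cuts=[12.0], hazards=[0.020, 0.013], rng=rng)
```

If the predicted time exceeds the end of follow-up implied by the chosen sample size and minimal follow-up, the patient remains censored in the completed dataset. Calling the function with `c=0` corresponds to a patient who is yet to be enrolled.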
The drawing of the pairs of survival curves is then repeated several times (\(B\ge 10{,}000\)) and the Bayesian conditional power is calculated for each pair. The average of the Bayesian conditional power over all sampled survival curve pairs then provides a Monte-Carlo estimate for the Bayesian PP.
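The nested Monte-Carlo structure described above (an outer loop over \(B\) posterior draws and an inner loop over \(R\) completed datasets) can be sketched generically as below. The two callables are hypothetical placeholders: in the actual procedure, `draw_curve_pair` would sample piecewise hazards for both arms from their posterior, and `trial_succeeds` would complete the interim data and apply the log-rank test. The toy stand-ins at the bottom only make the sketch runnable and are not part of the method; the small default values of `B` and `R` are likewise chosen for speed, not accuracy.

```python
import numpy as np


def predictive_power(draw_curve_pair, trial_succeeds, B=200, R=100, seed=0):
    """Estimate the PP as the average of B Bayesian conditional powers.

    draw_curve_pair(rng)      -> one posterior draw (treatment, control curves)
    trial_succeeds(pair, rng) -> True if the completed dataset is significant
    """
    rng = np.random.default_rng(seed)
    cond_powers = np.empty(B)
    for b in range(B):
        pair = draw_curve_pair(rng)                      # outer loop: posterior draw
        wins = sum(bool(trial_succeeds(pair, rng)) for _ in range(R))
        cond_powers[b] = wins / R                        # conditional power for this pair
    return cond_powers.mean(), cond_powers


# Toy stand-ins (hypothetical): a "pair" is reduced to its success probability.
def toy_draw_pair(rng):
    return rng.beta(3.0, 1.0)


def toy_trial_succeeds(pair, rng):
    return rng.random() < pair


pp_estimate, cp_draws = predictive_power(toy_draw_pair, toy_trial_succeeds)
```

Keeping the per-draw conditional powers (`cp_draws`) alongside their mean makes it possible to inspect how strongly the probability of success varies across plausible survival curves, not just its average.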
For the same simulated dataset, the CP at the planned sample size of 1252 and a minimum follow-up of 12 months, assuming proportional hazards, is only 0.02, suggesting no change of trial design and possibly stopping for futility. On the other hand, the Bayesian PP at the planned sample size and minimum follow-up is 0.746, which is in the promising zone, suggesting an increase in sample size and/or minimum follow-up in order for the predictive power to achieve the target power of 90%. These results can be explained by the survival bands and the Kaplan-Meier curves till the 21 months calendar time of the interim analysis in Fig. 2. We can clearly see the crossing of the Kaplan-Meier curves around 12 months and the late separation of the curves. However, the commonly used CP does not account for this late separation. Under the proportional hazards assumption this scenario results in an interim HR estimate close to 1, suggesting no treatment effect. This illustrates that interim decisions based on the CP can be misleading in scenarios where proportionality of hazards may be violated.
Interim Decisions Using the Predictive Power
Continuing with the simulated dataset used for illustration in the sections above, we illustrate how the interim decision regarding the increase of sample size and/or minimal follow-up time is made. Recall that the PP based on the planned sample size of 1252 and a minimum follow-up of 12 months for this interim dataset is 0.746. We carry out a grid search over different sample sizes ranging from 1252 to the maximum pre-specified sample size of 2500 and different minimum follow-up times ranging from 12 months to the maximum pre-specified follow-up of 36 months, and calculate the PP for each combination of sample size and minimum follow-up time. The PP at each combination of sample size and minimal follow-up is shown in the heatmap in Fig. 3. The combination that yields a predictive power of at least 90% with a minimum predicted study duration, defined as enrollment time plus minimum follow up (12 months), is chosen as the optimal adaptation. Figure 3 below shows the grid search results over a coarse \(7\times 7\) grid. We see that the combinations hitting the 90% PP mark are \((N=1252, F=16)\) with a predicted study duration of 39 months, \((N=1400, F=16)\) with a predicted study duration of 39.4 months and \((N=1800, F=11)\) with a predicted study duration of 41.3 months. Note that cells above and to the right of these cells all have predictive power greater than 90%. We choose \((N=1252, F=16)\) as the starting combination and via a finer grid search we get a PP of 0.919 with \((N=1400, F=15)\), with a predicted study duration of 38.43 months and 602 events. This is the combination that minimizes the total study duration and is hence used as the interim adaptation decision. With this adaptation the final analysis yields an HR estimate of 0.782 (p-value = 0.002).
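The selection rule described above (among all grid cells reaching the target PP, take the one with the shortest predicted study duration) can be sketched as follows. The function and field names are hypothetical. The durations and the PP values 0.746 and 0.919 are taken from the text; the remaining PP values of 0.90 and the duration of the planned design are placeholders, since the text does not report them.

```python
def choose_adaptation(candidates, target_pp=0.90):
    """Return the candidate with PP >= target_pp and the shortest duration.

    candidates : list of dicts with keys "N", "F", "pp", "duration"
    Returns None when no candidate reaches the target PP.
    """
    feasible = [c for c in candidates if c["pp"] >= target_pp]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["duration"])


candidates = [
    # Planned design: PP from the text, duration is a placeholder.
    {"N": 1252, "F": 12, "pp": 0.746, "duration": 35.0},
    # Coarse-grid cells reported as reaching 90% PP (PP values are placeholders).
    {"N": 1252, "F": 16, "pp": 0.90, "duration": 39.0},
    {"N": 1400, "F": 16, "pp": 0.90, "duration": 39.4},
    {"N": 1800, "F": 11, "pp": 0.90, "duration": 41.3},
    # Finer-grid cell reported in the text.
    {"N": 1400, "F": 15, "pp": 0.919, "duration": 38.43},
]

best = choose_adaptation(candidates)
```

In practice each `pp` entry would come from a separate PP computation at that sample size and minimal follow-up, which is why a coarse grid followed by a finer local search keeps the computational cost manageable.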
Fig. 3 Grid search optimizing for study duration with PP ≈ 0.9
Simulations for Establishing Operating Characteristics
Here we describe the results of running several simulations under each of the scenarios in Fig. 1 in order to establish and compare the operating characteristics of the proposed adaptive design with those of a traditional adaptive design using the CP. Note that, unlike in the frequentist CP setting [12], there is no analytical formula for the PP cut-off that guarantees preservation of the type-I error rate under sample size adaptation; the cut-off therefore needs to be determined using extensive simulations. Our background simulations, not covered here, show that a cut-off of PP = 0.5 provides type-I error rate control.
For all the scenarios listed in Fig. 1 replications were carried out except for the null scenario for which the results are based on 10,000 replications.
The simulation results presented in Table 2 demonstrate that using the Bayesian PP approach leads to more “specific” adaptation in sample size and/or minimum follow-up compared to the frequentist CP methods. By “specific” we mean that the sample size increase is not triggered unnecessarily when the interim data exhibit evidence of non-proportionality in hazards. This is achieved through flexible modeling, namely two separate PWE models for the treatment and the control arms, and through the full use of patient-level data via the posterior predictive distributions under the Bayesian framework. The overall operating characteristics, such as type-I error rates and statistical power under the different scenarios, remain similar between the two methods. Additionally, we observed that while the probability of interim results falling within the promising zone (as shown in Table 2, column 7) is smaller for the Bayesian PP method, there are greater gains in statistical power when sample size and/or minimum follow-up are adapted as opposed to when they are not adapted. This difference can be seen in the powers with and without adaptation when interim results fall within the promising zone (as indicated in Table 2, columns 8 and 12). In essence, this means that the adaptation process is more specific, as it is triggered only when it is most needed. There is one exception to this trend, which occurs in the early-benefit only scenario. In this case, the probability of falling within the promising zone is higher for the Bayesian PP method compared to the frequentist CP method. However, this can be explained by the fact that, under this scenario, the use of the proportional hazards assumption results in most instances where the frequentist CP is either above or below the promising zone, leading to fewer adaptations when compared to scenarios with no adaptations.
This once again highlights the limitation of making interim decisions based on the proportional hazards assumption, especially when it may not hold true.
Table 2 Operating characteristics of the proposed adaptive design with Bayesian PP compared with the standard frequentist design with adaptation based on CP