There is increasing use of claims and electronic health record databases to evaluate the safety and effectiveness of medical products.1–3 For such studies, computable phenotypes are used to measure patient characteristics and outcomes. However, their measurement is subject to error; minimizing such error and quantitatively assessing the arising bias is essential for causal inference.4,5
Validation studies are needed to understand the measurement characteristics of computational phenotypes compared to medical charts as an accepted reference, for instance, their specificity or positive predictive value.6 The core of a validation study is a set of labeled charts that provide a reference standard. However, establishing such labeled charts is expensive and time-consuming as medical experts need to review and label each chart. Approaches to minimize the number of charts that need to be labeled are desirable.7,8
As a practical example, we might be interested in knowing the positive predictive value (PPV) of the algorithm used to define an outcome of hospitalization for heart failure (HHF) for a study that compares the effectiveness of glucagon-like peptide-1 and sodium-glucose cotransporter 2 inhibitors. We focus on the PPV in this work because it is a measure that tells us how often it is actually true when our data say that a patient had an HHF event. When the PPV is high, we can be confident that the events represent real HHF events. But when there is misclassification of true HHF events, and especially if the misclassification differs between exposure groups, this can cause bias. However, there are methods that can be used to correct for bias due to misclassification using metrics from validation studies.9,10
Previous work considered a Bayesian framework for a validation study design which allows one to detect when sufficient validation data for a binary measure have been collected, whereby charts were drawn in random batches.11 However, this is done without focus on Neyman’s sampling or other sampling strategies, and the prediction of the stopping time is likewise unconsidered. Newer work employs Neyman sampling and raking weights assuming a fixed number of charts to review and samples to spend.12 The approach is not data adaptive and the authors did not compare it to alternative sampling approaches. Other work considered several sampling strategies including Neyman’s sampling in a classical measurement-error framework, but in a parametric model context and without the use of confidence bands or Bayesian credible intervals.13 To the best of our knowledge, the focus of our work, precisely the evaluation of adaptive multi-wave Neyman’s sampling, has not been previously considered.
In this work, we considered the use of an efficient sampling process in which batches of charts are adaptively sampled and evaluated sequentially to decide if a parameter of interest, such as the PPV, exceeds one (or more) pre-specified stopping thresholds that would indicate sufficiently high performance or precision of the algorithm that further validation is unnecessary. Similarly, if the sequential evaluations indicate futility, in other words, that even with infinite additional samples we do not expect to observe desirable performance for a particular algorithm, then spending additional resources on chart review can be avoided. Furthermore, an adaptive sampling process for chart validation can be used to track and predict time-dependent quantities. For example, it may be of interest to predict when a validation study will meet a pre-specified stopping threshold to aid in planning, assuming that the cost of the validation study is proportional to the number of charts reviewed.
For this reason, we present an adaptive, multi-wave sampling validation algorithm for the scenario in which we have data consisting of a (binary) outcome variable that is being validated and several patient characteristics that may or may not be risk factors for the outcome variable. For illustrative purposes, the main quantity of interest in the chart validation study is the PPV for the proposed algorithm given the chart review reference standard.
We select two thresholds to encode a stopping criterion, for example, once we are confident that the lower bound of the sequentially evaluated PPV lies above some pre-specified threshold τ1, or that the upper bound of the PPV lies below some pre-specified threshold τ2 (for instance, τ1=τ2=0.8).
We are interested in exploring three aspects of the validation. First, we explore four strategies for sampling charts: random sampling (which disregards baseline covariates), two variants of stratified random sampling (equal-size batches across strata defined by baseline covariates, or batches proportional to the strata sizes), and Neyman’s sampling. The latter adapts the sample size drawn from each stratum based on the variance of the outcome in that stratum. Second, we explore the performance of our validation process when using two different approaches to compute confidence bands around the sequentially evaluated PPV. These are the simultaneous confidence bands of Lai,14 and Bayesian credible intervals. Third, we investigate forecasting, while the study is running, when the stopping criterion will be met, in the sense that at any point in time we compute an estimate of the remaining number of charts to review.
The algorithm presented in this paper has been implemented in the R-package “chartreview”, available on CRAN.15
Methods
This section first introduces some notation and the types of random sampling we consider. Those are random sampling, stratified random sampling (equal-size batches across strata, or proportional to the strata sizes), and Neyman’s sampling. We then define the types of confidence bands we use, in particular confidence bands and Bayesian credible intervals for a binomial quantity, and normal intervals for a continuous quantity. Moreover, we add a few remarks about raking. The section concludes with an algorithm to perform the chart validation.
Notation
We are given data for N∈ℕ individuals, encoded in a matrix X∈ℝ^(N×p) with each row containing data on p∈ℕ patient baseline characteristics. Additionally, we observe a vector y∈ℝ^N containing some response (outcome) of interest that is measured with an unknown degree of error. The patient baseline characteristics may or may not be risk factors associated with y. Moreover, we assume that the data in X can be stratified in m∈ℕ strata, which will determine how the sampling process is carried out.
We are interested in computing some statistic T(y) of interest. In many cases, T will be the PPV; however, our approach is not limited to the PPV and other metrics can be used so long as they are either monotone transformations of the PPV (meaning the bounds on the PPV we compute in this work carry over to bounds on T) or methodology is available to estimate T(y) and its confidence bounds by other means.
Overview of the Statistical Methods
The following introduces three statistical ingredients which we will combine in our algorithm for adaptive multi-wave chart review sampling. Those three ingredients are (1) the sampling strategies which we use to obtain new samples in each wave, (2) the correction of the sampling distribution with raking weights, and (3) the way we quantify uncertainty of our estimates. These three ingredients are independent of each other and can be combined freely. For instance, we choose any of the four sampling strategies introduced in the next section (random sampling, stratified random sampling with/without equal-size batches, and Neyman’s sampling) to sample from the pool of data. Afterwards, we apply raking weights (Section “Raking weights”) to reduce sampling bias and ensure that the distribution of key risk factors in the binary samples drawn up to each wave resembles the distribution observed in the overall population. Finally, we quantify the uncertainty in the sampling process with the help of either frequentist confidence bands or Bayesian credible intervals (Section “Confidence bands”).
Sampling Strategies
To adaptively estimate T(y), we assume that we can iteratively draw samples from y according to some sampling strategy. The sampling of any stratum will stop as soon as it is depleted, while the other non-depleted strata may continue to be sampled. The number of batches depends on both the reservoir of samples as well as on the batch size B∈ℕ we employ for sampling. Each time we draw samples from X, we draw samples for the same individuals from y, which are then used to approximate T(y).
Four sampling rules are considered in this article in order to draw samples from X and y, which are defined as follows:
1. We draw B samples per wave while disregarding any strata information. This will be referred to as “random sampling”.
2. We draw [B/m] samples per stratum, where the brackets denote the floor operation that rounds down to the nearest integer. This will be referred to as “stratified random sampling 1” or “stratified1”.
3. We draw [B*ws] samples for each stratum s∈{1,…,m}, where ws∈[0,1] is the proportion of samples in stratum s with respect to the total number of samples. This will be referred to as “stratified random sampling 2” or “stratified2”. Note that for this method, we always weight with respect to the initial proportion of samples in the strata, as opposed to the proportions while sampling is in progress.
4. The fourth technique, called “Neyman’s sampling”, works as follows (Shepherd et al, 2021). Denote with Ns the sample size in each stratum s∈{1,…,m} with respect to the data that has already been validated. We aim to compute the new allocation of samples to each stratum in wave k. Denote with ni the overall number of samples spent in each wave i (in our case, this will be the batch size). Then, $\sum_{i=1}^{k} n_i$ is the total number of samples spent in the first k waves. Denote the standard deviation of the data in each stratum s∈{1,…,m} as σs>0 (with respect to the data that has already been validated). Neyman’s sampling computes the new sample size nk,s for the current wave k and for each stratum s∈{1,…,m} as
$$n_{k,s} = \left(\sum_{i=1}^{k} n_i\right) \frac{N_s \sigma_s}{\sum_{j=1}^{m} N_j \sigma_j},$$
where the sample sizes (nk,1,…,nk,m) can additionally be reweighted to sum up to the batch size B in order to spend exactly B new samples in each wave. The goal behind Neyman’s sampling is to optimize sample allocation (with respect to minimizing the variance) based on the variance within strata.
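As an illustration of how the three stratum-aware rules distribute one batch, the following Python sketch computes per-stratum allocations. It is a minimal sketch under stated assumptions: the function names are hypothetical, the Neyman rule is written as the classical allocation proportional to the product of stratum size and stratum standard deviation and rescaled directly to the batch size, and the fallback to equal allocation when all standard deviations are zero is a choice made here for illustration.

```python
import math

def allocate_stratified1(batch_size, n_strata):
    # Equal-size batches: floor(B / m) samples for every stratum.
    return [batch_size // n_strata] * n_strata

def allocate_stratified2(batch_size, initial_sizes):
    # Batches proportional to the initial strata sizes: floor(B * w_s).
    total = sum(initial_sizes)
    return [math.floor(batch_size * size / total) for size in initial_sizes]

def allocate_neyman(batch_size, stratum_sizes, stratum_sds):
    # Allocation proportional to N_s * sigma_s, rescaled to the batch size.
    products = [n * sd for n, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(products)
    if total == 0:
        # Assumption: without any variance information, fall back to equal batches.
        return allocate_stratified1(batch_size, len(stratum_sizes))
    return [math.floor(batch_size * prod / total) for prod in products]

print(allocate_stratified1(100, 5))
print(allocate_stratified2(100, [50, 30, 20]))
print(allocate_neyman(100, [10, 10], [1.0, 3.0]))
```

Because of the floor operation, the allocations may sum to slightly less than the batch size; how the remainder is distributed is left open in this sketch.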
Three computational aspects are of importance. First, the standard deviation per stratum needed for Neyman’s sampling can be volatile at the start of the sampling process when sample sizes are low. Therefore, a robust estimator of scale such as the median absolute deviation (mad) can be more suitable than the conventional standard deviation estimator.
Second, the batch size plays an important role at the start of the sampling since larger batch sizes at the start likewise reduce volatility in the estimates. In later waves, when each stratum has already been sampled from and the standard deviation estimate is more stable, the choice of the batch size is less important.
Third, in Neyman’s sampling it can happen that the sample size of a stratum for the new wave is reduced to zero. In this case, no further sample will be drawn in future waves, which might introduce bias. It can therefore be advantageous to introduce a minimal batch size per stratum to prevent the scenario in which a stratum is excluded from being sampled in all future waves.
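The first and third safeguards, a robust scale estimate and a minimum batch per stratum, could be sketched as follows. The function names, the scaling constant 1.4826 (the usual consistency factor of the median absolute deviation under normality), and the capping at the number of charts remaining in a stratum are assumptions for illustration and are not taken from the “chartreview” package.

```python
import statistics

def mad_scale(values, constant=1.4826):
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    centre = statistics.median(values)
    deviations = [abs(v - centre) for v in values]
    return constant * statistics.median(deviations)

def enforce_minimum(allocation, minimum, remaining):
    # Raise each stratum to at least `minimum` samples, but never request
    # more charts than the stratum still holds (depleted strata get zero).
    return [min(max(n, minimum), left) for n, left in zip(allocation, remaining)]

print(mad_scale([1, 2, 3, 4, 100]))
print(enforce_minimum([0, 50, 50], 10, [100, 100, 5]))
```

After enforcing the minimum, the allocation may exceed the batch size, so a final rescaling step may still be needed.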
Confidence Bands
After having drawn a new batch of B samples from y, we aim to update our knowledge on the quantity T(y) which we aim to estimate. To this end, we aim to compute valid confidence bounds for T(y), that is, confidence bounds that contain the quantity being monitored simultaneously over successive batches being sampled, thus allowing us to make statements on the precision or futility of the chart validation process.
In the case that T is the PPV, and y is binary, we compute binomial confidence bounds for T(y). We consider two options. The first are the confidence bands of Lai,14 and the other are Bayesian credible intervals.
For Lai’s confidence bands, we assume that in any wave, we have observed s∈ℕ successes among k∈ℕ samples being drawn from y. The overall error probability is denoted by α∈(0,1). To compute Lai’s bounds, we solve (k+1)b(k,p,s)=α for p, where b(k,p,s)=k!/(s!(k-s)!) p^s (1-p)^(k-s) is the probability mass function of the binomial distribution. For 0<s<k, the aforementioned equation has two solutions p1 and p2 for fixed α, which then form the confidence interval [p1,p2] for p. As proven in Lai (1976), this construction yields confidence bounds with an overall coverage of 1-α which can be updated as more samples are drawn and thus more successes are observed.
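The defining equation (k+1)b(k,p,s)=α can be solved numerically, since b(k,p,s) increases in p up to s/k and decreases afterwards. The following Python sketch uses bisection on the log scale. It is an illustration under stated assumptions rather than the implementation in the “chartreview” package: the function names are hypothetical, and setting the lower bound to 0 when s=0 and the upper bound to 1 when s=k is a convention chosen here for the cases in which only one solution exists.

```python
import math

def log_binom_pmf(k, p, s):
    # Log of b(k, p, s) = C(k, s) * p^s * (1 - p)^(k - s), for 0 < p < 1.
    log_coef = math.lgamma(k + 1) - math.lgamma(s + 1) - math.lgamma(k - s + 1)
    return log_coef + s * math.log(p) + (k - s) * math.log1p(-p)

def lai_interval(k, s, alpha=0.05, tol=1e-10):
    eps = 1e-12
    target = math.log(alpha) - math.log(k + 1)

    def inside(p):
        # True where (k + 1) * b(k, p, s) > alpha.
        return log_binom_pmf(k, p, s) > target

    def boundary(p_in, p_out):
        # Bisection between a point inside the interval and one outside it.
        if inside(p_out):
            return p_out
        while abs(p_out - p_in) > tol:
            mid = 0.5 * (p_in + p_out)
            if inside(mid):
                p_in = mid
            else:
                p_out = mid
        return 0.5 * (p_in + p_out)

    # The maximiser s / k always lies inside the interval.
    mode = min(max(s / k, eps), 1 - eps)
    lower = 0.0 if s == 0 else boundary(mode, eps)
    upper = 1.0 if s == k else boundary(mode, 1 - eps)
    return lower, upper

print(lai_interval(100, 80))
```

Working on the log scale avoids numerical underflow of the binomial probabilities when the number of reviewed charts becomes large.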
To compute Bayesian credible intervals, we consider a Beta-Binomial model, given by a Beta(1,1) prior on the unobserved parameter p which is then updated once Binomial samples are observed. This results in a Beta(1+s,1+k-s) posterior after having observed s successes among k samples. To arrive at a credible interval, we compute the α/2-quantile q_{α/2} and the (1-α/2)-quantile q_{1-α/2} of the Beta(1+s,1+k-s) posterior, which then form the credible interval [q_{α/2},q_{1-α/2}]. Note that no alpha spending is required in the Bayesian case.
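To make the posterior quantiles concrete, the following Python sketch approximates them on a fine grid using only the standard library. The function name and the grid-based approximation are assumptions made for this illustration; in practice a Beta quantile function (for example qbeta in R) would typically be used instead.

```python
import math

def beta_posterior_interval(k, s, alpha=0.05, n_grid=20000):
    # Posterior is Beta(1 + s, 1 + k - s), with density proportional to
    # p^s * (1 - p)^(k - s). Evaluate it at grid midpoints on the log scale.
    xs = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_dens = [s * math.log(x) + (k - s) * math.log1p(-x) for x in xs]
    peak = max(log_dens)
    weights = [math.exp(v - peak) for v in log_dens]
    total = sum(weights)

    def quantile(q):
        # Smallest grid point whose cumulative posterior mass reaches q.
        acc = 0.0
        for x, w in zip(xs, weights):
            acc += w
            if acc >= q * total:
                return x
        return xs[-1]

    return quantile(alpha / 2), quantile(1 - alpha / 2)

print(beta_posterior_interval(100, 80))
```

The grid resolution limits the accuracy to roughly 1/n_grid, which is sufficient for illustrating the interval but coarser than a dedicated quantile routine.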
The previous approach can be extended in a straightforward fashion to a continuous quantity being validated. In this case, we compute normal confidence bands. Further details are provided in the Appendix.
Raking Weights
When sequentially evaluating T(y), we aim to ensure that the distribution of key risk factors in the binary samples drawn up to each wave resembles the distribution observed in the overall population. Ensuring that survey results represent the target population is important in order to reduce sampling bias or bias in the survey mode being used and avoid potentially misleading conclusions.
We correct for finite sample behavior with the help of a procedure called raking. Raking allows us to first identify whether the distribution of risk factors in the sample is discrepant from that of the overall population, and second to compute raking weights that correct the discrepancy. Raking has a tuning parameter (the raking threshold) that controls the allowable difference between the source population and the sample distribution. We apply the R-package “anesrake” (available on CRAN16) with default parameters to the vector y.
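The effect of such weights can be illustrated in the simplest case of a single categorical risk factor, where raking reduces to post-stratification: each sampled chart is weighted by the ratio of the population share to the sample share of its stratum. The Python sketch below is an illustration only; it does not reproduce “anesrake”, which iterates over several margins, and the function names are hypothetical.

```python
from collections import Counter

def single_margin_weights(sample_strata, population_shares):
    # Weight = population share of the stratum / sample share of the stratum.
    counts = Counter(sample_strata)
    n = len(sample_strata)
    return [population_shares[g] / (counts[g] / n) for g in sample_strata]

def weighted_ppv(confirmed, weights):
    # Weighted proportion of algorithm-positive charts confirmed by review.
    return sum(w * c for w, c in zip(weights, confirmed)) / sum(weights)

strata = ["a", "a", "a", "b"]          # stratum "a" is over-represented
confirmed = [1, 1, 1, 0]               # chart review outcome per sampled chart
weights = single_margin_weights(strata, {"a": 0.5, "b": 0.5})
print(weighted_ppv(confirmed, weights))
```

In this toy example the unweighted PPV is 0.75, while the weighted estimate gives the under-sampled stratum its population share.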
Adaptive Multi-Wave Chart Review Sampling Process
The complete adaptive chart review sampling process is shown in Figure 1. In a prototypical setting, the investigators would create a chart review protocol or annotation guide for the phenotype or outcome of interest. A claims-based algorithm for the outcome would be proposed and used to classify patients as either having or not having the endpoint of interest. Sequential samples of patients can then be identified for multi-wave chart review. These samples can be obtained via random sampling, stratified random sampling 1 and 2, or Neyman’s sampling. Each sample will contain data in X and y, where X contains data on patient risk factors that define strata for stratified sampling, and y reflects the error-prone claims-based algorithm classification of the endpoint. The algorithm works along the following steps.
Figure 1 Flowchart of the adaptive multi-wave chart review sampling process.
First, the data are sampled. We use either random sampling, stratified random sampling 1 and 2, or Neyman’s sampling to determine how to allocate a new batch of B samples to the m strata defined by X. Then, trained chart reviewers evaluate notes and other information from electronic health records for the sampled patients to determine the “gold” or reference standard classification. Next, the measurement performance characteristic of interest (e.g., PPV) is computed for all patients sampled from all prior or current samples, who have chart reviewer determined reference values. The point estimate and associated confidence intervals are estimated after using raking weights to re-weight the risk factor distribution of the sample to resemble the overall distribution in the whole population. We compute 1-α (frequentist) confidence bands or Bayesian credible intervals for T(y). The confidence intervals are updated with each new sequential sample of size B.
The chart review process stops once a pre-specified stopping criterion is met. For instance, we might stop the chart review if the confidence bound (or the credible interval in the Bayesian case) hits some pre-specified lower or upper stopping boundary. Usually, these boundaries are not symmetric, in the sense that if the parameter of interest falls below a threshold, this might indicate that the validation is so poor that the gold standard is applied to an entire cohort, while in the case that the parameter of interest falls above a threshold, the validation is so good that further validation is unnecessary. In other scenarios, we might not be interested in pre-specified (lower and upper) stopping bounds but rather stop the chart review once a certain length/width of the confidence bounds is reached, thus reflecting a desired level of confidence irrespective of the PPV.
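The stopping logic described above can be written as a small decision function. This Python sketch is illustrative: the function name, the returned labels, and the optional width criterion argument are assumptions, with τ1 and τ2 denoting the lower and upper stopping thresholds.

```python
def stopping_decision(lower, upper, tau1, tau2, max_width=None):
    # Lower bound above tau1: performance is sufficiently high.
    if lower > tau1:
        return "efficacy"
    # Upper bound below tau2: further review is futile.
    if upper < tau2:
        return "futility"
    # Optional alternative criterion: the interval is narrow enough.
    if max_width is not None and upper - lower < max_width:
        return "precision"
    return "continue"

print(stopping_decision(0.77, 0.86, 0.75, 0.75))
print(stopping_decision(0.60, 0.72, 0.75, 0.75))
print(stopping_decision(0.70, 0.82, 0.75, 0.75))
```

When τ1 is strictly smaller than τ2, both threshold conditions can hold at once; this sketch then reports the first one.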
The precise setting is dependent on the validation under consideration. In our simulations in the next section, the stopping boundaries often depend on the target PPV, and are therefore given individually for all experiments.
Results
This section presents our simulation results. We start with a description of our simulation setting, followed by a visual assessment of the four sampling strategies (random sampling, stratified random sampling 1 and 2, and Neyman’s sampling) and the two types of intervals (confidence bands and Bayesian credible intervals) in a single simulated run of the chart review process for a non-linked and linked scenario, where “linked” means that the patient baseline characteristics in the data matrix X are risk factors for y, the response variable. Afterwards, we quantitatively assess all techniques in a simulation study with respect to the PPV, the strength of the link between patient baseline characteristics and response, and the distribution of the strata sample sizes. We conclude by showcasing a heuristic to predict when the validation is complete, while our algorithm is running.
Simulation Setup
We consider a dataset of n = 6936 frailty score measurements in Medicare enrollees with liver disease. The measured frailty score values fall into the interval [0, 0.5]. We therefore stratify them into five strata, given by [0,0.1), [0.1,0.2), [0.2,0.3), [0.3,0.4), and [0.4,0.5]. These values are used as the data X, which is used in stratified random sampling and Neyman’s sampling strategies.
We focus on a binary response and choose the PPV as the performance characteristic of interest. We assume that our data only include patients for whom the claims-based algorithm indicated that the outcome was present (for instance, y = 1). We generate a binary reference standard classification of the outcome of interest that would be obtained after chart review.
In the Sections “Comparison of the four sampling approaches in one simulated run” and “Quantitative assessment as a function of PPV” we consider two scenarios, a non-linked scenario and a linked scenario. In the non-linked scenario, the data (frailty) is not related to the binary outcome, which is generated from a Bernoulli(p) distribution, where the success probability p is chosen as the PPV. In the linked scenario, the binary outcome is generated such that there is a different success probability per stratum in such a way that the resulting PPV matches some predefined value. The two scenarios are chosen to demonstrate the performance of the chart validation in cases where there truly is an association versus there is not.
In the Section “Quantitative assessment as a function of link strength”, we investigate the behavior of our algorithm for different linkage strengths between the data and the response. To achieve this, we again generate the binary outcome such that there is a different success probability per stratum while ensuring that the resulting overall PPV matches some predefined value. The choice of the success probabilities per stratum allows for one degree of freedom, their standard deviation. The standard deviation controls if the success probabilities are very similar across strata or very different. We regard the standard deviation of success probabilities across strata as a measure of linkage strength, where a low standard deviation indicates a low strength.
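One way to construct per-stratum success probabilities with a prescribed overall PPV and a prescribed standard deviation is sketched below in Python. The construction is an assumption made for illustration, since the text does not state how the probabilities were chosen: a linear trend across strata is centred so that its size-weighted mean is zero, and then scaled so that the (unweighted, population) standard deviation across strata equals the target.

```python
import math

def linked_probabilities(stratum_sizes, target_ppv, target_sd):
    m = len(stratum_sizes)
    total = sum(stratum_sizes)
    shares = [size / total for size in stratum_sizes]
    # Linear pattern 0, 1, ..., m - 1, centred to have size-weighted mean zero.
    centre = sum(share * i for share, i in zip(shares, range(m)))
    deviations = [i - centre for i in range(m)]
    # Unweighted standard deviation of the pattern across strata.
    mean_dev = sum(deviations) / m
    sd_dev = math.sqrt(sum((d - mean_dev) ** 2 for d in deviations) / m)
    scale = target_sd / sd_dev
    probs = [target_ppv + scale * d for d in deviations]
    if min(probs) < 0 or max(probs) > 1:
        raise ValueError("target_sd too large for valid probabilities")
    return probs

print(linked_probabilities([1000, 2000, 2000, 1500, 436], 0.8, 0.05))
```

The size-weighted mean of the returned probabilities equals the target PPV by construction, so the overall PPV is preserved regardless of the linkage strength.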
In order to investigate the behavior of our algorithm for different strata imbalances, we also vary the size of the five strata defined over [0, 0.5] that the frailty score values fall in (Section “Quantitative assessment as a function of strata balance”). They are chosen such that for our given dataset of frailty score measurements, we obtain most samples in the first stratum (which includes 0, denoted as “left skewed”), in the last stratum (which includes 0.5, denoted as “right skewed”), or we obtain a balanced allocation (denoted as “balanced”).
When running our adaptive multi-wave chart review sampling process, we employ a batch size of 100. Raking is always employed to standardize the distribution of risk factors to resemble the population for which the performance characteristic is evaluated. Confidence bands are always computed with overall error α=0.05, and Bayesian credible intervals are computed by calculating the α/2-quantile q_{α/2} and the (1-α/2)-quantile q_{1-α/2} of the posterior, likewise with α=0.05. We employ the “mad” (median absolute deviation) as well as a minimum batch size of 10 for each stratum when running our simulated chart review processes. All results presented in the tables are averages of 100 repetitions.
Comparison of the Four Sampling Approaches in One Simulated Run
Figure 2 showcases one run of a simulated chart review process in the non-linked scenario with PPV = 0.8 and the two stopping boundaries (τ1=τ2=0.75). The rows show the four sampling strategies (random sampling, stratified random sampling 1 and 2, and Neyman’s sampling), while the columns show the two types of intervals (confidence bands and Bayesian credible intervals). The horizontal lines indicate the stopping bounds. The vertical lines indicate the point at which the stopping criteria were met, in the sense that either the lower confidence limit exceeds the pre-specified threshold τ1, or the upper confidence limit falls below the pre-specified threshold τ2.
Figure 2 Estimation of the PPV in multi-wave samples when the PPV is unrelated to a risk factor for the outcome being evaluated (non-linked response with PPV=0.8 and stopping criteria τ1=τ2=0.75). Random sampling (first row; A and B), stratified random sampling 1 (second row; C and D), stratified random sampling 2 (third row; E and F), and Neyman’s sampling (fourth row; G and H), using either Lai’s confidence bands (left column) or Bayesian credible intervals (right column). The horizontal lines indicate the stopping bounds, the vertical lines indicate the stopping time.
Three observations are noteworthy. First, in the depicted run, Neyman’s sampling performs best in the sense that it allows for the earliest stop, with fewest charts validated, while random sampling performs worst. Second, among the two variants of stratified random sampling, the second variant using batches drawn proportionally to the strata sizes performs better. Third, Bayesian credible intervals seem to yield tighter bounds than Lai’s (frequentist) confidence sequences.
Figure 3 shows a similar evaluation in the case of a linked response, where the success probabilities for all strata were chosen individually but such that an overall PPV of 0.8 was met. The stopping boundaries are again τ1=τ2=0.75. Here, a different picture is observed, with stratified random sampling 2 performing best for confidence bands, and Neyman’s sampling performing best for Bayesian credible intervals. As before, Bayesian credible intervals seem to yield tighter confidence bounds than frequentist confidence sequences.
Figure 3 Estimation of the PPV in multi-wave samples when the PPV is related to a risk factor for the outcome being evaluated (linked response with PPV=0.8 and stopping criteria τ1=τ2=0.75). Random sampling (first row; A and B), stratified random sampling 1 (second row; C and D), stratified random sampling 2 (third row; E and F), and Neyman’s sampling (fourth row; G and H), using either Lai’s confidence bands (left column) or Bayesian credible intervals (right column). The horizontal lines indicate the stopping bounds, the vertical lines indicate the stopping time.
Figure 4 considers a different stopping criterion, the length of the confidence interval. To be precise, we stop the validation process whenever the length of the confidence interval at the current batch falls below 0.05. The simulation scenario is the same as for Figure 3, meaning a linked response with PPV = 0.8. We observe that all four approaches and both types of confidence bands perform similarly, in the sense that no method seems to have an obvious edge over the others.
Figure 4 Estimation of the PPV in multi-wave samples when the PPV is related to a risk factor for the outcome being evaluated (linked response with PPV=0.8, stopping criterion is interval length<0.05). Random sampling (first row; A and B), stratified random sampling 1 (second row; C and D), stratified random sampling 2 (third row; E and F), and Neyman’s sampling (fourth row; G and H), using either Lai’s confidence bands (left column) or Bayesian credible intervals (right column). The horizontal lines indicate the stopping bounds, the vertical lines indicate the stopping time.
Quantitative Assessment as a Function of PPV
Table 1 assesses the performance of all four approaches (random sampling, stratified random sampling 1 and 2, Neyman’s sampling) and the two types of intervals (confidence bands and Bayesian credible intervals) for different values of PPV. In Table 1, the PPV is varied across several values (including 0.4, 0.6, 0.7, and 0.8) while the stopping boundaries stay fixed at τ1=τ2=0.75.
Table 1 Non-Linked Scenario for Different PPV and Fixed Stopping Criterion
We assess the performance of the sampling strategies with three measures. First, we show the proportion of times that the chart review process ended due to meeting one of the stopping criteria (columns 3 and 4), where a higher proportion is better. Second, we display the proportion of runs that were stopped due to futility (columns 5 and 6). The proportion of runs stopped due to sufficient or even superior efficacy is given accordingly as one minus the futility proportion. Third, we measure the number of sampling waves that were required before the stopping criterion was met (columns 7 and 8), where fewer is better.
In Table 1, we observe that all runs manage to stop. For a PPV of 0.4, the stopping times are low due to the fact that the PPV is far away from the lower stopping boundary. As the PPV increases to 0.6 and 0.7, the validation takes longer as the decision of futility is not as straightforward any more. This behavior is expected. Here, random sampling and stratified sampling 1 perform best. When the PPV is increased to 0.8, the stopping times increase further, with random sampling and stratified sampling 2 being best in connection with Lai’s confidence intervals, and random sampling and stratified sampling 1 performing best in connection with Bayesian credible intervals. Moreover, Bayesian credible intervals allow for considerably shorter stopping times than frequentist Lai’s confidence bands. As expected, the proportion of runs stopped due to futility is essentially one for the runs with a PPV below the stopping threshold. Interestingly, this proportion stays at one even for a PPV of 0.7, and then switches to zero suddenly for a PPV of 0.8.
A similar comparison in which the PPV is varied while the stopping boundaries τ1 and τ2 are varied accordingly can be found in the Appendix (Table A1).
Quantitative Assessment as a Function of Linkage Strength
We aim to assess the behavior of all four methods and both confidence intervals as a function of the linkage strength between the data and the response. The setup we use, and our definition of the linkage strength, is defined in the simulation setup. The quantitative results of this experiment are given in Table 2 for a fixed PPV of 0.8 and stopping boundaries τ1=τ2=0.75.
Table 2 Variable Linkage Strength at Fixed PPV=0.8 and Stopping Criterion τ1=τ2=0.75
We observe in Table 2 that the stopping time seems to increase as the linkage strength increases. This is clearly visible when considering Bayesian credible intervals. Random sampling seems to perform best overall across all considered linkage strengths, while Neyman’s sampling is well suited for larger linkage strengths. The performance of stratified random sampling 1 and 2 often lies between that of random sampling and Neyman’s sampling.
The proportion of runs stopped due to futility is again low for a low linkage strength and increases to around 0.3 for higher linkage strengths. This is as expected, since for a low linkage strength (standard deviation) we anticipate the futility proportion to be zero due to the choice of the PPV = 0.8 and stopping thresholds τ1=τ2=0.75 used in this experiment. However, when increasing the linkage strength (standard deviation), we introduce higher variability into the runs, meaning we would expect a slight increase in stopping due to futility.
Quantitative Assessment as a Function of Strata Balance
Finally, we aim to assess the influence of the strata balance on the performance of our algorithms. To this end, we fix the PPV at 0.8, and choose the stopping boundaries as τ1=τ2=0.75. However, we change the definition of the five strata that was given in the simulation setup in such a way that the number of samples falling into the five strata is either left skewed, balanced, or right skewed.
Tables 3 and 4 show the results of this experiment. Table 3 considers the non-linked scenario, showing that imbalanced strata seem to yield slightly faster stopping times than balanced ones, especially in connection with Bayesian credible intervals. In particular, Neyman’s sampling seems to perform very well across the scenarios considered.
Table 3 Non-Linked Scenario with Strata Balance for PPV=0.8 and Stopping Criterion τ1=τ2=0.75
Table 4 Linked Scenario with Strength(sd)=0.05
Table 4 repeats the same experiment for the linked scenario with linkage strength (standard deviation) 0.05. Here, a different picture is observed, with the balanced scenario being the easiest to validate, while the two imbalanced scenarios require longer stopping times across all methods. In particular, stratified random sampling 1 and Neyman’s sampling seem to perform very well across all scenarios. As before, the use of Bayesian credible intervals yields much faster stopping times than the frequentist counterpart.
In both Tables 3 and 4, the proportion of runs stopped due to futility is essentially zero. This is as expected, since it is consistent with previous findings (Section “Quantitative assessment as a function of PPV”) for the choice of PPV = 0.8 and stopping thresholds τ1=τ2=0.75 which were used here as well.
Prediction of Stopping
An important feature is the prediction of the point at which the validation will be complete. This prediction is available during runtime, in the sense that at any point, we aim to compute an estimate of the remaining number of batches and, equivalently, the remaining number of samples that would be needed.
To this end, at any point, we sample from a Bernoulli distribution with target probability set to the empirical mean of the observed data in y. This simulates a continued run of the chart review. Once the validation stops based on the updated confidence bands, we record the number of steps. By repeating this process, one can additionally compute confidence bands on the prediction.
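The forward simulation described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, a simple normal-approximation (Wald) interval is used as a stand-in for Lai’s bands or the Bayesian credible interval and can be swapped out via the `interval` argument, and the cap on the number of simulated batches is added only to guarantee termination.

```python
import math
import random
import statistics

def wald_interval(s, k, z=1.96):
    # Stand-in interval (assumption): normal approximation for a proportion.
    p = s / k
    half = z * math.sqrt(p * (1 - p) / k)
    return max(0.0, p - half), min(1.0, p + half)

def predict_remaining_batches(s, k, batch_size, tau1, tau2, n_sim=200,
                              max_batches=200, interval=wald_interval, seed=0):
    rng = random.Random(seed)
    p_hat = s / k  # empirical mean of the chart review results so far
    results = []
    for _ in range(n_sim):
        sim_s, sim_k, batches = s, k, 0
        while batches < max_batches:
            lower, upper = interval(sim_s, sim_k)
            if lower > tau1 or upper < tau2:
                break
            # Continue the run with Bernoulli(p_hat) draws for one more batch.
            sim_s += sum(rng.random() < p_hat for _ in range(batch_size))
            sim_k += batch_size
            batches += 1
        results.append(batches)
    results.sort()
    low = results[int(0.025 * n_sim)]
    high = results[min(n_sim - 1, int(0.975 * n_sim))]
    return statistics.median(results), low, high

print(predict_remaining_batches(40, 50, 100, 0.75, 0.75))
```

The function returns the median predicted number of remaining batches together with empirical 2.5% and 97.5% quantiles over the simulated continuations, which play the role of the band on the prediction.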
Figure 5 shows the results of such a run. The top panel displays a run of the chart review process with stopping criteria τ1=0.78, τ2=0.82. As can be seen, validation is achieved close to batch number 40. The bottom panel shows the prediction of the remaining number of batches until completion, computed while the algorithm is running. The confidence band on the prediction is shaded in gray. We observe that the trend is captured correctly, though the number of remaining steps is underestimated by a factor of 2 on average.
Figure 5 Prediction of the number of additional samples needed until the stopping criterion will be met. Top (A): Run of the multi-wave chart review process with Lai’s confidence bounds and Neyman’s sampling as a function of the batch number. PPV is associated with a risk factor for the outcome (linked response). Bottom (B): Prediction of the remaining steps until stopping as a function of the batch number including a confidence band (gray) for the estimate. Both plots have the same x-axis. The vertical lines indicate the stopping time.
Discussion
This article considers the validation of a claims-based algorithm using an efficient multi-wave chart sampling strategy to determine the reference standard and estimate performance characteristics. We investigate the use of four sampling strategies (random sampling, stratified random sampling 1 and 2, and Neyman’s sampling), and the use of either frequentist confidence bands or Bayesian credible intervals to compute bounds on a binary response to be validated. To this end, we conduct several quantitative assessments of the aforementioned algorithms as a function of the PPV, linkage strength, and strata balance.
While no method uniformly performs best in all of our experiments, we observe that both random sampling and Neyman’s sampling yield the best performance overall, in the sense that they allow for the earliest stopping with fewest charts validated in the majority of our experiments. Moreover, we observe that Bayesian credible intervals yield tighter confidence bounds than Lai’s frequentist confidence sequences and thus allow for considerably shorter stopping times. This trend can be seen throughout our study for both the non-linked and linked scenarios. Given its simplicity, we recommend using random sampling over Neyman’s sampling in connection with Bayesian credible intervals for efficient chart validation.
Our adaptive multi-wave chart review has several tuning parameters: the choice of the sampling strategy, the choice of the confidence bands and the error level α with which they are computed, the batch size B, the minimal number of samples to spend in each wave, and the threshold for raking. The error level α should be chosen according to the use case. Moreover, we recommend selecting the batch size B such that the pool of N samples allows for a considerable number of updates (around 20 or 50, meaning B=N/20 or B=N/50), because repeated updates of the confidence bounds will generally allow for earlier stopping and thus improved performance. We do not find that the minimal number of samples to spend per wave or the raking threshold has a considerable influence on the results in our experiments.
Another contribution of this work is an algorithm to predict the point at which the validation of a gold standard is possible. We attempt this during runtime and provide, at any point, an estimate of the remaining number of batches that would need to be reviewed as well as a confidence region around the estimate. While this is work in progress and the stopping time estimates are quite volatile at the start of each run, we demonstrated that it is possible to approximately predict the stopping time at which the validation would be completed.
We believe that our recommendations are immediately applicable to real-world settings, allowing one to accelerate chart reviews by validating a response with fewer samples. When implementing our approach in a real-world setting, a team of medical experts would review and classify an initial batch of charts. The obtained information would be used in our algorithm to update the performance characteristics such as the PPV, after which a new batch would be determined for continued expert review. This process would continue until a decision on sufficiently high performance or futility can be made. However, there are two important limitations in real-world settings. The first pertains to data quality, as the chart review in real-world settings might be based on incomplete, missing, or biased data. The second pertains to data volume. As real-world data are finite, it can happen that no validation is possible before the data are depleted. In this case, it is still possible to base decisions on point estimates, but it should be noted that the chart review did not formally conclude.
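The review workflow described above can be illustrated with a short sketch. It assumes simple random sampling, a uniform Beta(1, 1) prior on the PPV, equal-tailed credible intervals approximated by Monte Carlo draws, and a stopping rule in which success is declared once the lower bound exceeds τ2 and futility once the upper bound falls below τ1. These choices, as well as the names `adaptive_chart_review` and `review_chart` (a hypothetical callback standing in for expert review), are illustrative assumptions rather than the study's implementation.

```python
import random


def credible_interval(successes, failures, level=0.95, draws=4000, rng=None):
    """Monte Carlo equal-tailed credible interval under an assumed Beta(1, 1) prior."""
    rng = rng or random.Random()
    samples = sorted(
        rng.betavariate(1 + successes, 1 + failures) for _ in range(draws)
    )
    tail = (1 - level) / 2
    return samples[int(tail * (draws - 1))], samples[int((1 - tail) * (draws - 1))]


def adaptive_chart_review(pool, review_chart, batch_size, tau1, tau2, rng=None):
    """Review charts in random batches until success, futility, or depletion.

    review_chart(chart) is a hypothetical callback returning True when the
    expert confirms the event. Returns (status, charts reviewed, interval).
    """
    rng = rng or random.Random()
    remaining = list(pool)
    rng.shuffle(remaining)  # simple random sampling without replacement
    successes = failures = 0
    lower, upper = 0.0, 1.0
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        labels = [bool(review_chart(chart)) for chart in batch]
        successes += sum(labels)
        failures += len(labels) - sum(labels)
        # Update the bounds on the PPV after each reviewed batch.
        lower, upper = credible_interval(successes, failures, rng=rng)
        if lower > tau2:
            return "validated", successes + failures, (lower, upper)
        if upper < tau1:
            return "futile", successes + failures, (lower, upper)
    # Pool depleted: only a point estimate is available, no formal conclusion.
    return "inconclusive", successes + failures, (lower, upper)
```

With a pool of N charts, `batch_size = N // 20` corresponds to the B=N/20 setting mentioned earlier. The `"inconclusive"` status mirrors the case in which the data are depleted before a formal decision is reached.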
The main limitation of this work lies in the tuning of Neyman’s algorithm. Indeed, Neyman’s sampling is quite generic in that it allows for various tuning parameters, such as the estimate of the standard deviation, the batch size, the threshold for applying raking, or the choice of the minimal number of samples to spend per stratum and batch.
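To make two of these tuning parameters concrete, the following sketch shows textbook Neyman allocation for one batch, in which the number of charts drawn from stratum h is proportional to N_h·S_h (stratum size times within-stratum standard deviation). It is a simplified illustration only: the standard deviation is taken as that of a Bernoulli variable with the current stratum PPV estimate, raking is omitted, stratum sizes are not used as caps, and the function name and parameters are hypothetical rather than taken from the study.

```python
import math


def neyman_allocation(stratum_sizes, stratum_ppv, batch_size, min_per_stratum=1):
    """Split one batch across strata proportionally to N_h * S_h.

    stratum_ppv holds the current PPV estimate per stratum; S_h is the
    Bernoulli standard deviation sqrt(p * (1 - p)) (an assumed estimator).
    """
    k = len(stratum_sizes)
    free = batch_size - min_per_stratum * k  # charts left after the minimum
    if free < 0:
        raise ValueError("batch_size is too small for the per-stratum minimum")
    sds = [math.sqrt(p * (1 - p)) for p in stratum_ppv]
    weights = [n * s for n, s in zip(stratum_sizes, sds)]
    total = sum(weights)
    if total == 0:
        # All estimated SDs are zero: fall back to proportional allocation.
        weights, total = list(stratum_sizes), sum(stratum_sizes)
    raw = [free * w / total for w in weights]
    alloc = [min_per_stratum + math.floor(r) for r in raw]
    # Hand out charts lost to rounding by largest fractional remainder.
    leftover = batch_size - sum(alloc)
    order = sorted(range(k), key=lambda h: raw[h] - math.floor(raw[h]), reverse=True)
    for h in order[:leftover]:
        alloc[h] += 1
    return alloc
```

For example, `neyman_allocation([500, 300, 200], [0.9, 0.8, 0.5], batch_size=30, min_per_stratum=2)` assigns more charts per capita to the stratum whose estimated PPV is closest to 0.5, since its outcome is the most variable.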
Conclusions
Given its simplicity, we recommend using random sampling for efficient validation of a response in both the binary and the continuous case. Bayesian credible intervals are preferred as they yield tighter bounds than their frequentist counterparts. Future work includes the tuning of the various parameters for each scenario. Moreover, better understanding how the chart review process depends on the choice of the stopping criterion (for instance, stopping based on the width of the confidence interval, or the selection of lower and upper bounds) is an interesting area of further work.