Sample size recommendations for studies on reliability and measurement error: an online application based on simulation studies

In this results section, we show the effect of study design conditions on the estimation of the ICC and SEM, in terms of bias and MSE (for both ICC and SEM), the coverage of the 95% CI of the ICC, and the influence of various conditions on the width of the 95% CIs of ICCs and SEMs, averaged over the remaining conditions. For tailored results and recommendations about the sample size and number of repeated measurements under specific conditions, we developed an online application (https://iriseekhout.shinyapps.io/ICCpower/) that shows the implications of design decisions. Subsequently, we describe the online application and how to use it to arrive at tailored recommendations for future studies.

Bias and MSE values in ICC estimations

Results for bias showed a slight underestimation of the estimated ICCs, especially with small sample sizes. Overall, the bias was so small as to be negligible: the maximum bias for the ICCs found in any of the conditions was −0.05 (for a sample size of 10 with only 2 raters, of which 1 deviated systematically, in a one-way random effects model with v = 1 and r = 0.7).

In Fig. 1 we plotted the MSE of the ICC estimates against the number of raters (k) per condition of sample size (shown in different colors), for the situation in which one rater systematically deviates and for each of the three statistical models separately. The steepness of the curve declines most between k = 2 and k = 3, especially for sample sizes up to n = 50, so the gain in precision (i.e., the largest change in MSE) is greatest when going from 2 to 3 raters. Moreover, the distance between the curves decreases as n increases, especially for the curves up to n = 40, so the gain in precision is relatively small above n = 40. In other words, the gain in precision diminishes at larger values of n and k. The MSE value for the condition n = 40 and k = 4 is very similar to that for the condition n = 50 and k = 3. As this pattern was seen for all conditions of r and v, we averaged over these conditions in Fig. 1.
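To make these quantities concrete, below is a minimal Monte Carlo sketch, assuming a compound-symmetry generating model (true ICC equal to r, total variance v) and no systematically deviating rater; our simulation study covered additional conditions (such as a deviating rater), so this sketch illustrates the approach rather than reproducing Fig. 1 exactly.

```python
import numpy as np

rng = np.random.default_rng(2024)

def icc_oneway(x):
    """One-way random-effects ICC from an n x k matrix of scores."""
    n, k = x.shape
    grand = x.mean()
    rows = x.mean(axis=1)
    msb = k * np.sum((rows - grand) ** 2) / (n - 1)          # between-subject mean square
    msw = np.sum((x - rows[:, None]) ** 2) / (n * (k - 1))   # within-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)

def bias_mse(n, k, r, v=1.0, n_sim=5000):
    """Monte Carlo bias and MSE of the one-way ICC when the true ICC equals r."""
    est = np.empty(n_sim)
    for s in range(n_sim):
        subjects = rng.normal(0.0, np.sqrt(r * v), size=(n, 1))      # subject effects
        errors = rng.normal(0.0, np.sqrt((1 - r) * v), size=(n, k))  # measurement error
        est[s] = icc_oneway(subjects + errors)
    return est.mean() - r, np.mean((est - r) ** 2)

for n in (10, 20, 30, 40, 50):
    for k in (2, 3, 4):
        b, m = bias_mse(n, k, r=0.7)
        print(f"n={n:3d}, k={k}: bias={b:+.3f}, MSE={m:.4f}")
```

Varying r in this sketch also shows the pattern discussed below for Fig. 2: higher values of r yield lower MSE values.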

Fig. 1

MSE values of ICC estimations with different sample sizes, plotted against k per type of ICC model (one rater systematically deviates; averaged over all conditions of r and v)

The presence of a systematic difference between raters increased the MSE values for the ICCone-way, but not for the ICCagreement and ICCconsistency (see online tool). This means that the required sample size for the one-way effects model increases when a systematic difference between raters occurs, while the required sample sizes for the two-way effects models remain the same.

Next, we noticed an influence of the correlation between scores on repeated measurements (r) on the MSE values for all types of ICCs, specifically when no rater deviated (Fig. 2 shows the MSE per correlation condition for the ICCagreement). That is, a higher correlation (e.g., 0.8 instead of 0.6) leads to lower MSE values. When one rater deviates, r affects the MSE for the ICCconsistency to the same extent, but affects the ICCagreement and ICCone-way to a lesser extent (“Appendix 2”).

Fig. 2

MSE values of ICCagreement estimations plotted against k per condition r (no rater systematically deviated; averaged over all conditions of v)

Bias and MSE in SEM estimations

Overall the bias for the SEM was very small and thus negligible. All results for bias can be found in the online application.

In Fig. 3 we plotted the MSE values of the SEMagreement estimations against the number of raters (k) per condition of sample size (shown in different colors), for one rater with a systematic difference and for each of the three conditions of r. As we saw above for the MSE curves of the ICCs, the steepness of the curves declines most between k = 2 and k = 3, especially for sample sizes up to n = 50. Moreover, the distance between the curves decreases as n increases, especially for the curves up to n = 40. The MSE value for the condition n = 30 and k = 4 is, for each of the three conditions of r, very similar to that for the condition n = 50 and k = 3 when r = 0.6, or n = 40 and k = 3 when r is higher.

Fig. 3

MSE values of SEMagreement estimations with different sample sizes, plotted against k per condition r (one rater systematically deviates; v = 1)

So, we can conclude that the influence of the correlation r on the MSE value for SEM estimations is similar to the influence of r on the MSE values for ICC estimations.

In both the SEMone-way and SEMagreement models all measurement error is taken into account (see “Appendix 1”), so the resulting SEM estimates are equal between these models (Mokkink et al. 2022). The MSE values for the SEMconsistency are nearly the same whether no rater deviates or one rater deviates. When no rater deviates, the MSE values for the SEMone-way and SEMagreement are only slightly lower than for the SEMconsistency (data available in the online application). However, in contrast to the MSE results for the ICC estimations (see Fig. 1), the MSE values for the SEMone-way and SEMagreement increase when one of the raters systematically deviates (see Fig. 4).
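As an illustration of this equality, the following sketch (a simplified illustration, not the code behind the application) estimates the three SEMs from the two-way ANOVA decomposition of an n × k score matrix; note that the SEMone-way and SEMagreement estimates coincide algebraically as long as the rater variance component is not truncated at zero.

```python
import numpy as np

def sem_estimates(x):
    """SEM_one-way, SEM_agreement and SEM_consistency from an n x k matrix
    of n patients scored by k raters (two-way ANOVA decomposition)."""
    n, k = x.shape
    grand = x.mean()
    rows = x.mean(axis=1)                        # patient means
    cols = x.mean(axis=0)                        # rater means
    ss_rows = k * np.sum((rows - grand) ** 2)
    ss_cols = n * np.sum((cols - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_err = ss_err / ((n - 1) * (k - 1))        # residual mean square
    ms_cols = ss_cols / (k - 1)
    var_rater = max((ms_cols - ms_err) / n, 0.0)             # rater variance component
    msw = np.sum((x - rows[:, None]) ** 2) / (n * (k - 1))   # one-way within-subject MS
    return {
        "SEM_one-way": np.sqrt(msw),                   # all error (rater + residual)
        "SEM_agreement": np.sqrt(var_rater + ms_err),  # all error (rater + residual)
        "SEM_consistency": np.sqrt(ms_err),            # residual error only
    }

rng = np.random.default_rng(1)
scores = rng.normal(50, 3, size=(25, 3)) + np.array([0.0, 0.5, -0.5])  # small rater shifts
print(sem_estimates(scores))
```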

Fig. 4

MSE values of SEM estimations for different sample sizes, plotted against k per type of SEM model (one rater systematically deviates; v = 1; averaged over all conditions of r)

Coverage of the confidence intervals of ICCs

With no systematic difference between raters, the coverage of the 95% confidence intervals around the ICC estimation was as expected, i.e., around 0.95 for all three types of ICCs. As no differences were found across the simulation study conditions (i.e., r, v, n, and k), the results for coverage are only separated per type of ICC (Fig. 5, left panel).

Fig. 5

Lowest and highest coverage of the 95% confidence intervals around the ICC estimations over all conditions of n, k, r and v (left and middle panel k = 2–6, right panel k = 4–6)

The coverage of the ICCconsistency is very similar when one or two raters deviate, compared to the situation when no rater deviates. However, when one of the raters deviates, the lowest coverage of the 95% confidence intervals around the ICCone-way estimation decreases (i.e., under-coverage) and the highest coverage increases (i.e., over-coverage) (Fig. 5, middle panel), while this change in coverage disappears again when two raters deviate (Fig. 5, right panel). Note that in this latter scenario more than three raters are always involved. Furthermore, the ICCagreement showed over-coverage when one or two raters systematically deviated from the other raters, as both the lowest and the highest value for the coverage of the 95% confidence intervals around the ICCagreement increase (Fig. 5, middle and right panels). A coverage of 1 means that the population ICC always fell within the 95% confidence interval of the ICC estimation. This was due to the fact that the width of the confidence intervals around these estimations was very large, i.e., a confidence interval width of around 1.
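For readers who want to verify coverage themselves, a minimal sketch follows, assuming the standard exact F-based confidence interval for the one-way ICC and the same compound-symmetry generating model as in the sketch above, with no deviating rater; adding a deviating rater to the generating model can be used to explore the under- and over-coverage described here.

```python
import numpy as np
from scipy.stats import f as f_dist

def icc_oneway_ci(x, alpha=0.05):
    """One-way ICC with its exact F-based (1 - alpha) confidence interval."""
    n, k = x.shape
    rows = x.mean(axis=1)
    msb = k * np.sum((rows - x.mean()) ** 2) / (n - 1)
    msw = np.sum((x - rows[:, None]) ** 2) / (n * (k - 1))
    fobs = msb / msw
    fl = fobs / f_dist.ppf(1 - alpha / 2, n - 1, n * (k - 1))
    fu = fobs * f_dist.ppf(1 - alpha / 2, n * (k - 1), n - 1)
    icc = (msb - msw) / (msb + (k - 1) * msw)
    return icc, (fl - 1) / (fl + k - 1), (fu - 1) / (fu + k - 1)

rng = np.random.default_rng(7)
n, k, r, v, n_sim, hits = 30, 3, 0.7, 1.0, 2000, 0
for _ in range(n_sim):
    x = rng.normal(0, np.sqrt(r * v), (n, 1)) + rng.normal(0, np.sqrt((1 - r) * v), (n, k))
    _, lo, hi = icc_oneway_ci(x)
    hits += lo <= r <= hi
print(f"coverage ~ {hits / n_sim:.3f}")  # should be close to 0.95 with no deviating rater
```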

Influence of various conditions on the width of the 95% confidence intervals of ICCs

When no rater deviates, the width of the 95% CI around the ICC is the same for the different variances (v) and the different ICC methods (one-way, agreement, or consistency). However, the correlation r does impact the width of the 95% CI: an increase in r leads to a decrease in width (i.e., narrower confidence intervals) (Fig. 6). This means that when we expect the ICC to be 0.7 (i.e., we assume the measurements will be correlated at 0.7), a larger sample size is required to estimate the ICC with the same precision than when we expect the ICC to be 0.8.
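The direction of this effect can also be seen in the standard large-sample approximation for the width of the CI of a one-way ICC (a heuristic, not the method used in our simulations):

$$ w \;\approx\; 2\, z_{1-\alpha/2} \sqrt{\frac{2\,(1-\rho)^2 \bigl(1+(k-1)\rho\bigr)^2}{k\,(k-1)\,(n-1)}} $$

For the range of correlations studied here (ρ ≥ 0.6), the product (1 − ρ)(1 + (k − 1)ρ) decreases as ρ increases, so a higher expected ICC gives a narrower interval at the same n and k.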

Fig. 6

95% confidence interval width for the ICC for r = 0.6, 0.7 or 0.8 (v = 1, no raters deviate; averaged over the three ICC models)

When one rater deviates, the width of the 95% CI does not change for the ICCconsistency, but it does increase for ICCagreement, and even more for ICCone-way (see Fig. 7).

Fig. 7

95% confidence interval width for the three ICC models, when one rater deviates (r = 0.8, v = 1)

The width of the 95% CI around the ICC estimation for specific conditions can be used to determine the optimal trade-off between the sample size of patients and the number of repeated measurements in these situations. In Fig. 6 (where we show results averaged over the three effects models) we can see that, in the situation that no rater deviates, v = 1, and we wish to estimate an ICC for three raters (k = 3), we need between 40 and 50 patients to obtain a CI width around the point estimate of 0.3 (i.e., ±0.15) when r = 0.6 (Fig. 6, left panel). If r is 0.7, then 30 patients are enough to reach the same precision (Fig. 6, middle panel), while if r = 0.8, 20 patients are sufficient (Fig. 6, right panel). When one of the raters deviates, the chosen ICC method impacts the 95% CI width, in addition to r (Fig. 7). To reach a 95% CI width of 0.3 around the point estimate when r = 0.8 and v = 1, the sample size should be increased to 40 for the ICCagreement, while the ICCone-way would require a sample size of 50.

Influence of various conditions on the width of the 95% confidence intervals of SEMs

The CI width for the SEM estimation decreases as r increases (Fig. 8), similar to the ICC. However, in general, the width for the SEM was smaller than for the ICC (Fig. 6).

Fig. 8

95% confidence interval width for the SEM for r = 0.6, 0.7 or 0.8 (averaged over the SEM models, v = 1, no raters deviate)

When one rater deviates, the width of the 95% CI does not change for the SEMconsistency, but it does increase for the SEMagreement and SEMone-way (see Fig. 9). In general, the width of the 95% CI is smaller for the SEM than for the ICC. This means that, under the same conditions, the SEM can generally be estimated with more precision than the ICC.

Fig. 9

95% confidence interval width for the three SEM models, when one rater deviates (v = 1, r = 0.8)

Online application that shows the implications for decisions about the sample sizes in reliability studies

As shown in the results of our simulation study, sample size recommendations are dependent on the specific conditions of the study design at hand. Therefore, based on these simulation studies, we have created a Sample size decision assistant that is freely available as an online application to inform the choice about the sample size and number of repeated measurements in a reliability study.

The Sample size decision assistant shows the implications of decisions about the study design on the power of the study, using any of the three procedures described in the methods section (i.e., the width of the confidence interval (CI width) procedure, the CI lower limit procedure, and the MSE ratio procedure). Each procedure requires some assumptions about the study design as input, as described in Table 3. When you choose either the CI lower limit procedure or the MSE ratio procedure, you are asked to indicate the target design. The target design is the intended sample size of patients or number of repeated measurements (e.g., raters), decided upon at the start of the study. For the MSE ratio procedure you are also asked to indicate the adapted design, which refers to the number of patients or repeated measurements in the new design, e.g., the numbers included in the study so far. For both procedures you are asked to indicate the target width of the 95% CI of the parameter of interest. The width depends on the unit of measurement. As the range of the ICC is always between 0 and 1, the range of the target width is fixed, and its default in the online application is set at 0.3. However, the SEM depends on the unit of measurement and changes across conditions of v. Therefore, in the online application, the range for the target width of the 95% CI of the SEM changes across conditions of v, and various default settings are used.

Table 3 Choices and assumptions per approach that are available in the online application (https://iriseekhout.shinyapps.io/ICCpower/)

In the design phase of a study, before data collection has started, two approaches can be used. For example, to obtain a sample size recommendation for the ICCagreement with the CI width procedure, we need to make assumptions about the correlation between the repeated measurements, the presence of a systematic difference, and the expected variance of the scores. Suppose we assume that the measurements will be correlated at 0.8 (in other words, we expect to find an ICC of 0.8), that there is no systematic difference between the measurements (e.g., the raters), and that the expected variance of the scores is 10. Based on this information, we get an overview as shown in Fig. 10.

Fig. 10

Screenshot of the results of the expected width of the 95% confidence interval of the ICCagreement per sample size and rater combination, based on the CI width procedure under the expected conditions r = 0.8, v = 10, and no systematic difference between raters

By hovering over the different blocks in the online application, we can easily see the consequence for the width of the CI around the estimated ICC of adding an extra rater or including more patients. For example, when we use 3 raters and 20 patients, the estimated width of the CI around the ICC estimation is 0.293; when k = 2 and n = 30 the CI width is 0.278; and when k = 2 and n = 25 the width is 0.33. In the online application this information automatically pops up.

If we compare the results for various conditions in the application, we see that the impact on the sample size recommendations of whether or not a systematic difference exists is much larger than the impact of different values of the variance of the scores, specifically in the one-way random effects model and the two-way random effects model for agreement.

The second procedure that can be used in the design phase is the CI lower limit procedure. This procedure was developed by Zou for the ICCone-way. Note that this procedure may lead to an overestimation of the required sample size for ICCs based on a two-way effects model (see results, and Donner and Eliasziw (1987)). An example of the use of this procedure: if we expect the ICC to be 0.8 and we accept a lower CI limit of 0.65, the adequate sample size is given for the number of repeated measurements that will be collected (see Fig. 11). For example, for k = 3, a sample size of 40 is appropriate (under the given conditions). As this procedure is based on a formula, it can be used beyond the conditions chosen in the simulated data.
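As an illustration of this type of closed-form calculation, the sketch below combines the standard large-sample variance of the one-way ICC with an assurance probability; we assume a two-sided 95% CI and 80% assurance. The application implements Zou's procedure itself, so its recommendation for this example (n = 40 for k = 3) need not coincide with this rough approximation.

```python
from math import ceil
from scipy.stats import norm

def n_for_lower_limit(rho, rho_low, k, alpha=0.05, assurance=0.80):
    """Smallest n such that the lower limit of the two-sided (1 - alpha) CI of the
    one-way ICC is expected to exceed rho_low with the given assurance probability
    (normal approximation to the sampling distribution of the ICC)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(assurance)
    var_factor = 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2 / (k * (k - 1))
    return ceil(1 + z ** 2 * var_factor / (rho - rho_low) ** 2)

print(n_for_lower_limit(rho=0.8, rho_low=0.65, k=3))  # sample size under these assumptions
```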

Fig. 11

Recommendations for n and k using the CI lower limit procedure (r = 0.8, acceptable lower bound of CI is 0.65, ICCone-way)

The third procedure, the MSE ratio procedure, is most suitable when we have started data collection and realize that the target design cannot be reached. In that case we want to know how an adapted design compares to the target design that was described in the study protocol. Suppose that patients were observed in clinical practice and scored by three raters (k = 3) at (about) the same time, and that we envisioned 50 patients (i.e., the target design). The number of raters cannot be changed anymore, as patients may have changed on the construct measured, or it is logistically impossible to invite the same patients back for another measurement. Based on the results of previous studies, or by running preliminary analyses on the data collected within this study, we can make assumptions about the expected correlation between the raters (i.e., the repeated measurements; e.g., 0.8), whether we expect one of these raters to systematically deviate from the others (e.g., no), and the expected variance of the scores (e.g., 10). Suppose we have collected data of three raters that each measured 25 patients; this is our adapted design. Now we can see how much the 95% CI will widen if we do not continue collecting data until we have included 50 patients (i.e., the target design) (Fig. 12). The width of the 95% CI will increase from approximately 0.2, which we would have had if we had measured 50 patients three times (i.e., the target design), to approximately 0.3 in the adapted design.

Fig. 12

Screenshot of the expected difference in the width of the 95% confidence interval of the ICC between the adapted design (k = 3, n = 25) and the target design (k = 3, n = 50), based on the MSE ratio procedure

Another way to use this method is to see how much one of the two variables, n or k, should increase to preserve the same level of precision as in the target design. For example, in the target design, 3 raters would assess 25 patients. As one of the raters dropped out, there are only 2 raters in the adapted design. The MSE ratio in this scenario was 1.43. To achieve the same level of precision in the adapted design with 2 raters as in the target design (n = 25, k = 3), the sample size should be increased by a factor of 1.43, resulting in a sample size of n = 36.
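The arithmetic of this last step is simple enough to check directly; a small sketch, taking the MSE ratio of 1.43 from the application as given:

```python
from math import ceil

mse_ratio = 1.43  # MSE(adapted design, k = 2) / MSE(target design, k = 3)
n_target = 25     # patients in the target design
print(ceil(n_target * mse_ratio))  # -> 36 patients needed to restore the target precision
```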
