Serological tests are commonly used diagnostic tools across a broad range of medical fields, spanning from infectiology1 2 to autoimmunity,3 4 oncology5 and transplantation medicine.6 They also play a critical role in animal disease surveillance.7 However, many serological tests come with acceptable but imperfect sensitivities and specificities. Tests with specificities slightly above 90% are considered good8 or even highly specific.5 At low seroprevalence rates, however, every single per cent counts: if the frequency of a given disease in the tested population is only 5%, a specificity of 90% would mean that, even at a sensitivity of 100%, 5 true positives per 100 individuals tested would be matched by about 10 false positives. Thus, the probability that an individual with a positive test is a true positive (positive predictive value, PPV) would be only about 33%.
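The arithmetic behind this example can be sketched in a few lines of Python. This is only an illustration: the function name `positive_predictive_value` is hypothetical, and the calculation uses the unrounded 9.5 false positives per 100 tested, so it yields roughly 34.5% rather than the rounded 33% quoted in the text.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """PPV = true positives / all positives, from population fractions."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * (1.0 - specificity)
    return true_pos / (true_pos + false_pos)


# Scenario from the text: 5% seroprevalence, 100% sensitivity, 90% specificity.
ppv = positive_predictive_value(prevalence=0.05, sensitivity=1.00, specificity=0.90)
print(f"PPV: {ppv:.1%}")
```

Raising the specificity argument while keeping the prevalence fixed shows how quickly the PPV recovers, which is why specificity dominates at low seroprevalence.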
During the early phase of the SARS-CoV-2 pandemic, seroprevalences were far below 1%.9 Therefore, highly specific test systems were necessary (>99.5%) to provide good PPVs.10 Sensitivity and specificity can be seen as communicating vessels—the improvement of one is usually at the expense of the other.11 Consequently, test systems adjusted by the manufacturers to very high specificities (>99%) showed moderate sensitivity. This problem was particularly evident when non-hospitalised patients were included in the cohorts studied.12–14 To further increase specificity at very low seroprevalence levels, various methods have been proposed, for example, raising the thresholds for positivity or confirming a positive result with a second test (OTA; orthogonal testing).11 15 16 Unfortunately, these specificity improvement strategies inevitably lead to a further reduction of the previously moderate sensitivities.
As the pandemic progressed, the problem became more pronounced as antibodies declined, and sophisticated statistical models were required to compensate for the waning sensitivity.17 In the case of SARS-CoV-2, as with any evolving pandemic, increasing seroprevalence rates worldwide have attenuated the need for higher specificity.
However, the problem persists in non-epidemic diseases, where seroprevalence remains low. Moreover, each new pandemic also begins with extremely low seroprevalence rates, so better diagnostic strategies in infectious disease serology should be ready in advance.
In the present work, we propose for the first time an orthogonal test algorithm based on real-life data for the SARS-CoV-2 antibody tests of Roche, Abbott, DiaSorin, and two commercial SARS-CoV-2 ELISAs18 intending to maximise both specificity and sensitivity at the same time. Although our algorithm follows a general principle, it was developed based on SARS-CoV-2 antibody tests. The SARS-CoV-2 pandemic provided a unique opportunity in this regard, as historical samples from before the pandemic are negative by definition (thus allowing accurate specificity testing). In addition, sufficient PCR-confirmed positive cases were available quickly, ensuring a sophisticated and reliable sensitivity verification. Thus, for SARS-CoV-2—in contrast to most other circulating microorganisms—a realistic and unusually accurate estimation of specificities and sensitivities of serological tests was possible. We benefited from this advantage to develop our sophisticated diagnostic algorithm.
Methods
Study design and cohorts
Sera used in this non-blinded prospective cross-sectional study were either residual clinical specimens or samples stored in the MedUni Wien Biobank (n=1181), a facility specialised in the preservation and storage of human biomaterial, which operates within a certified quality management system (ISO 9001:2015).19
For derivation of the SIT2 algorithm, sample sets from individuals known to be negative and positive were established for testing. As previously described,20 samples collected before 1 January 2020 (ie, assumed SARS-CoV-2 negative) were used as a specificity cohort (n=1117): a cross-section of the Viennese population (the LEAD (Lung, hEart, sociAl, boDy) study),21 preselected for samples collected between November and April to enrich for seasonal infections (n=494); a collection of healthy voluntary donors (n=265); a disease-specific collection of samples from patients with rheumatic diseases (n=358) (see also online supplemental tables S1 and S2).
The SARS-CoV-2-positive cohort (n=64 samples from n=64 individuals) included patients testing positive with reverse transcription PCR (RT-PCR) during the first wave and their close, symptomatic contacts. Of this cohort, five individuals were asymptomatic, 42 had mild–moderate symptoms, 4 reported severe symptoms and 13 were admitted to the intensive care unit. The timing of symptom onset was determined by a questionnaire for convalescent donors and by reviewing individual health records for patients and was in median 41 (26, 25–49) days. For asymptomatic donors (n=5), SARS-CoV-2 RT-PCR confirmation time was used instead (for more details, see online supplemental tables S1 and S3). All included participants gave written informed consent to donate their samples for research purposes. The overall evaluation plan conformed with the Declaration of Helsinki as well as relevant regulatory requirements.
For validation of the SIT² algorithm, we used data from an independent UK cohort,22 including 1512 serum/plasma samples (536 PCR-confirmed SARS-CoV-2-positive cases; 976 negative cases, collected earlier than 2017).
Antibody testing
For the derivation analyses, SARS-CoV-2 antibodies were measured either according to the manufacturers’ instructions on three different commercially available automated platforms (Roche Elecsys SARS-CoV-2 (total antibody assay detecting IgG, IgM and IgA antibodies against the viral nucleocapsid, further referred to as Roche NC), Abbott SARS-CoV-2-IgG assay (nucleocapsid IgG assay, Abbott NC), DiaSorin LIAISON SARS-CoV-2 S1/S2 assay (S1/S2 combination antigen IgG assay, DiaSorin S1/S2)) or using 96-well ELISAs (Technoclone Technozym RBD and Technozym NP) yielding quantitative results18 (for details, see online supplemental methods). The antibody assays used in the validation cohort were Abbott NC, DiaSorin S1/S2, Roche NC, Siemens RBD total antibody, and a novel 384-well trimeric spike protein ELISA (Oxford Immunoassay),22 resulting in 20 evaluable combinations. All samples from the Austrian SARS-CoV-2-positive cohort also underwent live virus neutralisation testing (VNT), and neutralisation titres (NT) were calculated, as described in detail in online supplemental methods.
Sensitivity-improved two-test method
Our newly developed sensitivity-improved two-test (SIT²) method consists of the following key components: (1) sensitivity improvement by cut-off modification and (2) specificity rescue by a second, confirmatory test (figure 1A).
Figure 1. (A) The sensitivity improved two-test (SIT2) algorithm includes sensitivity improvement by adapted cut-offs and a subsequent specificity rescue by re-testing all samples within the re-testing zone of the screening test by a confirmatory test. (B) Testing algorithm for SIT2 using a screening test on an automated platform (ECLIA/Roche, CMIA/Abbott, CLIA/DiaSorin) and a confirmation test, either on one of the remaining platforms or tested by means of ELISA (Technozym RBD, NP). (C) All test results between a reduced cut-off suggested by the literature, and a higher cut-off, above which no more false positives were observed, were subject to confirmation testing. **Results between 12.0 and 15.0, which are considered borderline according to the manufacturer, were treated as positives; ***suggested as a cut-off for seroprevalence testing; ****determined by in-house modelling; ¹see ref. 23; ²see ref. 24; ³see ref. 25.
For the first component of the SIT² algorithm, positivity thresholds were optimised for sensitivity according to the first published alternative thresholds for the respective assays, calculated, for example, by ROC (receiver operating characteristic)-analysis.23–25 Additionally, a high cut-off, above which a result can be reliably regarded as true positive without the need for further confirmation, was defined. These levels were based on in-house observations20 and represent those values (including a safety margin) above which no more false positives were found. The highest results seen in false positives were 1.800 COI, 2.86 Index and 104.0 AU/mL, respectively. Hence, we defined the high cut-off for Roche and Abbott as 3.00 COI/Index and for DiaSorin as 150.0 AU/mL. The lowering of positivity thresholds improves sensitivity; the high cut-off prevents unnecessary retesting of clearly positive samples. Moreover, the high cut-off avoids possible erroneous exclusion by the confirmatory test. The newly defined interval between the reduced threshold for positivity and the high cut-off is the retesting zone (figure 1A). The initial antibody test (screening test) is then followed by a confirmatory test, whereby positive samples from the retesting zone of the screening test are retested. Also, for the confirmatory test, sensitivity-adapted assay thresholds are needed (figure 1A,B). As false-positive samples are usually only positive in one test system (Fig. S1), false positives can be identified, and specificity markedly restored with minimal additional testing as most samples do not fall within the retesting zone.16 20 A flowchart of the testing strategy and the applied cut-off levels and their associated quality criteria are presented in figure 1B,C.
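The decision logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Cutoffs`, `SCREENING_CUTOFFS` and `sit2_classify` are hypothetical, the reduced thresholds are those reported in the Results (0.165 COI, 0.55 Index, 9 AU/mL), and treating values exactly at a threshold as positive is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cutoffs:
    low: float   # reduced, sensitivity-adapted positivity threshold
    high: float  # high cut-off: positive without confirmation


# Thresholds quoted in the text (units: COI, Index, AU/mL).
SCREENING_CUTOFFS = {
    "Roche NC": Cutoffs(low=0.165, high=3.00),
    "Abbott NC": Cutoffs(low=0.55, high=3.00),
    "DiaSorin S1/S2": Cutoffs(low=9.0, high=150.0),
}


def sit2_classify(screen_value: float, screen: Cutoffs,
                  confirm_value: Optional[float] = None,
                  confirm_low: Optional[float] = None) -> str:
    """Classify one sample with the two-step scheme.

    Boundary handling (>=) is an assumption, not taken from the source.
    """
    if screen_value < screen.low:
        return "negative"            # below the reduced threshold
    if screen_value >= screen.high:
        return "positive"            # clearly positive, no retest needed
    # Retesting zone: the confirmatory test decides.
    if confirm_value is None or confirm_low is None:
        return "needs confirmation"
    return "positive" if confirm_value >= confirm_low else "negative"


roche = SCREENING_CUTOFFS["Roche NC"]
abbott_low = SCREENING_CUTOFFS["Abbott NC"].low
print(sit2_classify(5.0, roche))                   # above the high cut-off
print(sit2_classify(0.5, roche))                   # falls in the retesting zone
print(sit2_classify(0.5, roche, 0.8, abbott_low))  # confirmed by second test
print(sit2_classify(0.5, roche, 0.2, abbott_low))  # rejected by second test
```

Only samples in the retesting zone ever reach the second assay, which is why the additional testing burden stays small.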
Test strategy evaluation
On the derivation cohort, we compared the overall performance of the following SARS-CoV-2 antibody testing strategies: (1) testing using single assays, (2) simple lowering of thresholds, (3) classical OTA and (4) our newly developed SIT2 algorithm at assumed seroprevalences of 5% and 20%. As part of the derivation, we then compared the performance of OTAs and SIT2 against the results of a virus neutralisation assay. On the validation cohort, we then compared the performance of OTAs and SIT2. Finally, we used data from this cohort to evaluate the performance of SIT2 versus single tests at seroprevalences of 5%, 10%, 20% and 50% if the Abbott and DiaSorin assays (ie, assays with varying degrees of discrepancies in sensitivity and specificity) were used.
Statistical analysis
Unless otherwise indicated, categorical data are given as counts (percentages), and continuous data are presented as median (IQR). Total test errors were compared by Mann-Whitney tests or, in case they were paired, by Wilcoxon tests. 95% CIs for sensitivities and specificities were calculated according to Wilson; 95% CIs for predictive values were computed according to Mercaldo-Wald unless otherwise indicated. Sensitivities and specificities were compared using z-scores. P values <0.05 were considered statistically significant. All calculations were performed using Analyse-it V.5.66 (Analyse-it Software, Leeds, UK) and MedCalc V.19.6 (MedCalc bvba, Ostend, Belgium). Graphs were drawn using Microsoft Visio (Redmond, USA) and GraphPad Prism V.7.0 (La Jolla, USA).
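For readers unfamiliar with the Wilson interval mentioned above, a minimal sketch is given below. It implements the standard Wilson score interval without continuity correction; whether the statistical packages named in the text apply exactly this variant is an assumption, and the function name `wilson_interval` is illustrative. The example input (57 of 64 positives detected) corresponds to the Roche NC single-test result reported in the Results.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Two-sided Wilson score interval for a binomial proportion.

    z defaults to the 97.5% standard normal quantile (95% interval).
    No continuity correction is applied (an assumption of this sketch).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Example: 57 of 64 PCR-confirmed samples detected (7 false negatives).
lower, upper = wilson_interval(57, 64)
print(f"sensitivity {57 / 64:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

Unlike the simple Wald interval, the Wilson interval stays within 0 and 1 and behaves sensibly for proportions close to 100%, which is the typical situation for specificity estimates.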
Results
In the derivation cohort of 1117 prepandemic sera and 64 sera from convalescent patients with COVID-19 (80% non-hospitalised, 20% hospitalised), the Roche NC, Abbott NC and DiaSorin S1/S2 antibody assays gave rise to 7/64, 10/64 and 11/64 false-negative, as well as to 3/1117, 9/1117 and 20/1117 false-positive results. Assuming a seroprevalence of 20%, this led to 2180, 3120 and 3440 false-negative results per 100 000 tests, and 240, 650 and 1440 false-positive results per 100 000 tests, respectively (figure 2A, right panel).
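The scaling from cohort counts to errors per 100 000 tests at an assumed seroprevalence can be sketched as below. The function name `errors_per_100k` is hypothetical. Because the sketch uses exact fractions, its output differs slightly from the reported figures (for Roche NC, roughly 2190 and 215 instead of 2180 and 240); the difference presumably arises because percentages were rounded before scaling, which is an assumption rather than something stated in the text.

```python
def errors_per_100k(false_neg, n_pos, false_pos, n_neg, prevalence, tests=100_000):
    """Scale observed error counts to a population with a given seroprevalence.

    Returns (false negatives, false positives, total error) per `tests` tests.
    """
    fn_rate = false_neg / n_pos      # 1 - sensitivity
    fp_rate = false_pos / n_neg      # 1 - specificity
    fn_scaled = fn_rate * prevalence * tests
    fp_scaled = fp_rate * (1.0 - prevalence) * tests
    return fn_scaled, fp_scaled, fn_scaled + fp_scaled


# Roche NC in the derivation cohort: 7/64 false negatives, 3/1117 false positives.
fn, fp, total = errors_per_100k(7, 64, 3, 1117, prevalence=0.20)
print(f"FN {fn:.0f}, FP {fp:.0f}, total {total:.0f} per 100 000 tests")
```

The same function with `prevalence=0.05` shows how the balance shifts toward false positives when few tested individuals are truly positive.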
Figure 2. False-positives (FP)/false-negatives (FN) (A) and total error (B) of single tests, tests with reduced thresholds according to references 23–25, orthogonal testing algorithms (OTAs) and the sensitivity improved two-test (SIT2) algorithm at 5% and 20% estimated seroprevalence. Data in (B) were compared by Mann-Whitney tests (unpaired) or Wilcoxon tests (paired). *p<0.05; **p<0.01; ***p<0.001.
Effects of threshold lowering on sensitivity and specificity
Lowering the positivity thresholds for the Roche NC, Abbott NC and DiaSorin S1/S2 to 0.165 COI, 0.55 Index and 9 AU/mL increased the sensitivity significantly and reduced false-negative results to 1/64, 2/64 and 7/64 (320, 620 and 2180 per 100 000 tests at a seroprevalence of 20%), but substantially increased false-positive results to 18/1117, 27/1117 and 31/1117, respectively (1280, 1920 and 2240 per 100 000 tests, at an assumed seroprevalence of 20%; online supplemental table S4, figure 2A, right panel).
Classical OTA compared with SIT2
Subsequently, we evaluated 12 OTA combinations using the fully automated SARS-CoV-2 antibody tests from Roche NC, Abbott NC and DiaSorin S1/S2 as screening tests, each combined with one of the other fully automated assays or a commercially available NC or RBD-specific ELISA as a confirmation test. Combining these tests as classical OTAs significantly increased specificity and reduced false positives to 0 (0–1)/1117. However, the rate of false negatives was 14 (12–16)/64 (1095 (955–1230) per 100 000 tests at 5% seroprevalence, or roughly 4380 per 100 000 tests at 20% seroprevalence) and, therefore, considerably higher than for single testing strategies. In contrast, the SIT2 algorithm minimised false positives to 0 (0–2)/1117 (0 (0–140) per 100 000 tests at 20% seroprevalence) while also reducing false negatives to 5 (3–8)/64 (1560 (940–2420) per 100 000 tests at 20% seroprevalence, figure 2A right panel; online supplemental table S5).
Reduction of total error rates by the SIT2
Of all the methods assessed, SIT2 reached the lowest total error rates per 100 000 tests under both 5% and 20% assumed seroprevalence (455 (235–685) and 1600 (940–2490) per 100 000 tests) (figure 2B). At a seroprevalence of 5%, OTA on average performed better than individual tests, and the total error rates of the single tests were higher for the Abbott NC and DiaSorin S1/S2 assay (OTA 1095 (955–1325) vs 830 (Roche NC), 1540 (Abbott NC) and 2570 (DiaSorin S1/S2) per 100 000 tests). However, at a seroprevalence of 20%, the performance of OTAs worsened compared with single tests (OTA 4380 (3820–5000) vs 1600 (Roche), 2540 (Abbott) and 4420 (DiaSorin) per 100 000 tests) (figure 2B). Therefore, at both 5% and 20% seroprevalence, SIT2 resulted in the lowest overall errors. Compared with OTAs, SIT2 yielded a similar improvement in specificity while not suffering from the significant sensitivity reduction (online supplemental figure S2). Since the better overall performance of SIT2 compared with OTAs was not due to increased specificity but improved sensitivity, we subsequently set out to examine these differences in more detail.
Sensitivities of single tests, OTA and SIT2 in relation to NT
Next, we compared the sensitivities of the three screening tests as single tests and in both two-test methods (OTA and SIT2), benchmarking them using the Austrian sensitivity cohort (n=64) simultaneously evaluated with an authentic SARS-CoV-2 VNT. Regardless of the screening test used (Roche NC, Abbott NC, or DiaSorin S1/S2), OTAs had lower sensitivities than single tests (80.5% (78.5–83.6), 78.1% (75.8–82.8) or 75.8% (71.5–78.9) vs 89.1%, 84.4% or 82.8%, respectively), and SIT2 showed the best sensitivities of all methods (95.3% (93.0–96.5), 93.8% (92.2–96.5) or 87.5% (85.1–88.7)) (figure 3). SIT2 algorithms, including the Roche NC and Abbott NC assays, achieved similar or even higher sensitivities than VNT (figure 3, VNT reference line), made possible by the unique retesting zone of SIT2 (Fig. S3).
Figure 3. Sensitivities of single tests, orthogonal testing algorithms (OTAs) and the sensitivity improved two-test (SIT2) algorithm. The dotted line indicates the sensitivity of the virus neutralisation test (VNT).
Validation of the SIT2 using an independent cohort
To confirm the improved sensitivity of SIT2 compared with OTA, we analysed the sensitivities of OTAs and SIT2 in an independent validation cohort of 976 prepandemic samples and 536 post-COVID samples. Out of 20 combinations using the assays Roche NC (total antibody), Abbott NC (IgG), DiaSorin S1/S2 (IgG), Siemens RBD (total antibody) and Oxford trimeric-S (IgG), a statistically significant improvement in sensitivities over OTAs was shown for SIT2 in 18 combinations (figure 4). For the remaining two combinations (Siemens RBD with Oxford trimeric-S and vice versa), the performance was comparable, but no statistically significant improvement could be achieved, owing to the high pre-existing sensitivities of these assays in this particular sample cohort.
Figure 4. Differences in sensitivity and specificity (mean±95% CI) between the sensitivity improved two-test (SIT2) algorithm and standard orthogonal testing algorithms (OTAs) within the UK validation cohort. *p<0.05; **p<0.01; ***p<0.001; ****p<0.0001.
To further illustrate the effect of SIT2 on the outcome of SARS-CoV-2 antibody testing, we compared single testing versus SIT2 with the Abbott and DiaSorin assays at varying assumed seroprevalences (5, 10, 20 and 50%), given that the Abbott NC assay is a highly specific (99.9%), but moderately sensitive test (92.7%), and the DiaSorin S1/S2 assay has the most limited specificity (98.7%) of all evaluated assays but an acceptable sensitivity (96.3%). Regardless of whether a lack of specificity (DiaSorin S1/S2) or sensitivity (Abbott NC) had to be compensated for, SIT2 improved the overall error rate compared with the individual tests in all four combinations and at all four assumed seroprevalence levels (figure 5).
Figure 5. Comparison of false-positives (FP), false-negatives (FN) and total error (TE) for two selected test systems, (A) Abbott and (B) DiaSorin, between different sensitivity improved two-test (SIT2) combinations and the respective single test within the UK validation cohort for different estimated seroprevalences.
Discussion
Serology is a commonly used, multi-purpose analytical method.1–6 However, not all serologic assays have appropriate sensitivities and specificities, especially in low-prevalence settings. The SARS-CoV-2 pandemic prompted the simultaneous development of several antibody tests and, which is otherwise rare, allowed these tests to be evaluated with both confirmed positive and negative cases, the latter derived from biobank collections established before the virus emerged. In the case of SARS-CoV-2, false-positive samples are usually not simultaneously reactive in different test systems.16 20 This led to the hypothesis that a two-test approach with reduced positivity thresholds in both the screening and the confirmation test would increase specificity without impairing sensitivity. A further improvement in sensitivity was possible by defining a high cut-off for the screening test, above which, due to the excellent reliability of high test results, no further confirmation (and, thereby, a possible false-negative result in the confirmation test) was necessary.
In the early waves of the SARS-CoV-2 pandemic, many commercially available SARS-CoV-2 antibody tests did not provide sufficient specificity to achieve acceptable PPVs, for example, at a seroprevalence rate of 1–5%.15 20 Lowering positivity thresholds might improve test sensitivity,23–25 and conventional orthogonal testing can maximise specificity.11 26 27 The latter might increase the PPV, but the PPV will only be relevantly increased at low seroprevalences. However, since seroprevalence is often unknown and varies widely from region to region, it is difficult to judge whether a less specific or a less sensitive test is the lesser of two evils.
Based on actual data related to SARS-CoV-2, we propose a new, universally adaptable two-test system that could, in the case of SARS-CoV-2, perform better than any other known approach regardless of the actual seroprevalence: the sensitivity-improved two-test or SIT2. For this, we established the algorithm in our COVID-19 cohort (including 1181 samples, 1117 prepandemic negative and 64 confirmed post-COVID-positive samples) and validated it in a completely independent UK cohort (including 1512 samples, 976 negatives and 536 positives). Thus, the associations found were related exclusively neither to a particular cohort nor to the analysing institutions. All Austrian cohort samples were tested with the following assays: Roche, Abbott, DiaSorin S1/S2, Technozym RBD and Technozym NP. The UK cohort we used for validation included a complete data set of all samples analysed with the Roche, Abbott, DiaSorin S1/S2, Siemens and Oxford assays. Hence, the Austrian and the UK cohorts shared three test systems (Roche, Abbott and DiaSorin S1/S2) but differed regarding specific characteristics of the included negative and positive samples. Besides these three overlapping test systems, each cohort included data from two further SARS-CoV-2 antibody assays that were exclusive to that cohort. The use of these different combinations should underscore the universality of SIT².
Its generalisability can be inferred in more detail from the following features: (1) the adapted cut-offs used to optimise sensitivity were determined in various independent studies and were not explicitly calculated for our cohort,23–25 (2) SIT2 was effective, although with different efficiencies, in a total of 32 different test combinations and (3) SIT2 was successfully validated in an independent cohort, which was profoundly different from the derivation cohort. The robustness of a diagnostic algorithm regarding analytical variability (lot-to-lot variability, instrument-dependent variability or method-specific confounders) is essential. Based on our study design with three overlapping assays (Roche, Abbott and DiaSorin) tested at two sites with two different cohorts but without lot matching, we did not find any adverse effects of these potential confounders on the robustness of our algorithm. Moreover, we estimated the robustness of SIT² to between-lot variability by simulating how the algorithm’s performance would change if results varied according to their respective reference change values. For this, we used an SIT² algorithm consisting of Roche and Abbott as an example and concluded that the expected between-assay variability might affect the algorithm only marginally (data not shown). Therefore, SIT2 does not require a particular infrastructure, the availability of high-performance individual test systems or specific reagent lots to work but can optimise the performance of any available test system.
Our SIT2 strategy can rescue the specificity with minimal repeat testing required (see online supplemental table S6). For example, when applying the Roche NC as a screening test to our cohort, only 27 out of 1181 samples needed confirmation testing with the Abbott NC test to correctly identify 62/64 true positives. Simultaneously, all false-positive results were eliminated, including those added by lowering the cut-offs (online supplemental table S4 and figure S1). Additionally, this SIT2 combination was more sensitive than VNT, which identified only 60/64 clinical positives (figure 3). This result is not completely surprising, as it is known that not all patients who recovered from COVID-19 show detectable levels of neutralising antibodies.28 Nevertheless, it should be noted that although antibody binding assays may have a higher sensitivity than neutralisation assays, they only partially reflect the functional activity of SARS-CoV-2 reactive antibodies.29 30 The sensitivity of SARS-CoV-2 tests may change over time, as prominently shown in a Brazilian study, where pronounced antibody waning led to an apparent decrease in seroprevalence only a few months after a SARS-CoV-2 wave.17 However, this was mainly caused by the strongly decreasing sensitivity of the test system used. The measured seroprevalence decreased from 46.3% in June 2020 to only 20.7% in October 2020 when the standard manufacturer cut-off of 1.4 was used for the Abbott NC test. When the same data were analysed with a reduced cut-off of 0.4, the values changed from 54.3% in June to 44.6% in October, so the apparent decrease in seroprevalence was much less pronounced. Lowering the cut-off to increase the sensitivity of a test system (and, therefore, also to compensate for such time-dependent sensitivity losses) is the first step of our SIT2 algorithm.
As this cut-off lowering reduces the specificity of a test (with the 0.4 cut-off, the seroprevalence rate in June was 8 percentage points higher than with the 1.4 cut-off, which includes more false positives), it is necessary to rescue this loss of specificity with a second test (also made highly sensitive by cut-off lowering). This should illustrate that while there are test systems whose sensitivity changes more rapidly over time, and there is physiologically a time-dependent decrease in antibody levels, SIT2 offers a strategy to counteract this development with an increase in sensitivity by cut-off lowering and subsequent correction of specificity. Thus, these time-dependent sensitivity changes are not a significant problem for SIT2. Accordingly, there are far-reaching potential applications. Regarding SARS-CoV-2, on the one hand, the use of an algorithm of this kind could increase the reliability of seroprevalence analyses, especially in low-prevalence areas. On the other hand, its use in routine clinical diagnostics is also conceivable. In the case of SARS-CoV-2, the emergence of new viral variants particularly affects test sensitivity.31 This could be counteracted by increasing sensitivity through modified cut-offs, and specificity would subsequently be restored by a second test. For SARS-CoV-2 testing, it must be further emphasised that different mechanisms of immunisation induce different humoral responses: whereas an infection usually leads to both antinucleocapsid and antispike antibodies, the immune response to an mRNA-based, vector-based or protein-based vaccine that introduces only the spike protein lacks antinucleocapsid antibodies.32 Accordingly, among the vaccinated, tests assessing antispike antibodies might not be useful in detecting individuals after SARS-CoV-2 infection, as the measured amount would have at least partly been induced by the vaccine.
However, add-on infection could boost antispike levels.33 These conditions must be considered when searching for the optimal combination of tests for a SIT2 approach.
Our study has both strengths and limitations. One strength is the size of the cohorts examined, both in deriving the SIT2 algorithm (N=1181) and validating it (N=1512). The composition of our specificity cohort is also unique: it consists of three subcohorts with selection criteria to further challenge analytical specificity. The lower cut-offs used to increase sensitivity were not modelled within our data sets but were derived from ROC analysis data of independent studies.23–25 Furthermore, we were able to test the performance of the two-test systems in a total of 32 combinations, 12 in the derivation cohort and another 20 combinations in the validation cohort. As a limitation, in the Austrian cohort, only samples collected ≥14 days after symptom onset were included. Therefore, no conclusions on sensitivity during the early seroconversion phase can be drawn from these data. Furthermore, mild and asymptomatic cases were under-represented in the British cohort, perhaps leading to a higher observed sensitivity of the test systems. Moreover, the analysis included only samples collected during the first wave; therefore, positive individuals were most likely infected by the wildtype virus. However, as stated above, the emergence of new variants challenges a test system’s sensitivity even more, which only reinforces the need to increase sensitivity without harming specificity, as we propose here by using SIT2.
In conclusion, we describe the novel two-test algorithm SIT2, which makes it possible to maintain or even significantly improve sensitivity while approaching 100% specificity.