Reliability of [18F]FDG PET/CT for post-treatment surveillance of non-small cell lung cancer: agreement among multiple centers

Population and observers

This interobserver variability study utilized [18F]FDG PET/CT scans from patients with stage IA-IIIC NSCLC who had completed curative treatment. The scans were obtained approximately six months after treatment as part of scheduled surveillance, with no prior suspicion of recurrence. The patients were part of the PET/CT arm of the SUPE_R trial (ClinicalTrials.gov NCT03740126). Full details of the SUPE_R trial and primary results have been published previously [10].

Sample size was determined using the confidence interval approach of Donner and Rotondi [12], using the R package kappaSize [13]. Based on an expected κ of 0.73 [14], an acceptable lower confidence limit of 0.50, an expected marginal distribution of 80% benign scans, 10% equivocal, and 10% suspicious for recurrence, α = 0.05, and 80% power, a minimum of 147 subjects was required. One hundred and fifty scans from 150 patients (one scan per patient, obtained at the 6-month time point) were randomly selected. The selection was stratified by recurrence suspicion in the original report, ensuring that approximately 15% of the selected scans had findings suspicious for recurrence, matching the expected rate in the population. The 6-month surveillance time point was chosen because the risk of recurrence is highest during the first 12 months after curative treatment, with the hazard rate peaking at approximately 6 months [15].
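As an illustration, the calculation could be reproduced in R along the following lines; the exact kappaSize call used in the study is not reported, so the function (CI3Cats) and the argument values shown here are assumptions based on the figures above.

library(kappaSize)  # implements the Donner and Rotondi sample-size methods

# Three ordinal categories (benign / equivocal / suspicious), two reviewing
# teams per scan, expected kappa 0.73, acceptable lower confidence limit 0.50,
# expected marginal proportions 80% / 10% / 10%, alpha = 0.05.
CI3Cats(kappa0 = 0.73,
        kappaL = 0.50,
        props  = c(0.80, 0.10, 0.10),
        raters = 2,
        alpha  = 0.05)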

Nine teams from seven nuclear medicine departments across Denmark participated in the study. Each team consisted of a board-certified nuclear medicine physician and a board-certified radiologist, both with extensive experience in PET imaging and lung cancer evaluation. Two teams were randomly assigned to independently review each scan, with the constraint that no two teams from the same department reviewed the same scan and no team rated scans originating from their own department. This was done to ensure independent review and avoid recall bias.
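The allocation of teams to scans can be illustrated with a simple rejection-sampling sketch; the team-to-department mapping, object names, and sampling scheme below are hypothetical, as the study specifies only the constraints, not the allocation algorithm.

set.seed(1)

teams <- data.frame(team = 1:9,
                    dept = c(1, 1, 2, 3, 4, 5, 6, 7, 7))           # hypothetical mapping
scans <- data.frame(scan = 1:150,
                    origin_dept = sample(1:7, 150, replace = TRUE)) # hypothetical origins

assign_teams <- function(scan_dept) {
  repeat {
    pick <- teams[sample(nrow(teams), 2), ]
    ok <- pick$dept[1] != pick$dept[2] &&      # the two teams come from different departments
          all(pick$dept != scan_dept)          # neither team is from the originating department
    if (ok) return(pick$team)
  }
}

assignments <- t(vapply(scans$origin_dept, assign_teams, integer(2)))
head(assignments)  # two reviewing teams per scan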

The scans were centrally anonymized and assigned unique study IDs. For each scan, the corresponding baseline scan (obtained for staging prior to therapy) was also anonymized and made available for comparison. No other imaging was provided. The anonymized scans were distributed to the selected departments and loaded into their respective Picture Archiving and Communication Systems (PACS) used for routine clinical practice.

Scoring methods

Teams reviewed their assigned scans according to their usual practice, individually or side by side. The final rating for each team was based on a consensus reading between the two team members. Ratings were recorded in a REDCap database [16, 17], where selected patient characteristics, including stage, treatment, histology, age, and gender, were presented to the reviewers.

Each scan was scored in three steps. First, teams evaluated all abnormal findings by anatomical site, scoring them for both FDG uptake and suspicion of recurrence. FDG uptake was classified on a 4-point scale (normal, mildly increased, moderately increased, markedly increased), with teams free to use quantitative and qualitative assessment methods. Suspicion of recurrence was rated on a 5-point scale: benign, most likely benign, either benign or recurrence, most likely recurrence, or definite recurrence.

Next, teams assessed the overall suspicion of recurrence for each scan on a 3-point scale, considering both PET and CT findings equally. The three categories were: no suspicion of recurrence, equivocal for recurrence, or suspicious for recurrence. For this method, referred to as the “conventional assessment”, teams were instructed to apply their usual clinical criteria for diagnosing recurrence without pre-defined study-specific guidelines.

Finally, teams applied the Hopkins criteria for evaluation of lung cancer recurrence [18]. This approach involved a qualitative evaluation of FDG uptake in abnormal findings across three regions: primary tumor site, regional mediastinal nodes, and distant sites (including contralateral lung). For each region, teams assessed the FDG uptake using a 5-point scale, visually comparing the activity to the mediastinal blood pool and liver uptake (Table 1). Teams were instructed to disregard incidental findings likely unrelated to lung cancer. The final recurrence assessment was based on the highest score among all three regions, with scores of 4 or 5 considered positive for recurrence.

Table 1 Hopkins criteria. Each anatomical site (primary tumor, mediastinum and metastasis) is scored on a 5-point scale, and the final score is positive if any site score is above 3. FDG = Fluorine-18 Fluorodeoxyglucose
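The decision rule can be summarized in a few lines of R; the function and variable names below are illustrative only.

# Hopkins rule: each region is scored 1-5; the scan is positive for recurrence
# if the highest regional score is 4 or 5.
hopkins_positive <- function(primary, mediastinum, distant) {
  max(primary, mediastinum, distant) >= 4
}

hopkins_positive(primary = 2, mediastinum = 3, distant = 1)  # FALSE (negative scan)
hopkins_positive(primary = 2, mediastinum = 5, distant = 1)  # TRUE  (positive scan)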

To compare the relative accuracy of each rating method, we evaluated the results against a common reference standard. This standard was defined as the presence or absence of recurrence diagnosed within six months after the evaluated scan, as recorded in the SUPE_R study. Recurrence diagnoses were based on histological confirmation when possible or determined through multidisciplinary team assessment and imaging follow-up.

Image acquisition

PET/CT scans were performed using standardized protocols established at each department, adhering to European Association of Nuclear Medicine (EANM) recommendations [19]. Patients fasted for at least 4 h before the examination and then received an [18F]FDG injection of 3–4 MBq/kg. Images were acquired 60 min post-injection, covering the vertex to mid-thigh. Contrast-enhanced CT was used when not contraindicated.

Statistical analysis

The primary endpoint was overall agreement and Fleiss’ kappa (κ) for recurrence suspicion using the 3-point conventional assessment scale. Secondary endpoints included agreement using Hopkins criteria, agreement by anatomical site, and comparison of the accuracy of each rating method against the reference standard.

Interobserver agreement was assessed using overall agreement and Fleiss’ weighted kappa with linear weights [20, 21]. Fleiss’ kappa was chosen because it accommodates a non-fully crossed design with non-unique raters, allowing a larger sample of scans to be included than a fully crossed design would for the same total number of ratings. Agreement was categorized according to Landis and Koch (1977): 0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1.00 almost perfect [22]. Overall agreement was defined as the proportion of scans with identical ratings. Receiver operating characteristic (ROC) analysis was performed to evaluate the accuracy of each scoring method against the reference standard.

All statistical analyses were conducted using R software (version 4.4.1), with the irrCAC package for estimation of Fleiss’ kappa [23]. Two-sided p-values were used, with statistical significance set at p < 0.05.
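For reference, the agreement estimation might look as follows; the input layout (one row per scan, one column per team, NA where a team did not rate that scan), the numeric coding of the 3-point scale, and the object name ratings are assumptions, not the study’s actual code.

library(irrCAC)

# ratings: 150 x 9 data frame of ordinal conventional-assessment scores
# (1 = no suspicion, 2 = equivocal, 3 = suspicious); each row holds exactly
# two non-missing entries, one per reviewing team.
fk <- fleiss.kappa.raw(ratings, weights = "linear")
fk$est  # point estimate, standard error, and 95% confidence interval

# Landis and Koch interpretation of the point estimate
cut(fk$est$coeff.val,
    breaks = c(0, 0.20, 0.40, 0.60, 0.80, 1.00),
    labels = c("slight", "fair", "moderate", "substantial", "almost perfect"),
    include.lowest = TRUE)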
