External validation of an artificial intelligence tool for fracture detection in children with osteogenesis imperfecta: a multireader study

Ethical approval

Ethical approval was provided by the National Health Service (NHS) Health Research Authority (HRA) (IRAS ID: 274278, REC reference 22/PR/0334). Our methodology follows the latest CLAIM [12] guidelines.

Imaging case selection

A retrospective review of all consecutive appendicular and pelvic radiographs at Great Ormond Street Hospital for Children was performed over a 13-month study period (1/4/2021–30/4/2022) for children < 18 years of age with a genetically confirmed diagnosis of osteogenesis imperfecta (OI).

Radiographs were extracted from the local PACS, anonymised, and uploaded to a cyber-secure, multireader image-viewing platform (Collective Minds Radiology®). Additional details are in the Electronic supplementary material.

Data collection and labelling

The ground truth was determined by the consensus opinion of two consultant paediatric radiologists (each with > 10 years of paediatric radiology experience), with knowledge of the clinical indication and access to prior examinations, where present. If a fracture was present, the abnormal region on the radiograph was highlighted with a bounding box and labelled as either acute or healing (healing was defined as the presence of callus or periosteal reaction with a visible residual fracture line). Where there was no longer a visible fracture line, the bone was considered to be ‘remodelled’ and was not labelled as a fracture.

Multireader trial design

Seven radiologists with a special interest in paediatric radiology (five consultants, each with > 5 years of paediatric radiology experience, and two trainees, each with 6 months of paediatric radiology experience) were recruited from five different institutions across the UK. None had reported any of the OI cases in our multi-case dataset, and we did not provide the radiologists with any clinical information or prior images for comparison, apart from informing them of the study design. Radiographic examinations in this study were initially extracted as DICOM files from the hospital PACS. DICOM tags containing any patient-identifiable information (e.g., name, hospital number, age, sex, weight, address) were removed, leaving only relevant technical information regarding the image and mode of acquisition (e.g., resolution, pixel size) to allow for AI analysis.
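A minimal sketch of this de-identification step is shown below, assuming the widely used pydicom library (the study does not state which tooling was used); the tag list is illustrative rather than exhaustive.

```python
# Hedged sketch: strip patient-identifiable DICOM tags while keeping
# technical acquisition information, assuming pydicom. The tag list below
# is illustrative only; a production pipeline would follow a full
# de-identification profile.
import pydicom

IDENTIFIABLE_TAGS = [
    "PatientName", "PatientID", "PatientBirthDate", "PatientSex",
    "PatientWeight", "PatientAddress", "ReferringPhysicianName",
    "InstitutionName",
]

def anonymise(path_in, path_out):
    ds = pydicom.dcmread(path_in)
    for keyword in IDENTIFIABLE_TAGS:
        if keyword in ds:
            delattr(ds, keyword)  # remove the data element entirely
    # Technical tags (e.g. Rows, Columns, PixelSpacing) are left untouched
    # so downstream AI analysis can still interpret the image.
    ds.save_as(path_out)
```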

The multireader study was conducted in two separate rounds, with individual radiologists acting as their own controls across the rounds of image interpretation. Definitions for acute and healing fractures were discussed, and an online tutorial on how to use the imaging platform was provided to all radiologists prior to the study. In the first reading round (30/11/2022 to 30/1/2023), radiologists independently reviewed the radiographs and annotated suspected fractures with bounding boxes, labelling each as acute or healing.

An 8-week washout period was then observed to minimise recall bias. In the second round (16/4/2023 to 16/6/2023), radiologists were assisted by AI. The same randomly re-ordered set of images was re-annotated by the radiologists with the output of the AI tool (Milvue Suite, SmartUrgences®) available for guidance. The output generated by the AI included bounding boxes around areas with a high probability of fracture; however, the AI did not have the capability to describe fractures as acute or healing. Radiologists were asked to review the AI output before re-annotating the images.

Interventions

Milvue Suite-SmartUrgences® (v1.26) is a CE-marked MDR class IIa medical device, developed and commercialised by Milvue. The training and development of the AI tool have already been published elsewhere in detail [13,14,15]; briefly, training was conducted on a multicentre dataset of more than 600,000 chest and musculoskeletal radiographs covering seven key pathologies (fracture, pleural effusion, lung opacification, joint effusion, lung nodules, pneumothorax, and joint dislocation). While the exact population used to train this model is not public, personal correspondence with the vendor indicates that 8.2% of the training dataset comprised patients aged 0–12 years and 13% patients aged 13–25 years. The vendor confirms the product is intended for adult and paediatric use, with no defined lower age limit.

For each positive finding, the AI provides a binary certainty score (that is, certain/positive or uncertain/doubtful). For the purposes of this study, all positive findings, regardless of the assigned certainty, were considered positive. Only bounding boxes generated for the presence of bone fractures were considered.

Data analysis—diagnostic accuracy

This study collected four sets of results for the identification of fractures: (1) a “ground truth” defined by agreement between two consultant paediatric radiologists and used as the reference against which the other results were assessed, (2) the AI diagnoses alone, (3) the first diagnosis round by radiologists without AI assistance, and (4) the second diagnosis round by radiologists with AI assistance. After both rounds were completed, the results from radiologists without AI assistance, radiologists with AI assistance, and the AI alone were compared against the ground truth by calculating the intersection between bounding boxes. An overlap of at least 40% between a radiologist’s (or the AI’s) bounding box and a ground-truth bounding box was considered a true positive; an overlap below 40% was considered a false positive. Failure to place a bounding box on an image where a fracture was present was considered a false negative. Details on how the 40% threshold was chosen, including what is meant by overlap, are provided in the Electronic Supplementary Material.
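The rule above can be expressed as a short check, sketched below; the precise definition of overlap used in the study is the one given in the Electronic Supplementary Material, so the intersection-over-ground-truth-area form used here is an assumption for illustration only.

```python
# Illustrative sketch of the 40% overlap rule. Boxes are given as
# (x_min, y_min, x_max, y_max) in pixel coordinates. The exact overlap
# definition is specified in the Electronic Supplementary Material; here we
# assume intersection area divided by ground-truth box area.

def overlap_fraction(box, truth):
    ix = max(0.0, min(box[2], truth[2]) - max(box[0], truth[0]))
    iy = max(0.0, min(box[3], truth[3]) - max(box[1], truth[1]))
    truth_area = (truth[2] - truth[0]) * (truth[3] - truth[1])
    return (ix * iy) / truth_area if truth_area > 0 else 0.0

def classify_box(box, truth_boxes, threshold=0.40):
    """A reader/AI box counts as a true positive if it overlaps any
    ground-truth box by at least the threshold; otherwise a false positive."""
    if any(overlap_fraction(box, t) >= threshold for t in truth_boxes):
        return "TP"
    return "FP"
```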

The results were computed at a per-fracture, per-image and per-examination level for each radiologist.

At the per-fracture level, correct identification of a fracture required a bounding box drawn by a radiologist to overlap a ground-truth bounding box by at least 40% (the strictest definition of fracture detection).

At the per-image level, any image on which a radiologist drew a bounding box was counted as positive, and correctness was determined by whether that image contained at least one ground-truth fracture bounding box, irrespective of box location or the number of fractures.

At the per-examination level, an examination was considered positive for a radiologist when at least one of its images was classified as positive by that radiologist, irrespective of whether it was the correct projection or image (the loosest and most generous definition of fracture detection).
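A sketch of how these three levels relate is given below; the data structures (lists of reader and ground-truth boxes per image) are illustrative assumptions, not the study's actual analysis code.

```python
# Hedged sketch of the per-image and per-examination classifications,
# following the definitions above. Each image is represented by two lists:
# the reader's bounding boxes and the ground-truth bounding boxes.

def classify_image(reader_boxes, truth_boxes):
    """Per-image level: any reader box makes the image a positive call;
    correctness only requires that the image contain at least one
    ground-truth fracture, irrespective of box location or count."""
    reader_pos, truth_pos = bool(reader_boxes), bool(truth_boxes)
    return {(True, True): "TP", (True, False): "FP",
            (False, True): "FN", (False, False): "TN"}[(reader_pos, truth_pos)]

def classify_examination(images):
    """Per-examination level (the loosest definition): the examination is a
    positive call if any of its images was called positive, and truly
    contains a fracture if any image has a ground-truth box."""
    reader_pos = any(bool(r) for r, _ in images)
    truth_pos = any(bool(t) for _, t in images)
    return {(True, True): "TP", (True, False): "FP",
            (False, True): "FN", (False, False): "TN"}[(reader_pos, truth_pos)]
```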

The confidence intervals for diagnostic accuracy, sensitivity, and specificity were “exact” Clopper-Pearson confidence intervals [16], while those for the positive and negative predictive values (PPV and NPV) used a variation of the Wilson confidence interval [17] computed using the efficient-score method developed by Robert Newcombe [18], at each of the three levels of results.
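As a worked illustration (not the study's own code), both interval types are available in standard Python statistics libraries; whether the published figures were produced with exactly these routines is an assumption, and the counts below are invented purely for demonstration.

```python
# Minimal sketch: "exact" Clopper-Pearson and Wilson score intervals via
# statsmodels. Counts are illustrative only, not study results.
from statsmodels.stats.proportion import proportion_confint

tp, fn, fp = 42, 8, 6                      # hypothetical per-fracture counts
sens_lo, sens_hi = proportion_confint(tp, tp + fn, alpha=0.05, method="beta")   # Clopper-Pearson
ppv_lo, ppv_hi = proportion_confint(tp, tp + fp, alpha=0.05, method="wilson")   # Wilson score

print(f"Sensitivity 95% CI: ({sens_lo:.3f}, {sens_hi:.3f})")
print(f"PPV 95% CI: ({ppv_lo:.3f}, {ppv_hi:.3f})")
```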

Inter- and intra-reader variability

Intra-reader agreement was assessed through the changes in diagnoses made by a radiologist between rounds 1 and 2.

At the per-examination and per-image levels, the number of examinations and images for each radiologist was the same in both rounds, enabling a direct comparison. We further analysed the changes by assessing whether a change was from a positive diagnosis to a negative diagnosis or vice versa, and whether that change was made in alignment with the result from the AI (meaning the radiologist’s classification in round 2 was the same as the AI’s).

The per-fracture level has an inconsistent number of results for each radiologist between rounds, preventing this analysis from being conducted in the same manner. Therefore, changes to false-positive fracture identifications were only considered when their number within an image changed between rounds, with each added or removed false positive counted as one change from or to a true negative, respectively.
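A sketch of the per-image change tabulation is shown below; the dictionaries mapping image identifiers to positive/negative calls are assumed data structures introduced here for illustration.

```python
# Hedged sketch of the intra-reader change analysis at the per-image level:
# count flips between rounds and how many of those flips agreed with the
# standalone AI call. Variable names are illustrative.

def summarise_changes(round1, round2, ai):
    """round1, round2, ai: dicts mapping image_id -> bool (positive call)."""
    changes = {"pos_to_neg": 0, "neg_to_pos": 0, "aligned_with_ai": 0}
    for image_id in round1:
        before, after = round1[image_id], round2[image_id]
        if before == after:
            continue
        changes["pos_to_neg" if before else "neg_to_pos"] += 1
        if after == ai[image_id]:
            changes["aligned_with_ai"] += 1
    return changes
```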

Agreement between radiologists’ results was calculated at the per-examination and per-image levels using the Cohen’s kappa implementation in the Python SciPy metrics library [19]. As suggested by Hallgren [20], Cohen’s kappa was computed for each reader pair and the arithmetic mean taken as a single value measuring inter-reader agreement. The same function was used at the per-fracture level, with the adjustment that only the first false-positive fracture within an image was considered, owing to the same limitation described for the per-fracture intra-reader agreement.

Agreement between the radiologists and the AI was determined in a similar manner, by taking the mean of Cohen’s kappa between each radiologist and the AI.
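One way to reproduce the pairwise-mean kappa is sketched below; it uses scikit-learn's cohen_kappa_score as a stand-in for the implementation cited above, which is an assumption rather than the study's actual code.

```python
# Sketch: Cohen's kappa for every reader pair, averaged (Hallgren's
# approach), plus the mean kappa of each radiologist against the AI.
from itertools import combinations
from statistics import mean
from sklearn.metrics import cohen_kappa_score

def mean_pairwise_kappa(ratings):
    """ratings: dict mapping reader name -> list of per-image labels
    (e.g. 0/1 for fracture absent/present), in the same case order."""
    pairs = combinations(ratings, 2)
    return mean(cohen_kappa_score(ratings[a], ratings[b]) for a, b in pairs)

def mean_kappa_with_ai(ratings, ai_labels):
    """Mean agreement between each radiologist and the AI."""
    return mean(cohen_kappa_score(r, ai_labels) for r in ratings.values())
```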

Failure analysis

Cases in which a majority of radiologists (defined as five or more) initially classified the presence or absence of a fracture differently from the AI were of particular interest and were isolated to provide a clearer understanding of the relative sensitivities.

These were identified by comparing the round 1 predictions from the radiologists against the standalone AI output, at the per-fracture level.

Instances where the AI was correct and five or more radiologists were incorrect, and vice versa, were counted, with the limitation that false positives were not required to be in the same location. Changes that radiologists made to their diagnosis of an image between rounds were also analysed further to identify any trends in the AI’s influence on radiologists’ decision-making.
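A sketch of how such majority-disagreement cases might be isolated is given below; the correctness tables are assumed data structures, and the per-fracture caveat about false-positive locations noted above is not reproduced here.

```python
# Hedged sketch of the failure analysis: find fractures where the standalone
# AI was correct but five or more radiologists were not in round 1, and vice
# versa. correct[reader][fracture_id] and ai_correct[fracture_id] are assumed
# boolean tables built from the per-fracture results.

MAJORITY = 5  # five or more of the seven radiologists

def majority_disagreements(correct, ai_correct):
    ai_right_readers_wrong, readers_right_ai_wrong = [], []
    for fracture_id, ai_ok in ai_correct.items():
        readers_wrong = sum(1 for r in correct if not correct[r][fracture_id])
        readers_right = sum(1 for r in correct if correct[r][fracture_id])
        if ai_ok and readers_wrong >= MAJORITY:
            ai_right_readers_wrong.append(fracture_id)
        elif not ai_ok and readers_right >= MAJORITY:
            readers_right_ai_wrong.append(fracture_id)
    return ai_right_readers_wrong, readers_right_ai_wrong
```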
