Validation of an MRI-based classification of peroneus brevis tendon morphology: a four-type system with high inter-rater reliability

Ethical considerations

All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki Declaration and its later amendments or comparable ethical standards. The National Ethical Authority approved this study and waived informed consent (Dnr 2024–07283-02).

Study design and observer validation framework

This study evaluated the inter-rater agreement regarding a proposed classification of the peroneus brevis tendon form on the transverse section at the level of the lateral malleolus. This study was conducted and reported in accordance with the Guidelines for Reporting Reliability and Agreement Studies (GRRAS) [25].

MRI examinations

The dataset was compiled retrospectively from ankle MRI examinations performed at our institution between January 2021 and December 2024 on a 3 T scanner (Philips Ingenia) because of sports injuries or pain related to physical activity. Examinations were performed using a dedicated ankle coil, which ensures consistent positioning of the ankle joint. Wedge-shaped cushions placed inside the coil prevented movement or changes in ankle position during the examination. Axial proton density–weighted images were acquired with a turbo spin echo sequence using an echo time of 45 ms and a repetition time of 2800–5000 ms. The voxel dimensions were 0.45 × 0.53 × 3.0 mm (slice thickness 3 mm), with a field of view of 14 cm and no interslice gap.

In total, 439 ankle MRI examinations were reviewed. All examinations followed our standard ankle MRI protocol and had been conducted due to trauma or pain in the ankle. Only one MRI per patient was included. Examinations showing abnormal T2 or proton density signal, tenosynovitis, or tendinosis were excluded. Additional exclusion criteria were fractures, infections, tumors, transverse tendon ruptures, postoperative changes, metal or motion artifacts, duplicate scans from the same patient, age under 18 years, and missing clinical information (flowchart in Supplementary material, Figure S1).

Raters

Seven independent raters with varying professional backgrounds participated in the classification process. The group included two board-certified musculoskeletal radiologists (Rater 1 with 10 years and Rater 5 with 6 years of experience in musculoskeletal imaging), one radiology resident (Rater 6), one medical doctor (Rater 3), two physiotherapists (Raters 2 and 7), and one fifth-year medical student (Rater 4). All raters underwent a structured training session on the use of the classification system prior to the assessment.

A structured visual assessment approach was used for grading. Each rater independently evaluated the tendons and assigned one of four predefined morphological types. The classification system was developed based on a preliminary study conducted by our team and was further refined through pilot consensus rounds prior to the main study. The reference standard was established through consensus agreement between two experienced musculoskeletal radiologists. To assess inter-rater reliability, each rater classified all cases. The classifications were performed independently, and raters were blinded to clinical information, imaging metadata, and each other’s assessments. To assess intra-rater reliability, a random subset of 30% of cases was re-evaluated by each rater after a three-week washout period [26].
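The draw of the re-evaluation subset can be illustrated with a minimal Python sketch. The case identifiers, the seed, and the round-to-nearest rule are assumptions made for illustration only; the text states just that a random 30% of cases was re-read after three weeks.

```python
import random


def draw_retest_subset(case_ids, fraction=0.30, seed=2024):
    """Return a reproducible random subset of case IDs for re-reading.

    The seed and the rounding rule are assumptions, not taken from the study.
    """
    rng = random.Random(seed)             # fixed seed so the draw can be repeated
    k = round(fraction * len(case_ids))   # number of cases to re-evaluate
    return sorted(rng.sample(list(case_ids), k))


# Hypothetical identifiers for 130 cases (the sample size from the power analysis).
cases = [f"case_{i:03d}" for i in range(1, 131)]
subset = draw_retest_subset(cases)
print(len(subset), subset[:3])
```

Fixing the seed keeps the subset identical for every rater, so intra-rater statistics are computed on the same cases.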

Data acquisition and preprocessing

Sampling

Convenience sampling was used to select all available MRI scans of the ankle that met the inclusion criteria. A power analysis was performed prior to the study to determine the sample size. With power > 0.8, alpha = 0.05, an effect size of 0.31 based on our preliminary study, and a 10% buffer, 130 participants were required.

Preprocessing

Analysis of the peroneus brevis tendon shape was based on proton density–weighted axial images. Before starting the main study, we reviewed 40 MRI examinations with normal peroneal tendons to estimate the required effect size. In this preliminary study, we observed that the cross-sectional shape of the tendon remained consistent throughout the short retromalleolar segment. Based on these findings, we considered the use of a single, clearly defined axial slice to be a reliable and reproducible approach for classification across cases. A senior musculoskeletal radiologist selected a representative image at the level of the lateral malleolus, inferior to the syndesmosis, at the level of the ankle joint space.

The selected images were exported, and an electronic form containing the images and four classification options (general flat, flat with a medial bulge, flat with a lateral bulge, and oval tendon) was created and distributed to the raters. Each rater received detailed instructions, including definitions of the tendon forms and examples (not drawn from the evaluated cases), and assessed the tendons using a standardized protocol.

Proposed classification

The proposed classification (nominal variable) is based on the visual assessment of the ratio of thickness (the shortest diameter) to the width of the tendon (the longest diameter). The four types are defined below (Fig. 1).

Fig. 1

Axial section at the level of the lateral malleolus and the ankle joint space (magnetic resonance images with proton-density weighting, left sides). The four morphological types of the peroneus brevis tendon (arrow) are as follows: a general flat, b flat with a lateral bulge, c flat with a medial bulge, and d oval tendon. The fibula is labeled with f

General flat tendon

The thickness of the tendon (defined as the anteroposterior [AP] dimension) is relatively consistent across its length. The width of the tendon is significantly greater than its thickness. A tendon of this form can be straight, curved, or folded.

General flat with a lateral bulge

The tendon is flattened, but the lateral edge bulges, increasing the thickness in the lateral part.

General flat with a medial bulge

The tendon is flattened, but the medial edge bulges, resulting in an increase in thickness in the medial part.

Oval tendon

The peroneus brevis tendon takes the shape of an oval, resembling the cross-section of the peroneus longus tendon. The thickness is slightly smaller than the width.

Gold standard

The gold standard was the consensus classification reached by two musculoskeletal radiologists.

Generative AI tools

Generative AI tools, specifically ChatGPT (OpenAI, GPT-4 architecture), were used to assist in language correction, improving the clarity and readability of the manuscript. No generative AI tools were used for data analysis, research design, interpretation of results, or the formation of scientific conclusions. The authors carefully reviewed, revised, and approved all AI-generated edits to ensure accuracy and integrity.

Statistical analysis

R version 4.4.3 (R Core Team, Vienna, Austria) and RStudio version 2024.12.1+563 were used for statistical analysis and data visualization. The ggplot2 package was used to visualize the data. Inter-rater agreement across the seven raters was assessed using Fleiss’ kappa with a bootstrapped 95% confidence interval. For pairwise agreement, Cohen’s kappa and Gwet’s AC1 were calculated between all rater pairs. The same statistics were used to assess intra-rater reliability after the 3-week washout period. Gwet’s AC1 was included as a robust alternative to Cohen’s kappa, particularly in view of potential prevalence and marginal distribution imbalances.
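For readers who want to see how these agreement statistics are defined, the following is a minimal Python sketch using only the standard library. It is not the authors' R code: the function names, the toy ratings, the percentile bootstrap (resampling cases with replacement), and the number of resamples are all assumptions made for illustration.

```python
import random
from collections import Counter
from itertools import combinations


def fleiss_kappa(ratings):
    """Fleiss' kappa; `ratings` is a list of cases, each a list with one label per rater."""
    n_cases, n_raters = len(ratings), len(ratings[0])
    label_totals, agreement_sum = Counter(), 0.0
    for row in ratings:
        counts = Counter(row)
        label_totals.update(counts)
        # Proportion of agreeing rater pairs for this case.
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1))
    p_obs = agreement_sum / n_cases
    p_exp = sum((t / (n_cases * n_raters)) ** 2 for t in label_totals.values())
    if p_exp == 1.0:
        return None  # undefined when every rating is the same label
    return (p_obs - p_exp) / (1.0 - p_exp)


def bootstrap_ci(ratings, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for Fleiss' kappa (cases resampled with replacement)."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        k = fleiss_kappa([rng.choice(ratings) for _ in ratings])
        if k is not None:  # skip degenerate resamples
            stats.append(k)
    stats.sort()
    return (stats[int(alpha / 2 * len(stats))],
            stats[int((1 - alpha / 2) * len(stats)) - 1])


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' label sequences."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[l] * count_b[l] for l in count_a) / (n * n)
    if p_exp == 1.0:
        return None
    return (p_obs - p_exp) / (1.0 - p_exp)


def gwet_ac1(a, b, n_categories=4):
    """Gwet's AC1 for two raters; n_categories is the size of the rating scale."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Average marginal proportion per label; unused labels contribute zero.
    pi = [(count_a[l] + count_b[l]) / (2 * n) for l in set(a) | set(b)]
    p_exp = sum(p * (1 - p) for p in pi) / (n_categories - 1)
    return (p_obs - p_exp) / (1.0 - p_exp)


# Toy data (invented): 6 cases rated by 7 raters.
toy = [
    ["flat"] * 7,
    ["flat"] * 5 + ["medial"] * 2,
    ["lateral"] * 6 + ["flat"],
    ["medial"] * 7,
    ["oval"] * 6 + ["medial"],
    ["oval"] * 7,
]
print("Fleiss kappa:", fleiss_kappa(toy), "95% CI:", bootstrap_ci(toy))
by_rater = list(zip(*toy))  # one label sequence per rater
for i, j in combinations(range(len(by_rater)), 2):
    print(i + 1, j + 1, cohen_kappa(by_rater[i], by_rater[j]),
          gwet_ac1(by_rater[i], by_rater[j]))
```

AC1 replaces the chance-agreement term of Cohen's kappa with one based on the averaged marginals, which is why it is less sensitive to a dominant category.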

The F1 score, defined as the harmonic mean of precision and recall, was calculated for each rater to assess the performance of the proposed classification. This metric reflects the balance between correctly identifying tendon classifications (precision) and detecting all cases of a given class (recall). The F2 score (β = 2) was calculated to account for cases where recall was prioritized. In addition, a majority vote approach was applied, where the most frequently assigned tendon form among the seven raters was selected as the predicted class. Then, the majority vote classifications were compared to the gold standard (defined as a consensus between two musculoskeletal radiologists) to calculate the overall classification performance metrics.
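The F-scores and the majority vote described above can be sketched in a few lines of Python. This is an illustration rather than the authors' R implementation; the label strings, the toy data, and the tie-breaking rule of the vote (the first label to reach the top count wins) are assumptions, since the text does not say how ties were handled.

```python
from collections import Counter

# Short label strings are placeholders for the four tendon forms.
LABELS = ["flat", "lateral bulge", "medial bulge", "oval"]


def majority_vote(row):
    """Most frequent label among raters; ties go to the label seen first (assumption)."""
    return Counter(row).most_common(1)[0][0]


def precision_recall(y_true, y_pred, label):
    """One-versus-rest precision and recall for a single label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives F1, beta=2 weights recall more heavily."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def macro_f_beta(y_true, y_pred, labels, beta=1.0):
    """Unweighted mean of the per-label F-beta scores."""
    scores = [f_beta(*precision_recall(y_true, y_pred, lab), beta) for lab in labels]
    return sum(scores) / len(scores)


# Toy example (invented): gold standard and three raters for five cases.
gold = ["flat", "oval", "medial bulge", "lateral bulge", "flat"]
ratings = [
    ["flat", "flat", "oval"],
    ["oval", "oval", "oval"],
    ["medial bulge", "flat", "medial bulge"],
    ["lateral bulge", "lateral bulge", "flat"],
    ["flat", "medial bulge", "flat"],
]
voted = [majority_vote(row) for row in ratings]
print("macro F1:", macro_f_beta(gold, voted, LABELS, beta=1.0))
print("macro F2:", macro_f_beta(gold, voted, LABELS, beta=2.0))
```

With β = 2 the score moves toward recall, so a rater who misses cases of a given form is penalized more than one who over-assigns it.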

Receiver operating characteristic (ROC) curve analysis was conducted by using a cumulative one-versus-rest strategy, comparing each tendon form against all others. Macro-averaged values for precision, recall, the F1 score, and the area under the curve (ROC-AUC) are reported to ensure balanced evaluation across all tendon categories.
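A one-versus-rest ROC-AUC needs a continuous score per case and class, and the text does not specify how that score was derived from categorical ratings. The Python sketch below therefore makes an explicit assumption: the score for a class is the fraction of raters who assigned it. The AUC is then computed as the probability that a positive case outranks a negative one (ties count one half), and the per-class values are macro-averaged. All names and data are illustrative.

```python
def vote_fraction(row, label):
    """Share of raters who assigned `label` to a case (assumed scoring rule)."""
    return row.count(label) / len(row)


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def macro_ovr_auc(ratings, gold, labels):
    """Macro-average of one-versus-rest AUCs; classes absent from `gold` are skipped."""
    aucs = []
    for label in labels:
        pos = [vote_fraction(r, label) for r, g in zip(ratings, gold) if g == label]
        neg = [vote_fraction(r, label) for r, g in zip(ratings, gold) if g != label]
        if pos and neg:
            aucs.append(auc(pos, neg))
    return sum(aucs) / len(aucs)


# Toy example (invented): four cases, three raters each.
labels = ["flat", "lateral bulge", "medial bulge", "oval"]
gold = ["flat", "oval", "medial bulge", "lateral bulge"]
ratings = [
    ["flat", "flat", "oval"],
    ["oval", "oval", "oval"],
    ["medial bulge", "flat", "medial bulge"],
    ["lateral bulge", "lateral bulge", "flat"],
]
print("macro one-vs-rest AUC:", macro_ovr_auc(ratings, gold, labels))
```

The pairwise comparison is quadratic in the number of cases, which is acceptable for a sample of this size; a rank-based formula would be the usual choice for larger datasets.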

To identify misclassification patterns, confusion matrices were generated for each rater and for the majority vote. These matrices were used to visualize common errors and to highlight areas of disagreement or uncertainty between raters.
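A confusion matrix of this kind can be built as in the following minimal Python sketch, with rows for the gold-standard class and columns for the assigned class. The label strings, the toy data, and the helper that ranks off-diagonal cells are illustrative assumptions, not the authors' R code.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows index the gold-standard label, columns the assigned label."""
    index = {lab: i for i, lab in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def top_confusions(matrix, labels):
    """Off-diagonal cells sorted by count: (gold label, assigned label, count)."""
    errors = [
        (labels[i], labels[j], matrix[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and matrix[i][j] > 0
    ]
    return sorted(errors, key=lambda e: e[2], reverse=True)


# Toy example (invented): one rater against the gold standard.
labels = ["flat", "lateral bulge", "medial bulge", "oval"]
gold = ["flat", "flat", "oval", "medial bulge", "lateral bulge", "flat"]
rater = ["flat", "medial bulge", "oval", "flat", "lateral bulge", "medial bulge"]

cm = confusion_matrix(gold, rater, labels)
for lab, row in zip(labels, cm):
    print(f"{lab:>14}", row)
print(top_confusions(cm, labels))
```

Sorting the off-diagonal cells gives a quick list of the pairs of forms that are most often mistaken for each other.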
