CycleGAN models show consistent brain MRI synthesis across datasets supporting downstream tissue characterization in multiple sclerosis

Abstract

Background:

Secondary quantitative analysis of brain magnetic resonance imaging (MRI) can provide valuable information for many neurological diseases, including multiple sclerosis (MS), but it demands complete datasets that are often unavailable clinically. We investigated how image synthesis via deep learning using cycle-consistent generative adversarial networks (CycleGANs) compared with Pix2Pix as a related method, based on T1-weighted and T2-weighted brain MRI in MS, following verification on two streamlined datasets. The synthesized images were also evaluated against the source data.

Methods:

The streamlined datasets involved 1,113 healthy participants from the Human Connectome Project (HCP) and 318 participants from the Parkinson’s Progression Markers Initiative (PPMI). The MS cohort in this study included 105 participants scanned with different protocols. Image synthesis was bidirectional between T1- and T2-weighted MRI using CycleGAN with and without spectral normalization, as well as Pix2Pix. Utility testing focused on T1-weighted MRI that was most often unavailable in MS, and that involved lesion detection, brain volumetry, and lesion texture analysis.

Results:

All CycleGAN models performed competitively, while Pix2Pix performed better, mostly with streamlined datasets (p < 0.001). The average peak signal-to-noise ratio ranged from 24.860 to 28.570 for CycleGAN versus 28.520 to 31.100 for Pix2Pix, and the structural similarity index ranged from 0.838 to 0.901 versus 0.924 to 0.943. With spectral normalization, CycleGAN improved in PPMI but not in HCP and generally not in MS (p < 0.001). Furthermore, the synthesized images showed high similarity to the source data in utility tests, although Pix2Pix T1 images appeared more heterogeneous in lesion texture than source T1 images.

Conclusion:

CycleGAN without spectral normalization appeared feasible for synthesizing common clinical brain MRI, including T1-weighted images usable for subsequent quantitative analysis in MS.

1 Introduction

Magnetic resonance imaging (MRI) is an essential tool for the evaluation and management of neurological diseases such as multiple sclerosis (MS) (Thompson et al., 2018). Clinical MS imaging, including T1-weighted (T1w) and T2-weighted (T2w) brain MRI, provides valuable diagnostic information, such as the number and activity of focal lesions (McGinley et al., 2021). However, these features reflect only part of MS pathology (Fisniku et al., 2008). Secondary analysis of clinical MRI using computational approaches has the potential to characterize both visible and invisible tissue changes, thereby generating competitive data-driven hypotheses (Vázquez-Marrufo et al., 2023). Nonetheless, such approaches demand complete datasets, which are often limited in a clinical setting due to constraints in time or cost. There is a clear need to investigate new methods, such as deep learning techniques, that can synthesize missing imaging sequences from available data.

One common deep learning method for image synthesis is the generative adversarial network (GAN) (Goodfellow et al., 2020; Xu et al., 2022). GANs work by combining two neural networks, namely a generator and a discriminator, which create new images and determine their validity, respectively. These methods have shown promise in different applications, including synthesizing brain MRI to support segmentation of brain tumors (Nema et al., 2020) and MS lesions (Zhang et al., 2018). However, base models of GANs are not without limitations, as demonstrated by training instability in their discriminators due to inefficient learning, especially in cross-domain image translation (Lee and Choi, 2020). Developments are underway to overcome these challenges, including applying weight clipping (Arjovsky et al., 2017) or spectral normalization (weight control) (Lin et al., 2021) to the discriminators, and developing new architectures such as conditional GAN.

Two recognized conditional GANs are Pix2Pix and cycle-consistent GAN (CycleGAN). Pix2Pix targets paired image synthesis and is trained unidirectionally (Isola et al., 2017). CycleGAN, on the other hand, can handle unpaired images, enabled by the inclusion of two pairs of generators and discriminators (Zhu et al., 2017). Various studies demonstrate the utility of these models. With brain MRI of MS, Pix2Pix has assisted in image sequence translation and enhancement for cortical (Finck et al., 2022) and white matter lesion segmentation (La Rosa et al., 2021). CycleGAN has demonstrated competence in creating pseudo-healthy and lesion images (Basaran et al., 2022), and in adapting MRI between centers for segmenting white matter MS lesions (Julián Alberto et al., 2020). Furthermore, spectral normalization has demonstrated value in stabilizing CycleGAN in cross-modality image translation from computed tomography (CT) to MRI (Xu et al., 2020). However, current studies focus mainly on synthesizing images for tissue segmentation rather than structural analysis. Additionally, few studies have investigated synthesizing frequently non-acquired sequences, such as T1w brain MRI in MS, using routine clinical care data. Routine MRI is typically more heterogeneous than streamlined datasets in both scanner type and acquisition parameters (Eche et al., 2021), requiring extra attention.

The purposes of this study were threefold. The first was to investigate how CycleGAN models trained using large, streamlined datasets compared with our heterogeneous clinical MS data in image synthesis between T1w and T2w MRI. The second was to evaluate whether and how spectral normalization impacted CycleGAN performance. The third was to explore the utility of synthesized images versus source images in different analysis tasks, focusing on T1w MRI in MS. In contemporary MS protocols, T1w MRI is not always acquired in routine diagnostic or follow-up examinations (Wattjes et al., 2021); however, many image processing pipelines require a T1w sequence, as seen in co-registration and brain volume measurement (Jenkinson et al., 2012; Fischl, 2012). T2w brain MRI is commonly available in MS protocols (Wattjes et al., 2021), but the acquisition parameters often vary across scanners. Throughout the process, a Pix2Pix model was studied for comparison.

2 Methods

2.1 Data characteristics

Two publicly available datasets were included, with each acquired using streamlined protocols. One involved healthy brain MRIs from 1,113 participants (aged 22–36 years, 606 women) in the Human Connectome Project (HCP) (Van Essen et al., 2012). The other comprised brain MRIs from 318 participants (aged 60–70 years, 125 women) in the Parkinson’s Progression Markers Initiative (PPMI) (Marek et al., 2011). Data utilization followed all required practices.

Our local dataset involved clinical brain MRI from a sample of 169 participants with relapsing–remitting MS, who were enrolled in an ongoing cohort study known as Clinical Impact of Multiple Sclerosis (Ethics ID: REB14-1926). Of these, 104 participants (73 women) who had both T1w and T2w sequences available were selected. The mean ± standard deviation age was 36.38 ± 8.89 years, disease duration was 4.81 ± 6.88 years, and the median Expanded Disability Status Scale (EDSS) was 2.0. This study was approved by the Institutional Ethics Review Board, with written informed consent obtained from each participant.

2.2 Imaging protocol

The HCP scans were acquired with a 3T scanner (Magnetom Skyra, Siemens Healthineers, Erlangen, Germany). T1w imaging used a magnetization-prepared rapid gradient-echo (MPRAGE) sequence with a repetition/echo time (TR/TE) of 2,400 ms/2.14 ms, slice thickness of 0.7 mm, field of view (FOV) of 224 mm, and a matrix size of 256 × 320. T2w MRI used the Sampling Perfection with Application optimized Contrasts using different flip-angle evolutions (SPACE) sequence with TR/TE of 3,200 ms/565 ms, slice thickness of 0.7 mm, FOV of 224 mm, and the same matrix size.

The PPMI images were acquired using 1.5T and 3T Siemens scanners, where only baseline scans were used. T1w MRI used an MPRAGE sequence with a TR/TE of 3,000 ms/2.98 ms, slice thickness of 1 mm, FOV of 256 mm, and a matrix size of 240 × 256. T2w MRI used a turbo spin-echo sequence with a TR/TE of 3,000 ms/101 ms, slice thickness of 3 mm, FOV of 256 mm, and a matrix size of 240 × 256.

Our MS imaging involved both Siemens and GE scanners at 1.5T or 3T. T1w MRI used either an MPRAGE or fast spoiled gradient-echo sequence, with TR of 4.3–2,490 ms, TE of 2.08–33.96 ms, slice thickness of 1–3 mm, and matrix size of 192 × 256 to 512 × 512. T2w MRI applied mainly a spin-echo sequence, with TR of 2,970–11,470 ms, TE of 83–122.90 ms, slice thickness of 3 mm, and matrix size of 256 × 256 to 512 × 512 (Figure 1).

Figure 1. Example brain MRI protocols of high and low heterogeneity from clinically standard MS (left 2 columns) and streamlined HCP datasets (right 2 columns), respectively. Shown are T1-weighted (top rows) and T2-weighted (bottom rows) images associated with three common protocol settings: matrix size, slice thickness, and repetition and echo times (TR & TE) that can impact image size, resolution, and anatomical contrast. Under relaxation time, the T2-weighted MRIs of clinical MS also show different matrix sizes of 336 × 384 and 392 × 488, while the HCP images have the same image size.

2.3 Image preprocessing

Four steps were performed to optimize image quality. These included nonuniformity correction using the N4ITK algorithm of Advanced Normalization Tools (ANTs) (Tustison et al., 2010) and brain extraction using FSL (FMRIB Software Library, Oxford, UK). All scans were then linearly co-registered to the standard MNI152 1-mm T1 template using ANTs. Finally, image intensity was normalized to the range 0–1 for subsequent modeling.
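The final preprocessing step, intensity normalization to the range 0–1, can be sketched as below. This is a minimal illustration that assumes a simple per-volume min-max rescaling; the text does not state whether min-max, percentile clipping, or another scheme was used, and the function name `normalize_intensity` and the toy array are hypothetical.

```python
import numpy as np

def normalize_intensity(volume, eps=1e-8):
    """Rescale a volume linearly so its intensities lie in [0, 1].

    Assumption: plain min-max scaling per volume (not confirmed by the source).
    """
    volume = np.asarray(volume, dtype=np.float64)
    v_min, v_max = volume.min(), volume.max()
    if v_max - v_min < eps:
        # Constant image: return zeros rather than dividing by ~0.
        return np.zeros_like(volume)
    return (volume - v_min) / (v_max - v_min)

# Toy array standing in for a bias-corrected, brain-extracted, co-registered scan.
toy_volume = np.array([[[0.0, 50.0], [100.0, 200.0]]])
normalized = normalize_intensity(toy_volume)
```

In a real pipeline this step would run after N4 correction, brain extraction, and registration, so that background voxels and residual bias do not dominate the minimum and maximum.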

2.4 Image synthesis using CycleGAN

2.4.1 Model development for Vanilla CycleGAN

We started from a recognized two-dimensional (2D) CycleGAN architecture (Figure 2) (Zhu et al., 2017). Refinement was performed on hyperparameters that often strongly affect model stability and output, including the number of down-/up-sampling blocks and the number of residual blocks in the generator, the down-sampling blocks in the discriminator, as well as the initial learning rate and the decay schedule. Candidate settings were chosen based on measures of validation image quality by peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) across training epochs (Supplementary Figure 1). Accordingly, each model was trained for 200 epochs, by which point the metrics had plateaued, with a batch size of 1 as commonly done in CycleGAN. The initial learning rate was set at 0.0002 and was reduced by a factor of 10 at the 100th epoch. The Adam optimizer (Kingma and Ba, 2014) was used to minimize loss functions associated with the generators and discriminators. In the generators, each input image of dimension 256 × 256 × 1 was encoded to 16 × 16 × 512 and then gradually restored to the initial dimension. The discriminators started with the same input dimension and reduced it to 16 × 16 × 1. These settings were used in the analysis of all three datasets investigated in this study to ensure consistency. For each study cohort, the prepared data were randomly split at 0.8/0.1/0.1 for training, validation, and testing, respectively, with splitting randomness controlled by a random seed of 42. All training procedures were conducted using an Nvidia A100 80 GB GPU unit with the TensorFlow backend (Abadi et al., 2016).
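The data split and learning-rate schedule described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: it assumes the split was made at the participant level, that subset sizes were rounded to the nearest integer, and that epochs are counted from zero. The helper names `split_subjects` and `learning_rate` are hypothetical.

```python
import numpy as np

def split_subjects(subject_ids, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle subject IDs with a fixed seed and cut into train/val/test.

    Assumption: subject-level split with rounded subset sizes; the test
    subset receives whatever remains after training and validation.
    """
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(list(subject_ids)))
    n = len(shuffled)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

def learning_rate(epoch, initial_lr=2e-4, drop_epoch=100, factor=10.0):
    """Step schedule: initial rate until drop_epoch, then divided by factor."""
    return initial_lr if epoch < drop_epoch else initial_lr / factor

# Example with a cohort of 104 hypothetical subject indices.
train_ids, val_ids, test_ids = split_subjects(range(104))
```

Under these assumptions, cohorts of 1,113, 318, and 104 participants would yield test subsets of 112, 32, and 11, which agrees with the N values listed in Table 1, although the actual splitting code is not given in the text.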

Figure 2. CycleGAN architecture used in the study. The top panels illustrate the procedures for synthesizing T1-weighted (T1w) from T2-weighted (T2w) images (left), and vice versa (right). The bottom panels show the feature numbers corresponding to individual layers of a generator (G) and discriminator (D). C-L, cycle loss between source and reconstructed images; A-L, adversarial loss for a discriminator; GT2w and GT1w, generators for T2w and T1w, respectively; DT1w and DT2w, discriminators for T1w and T2w, respectively.

2.4.2 Model development for CycleGAN with spectral normalization

These models were developed using the same procedures as vanilla CycleGAN, except for the addition of spectral normalization. The latter was added to the convolutional layers of the discriminators by scaling the weight matrices by their maximum singular values (the spectral norms) (Jenkinson et al., 2012). Weight matrices normalized this way were expected to control disproportionate feature amplifications for consistent backpropagation of gradients in the discriminators. To further understand the impact of spectral normalization, imaging contrast variations with or without its addition were also explored based on the healthy HCP data.
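To make the scaling step concrete, the sketch below estimates the largest singular value of a weight matrix by power iteration and divides the weights by it. It is a minimal NumPy illustration rather than the training code used in the study; the flattening of a convolution kernel to a 2D matrix with one column per output channel, and the function names, are assumptions.

```python
import numpy as np

def spectral_norm(weight, n_iter=100, seed=0):
    """Estimate the largest singular value of a 2D matrix by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(weight.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        v = weight.T @ u
        v /= np.linalg.norm(v)
        u = weight @ v
        u /= np.linalg.norm(u)
    # With u and v near the leading singular vectors, u^T W v approximates sigma_max.
    return float(u @ weight @ v)

def spectrally_normalize(kernel):
    """Divide a conv kernel by its spectral norm.

    Assumption: the kernel has shape (height, width, in_channels, out_channels)
    and is flattened to (height * width * in_channels, out_channels).
    """
    matrix = kernel.reshape(-1, kernel.shape[-1])
    return kernel / spectral_norm(matrix)

rng = np.random.default_rng(1)
kernel = rng.standard_normal((4, 4, 3, 8))
normalized_kernel = spectrally_normalize(kernel)
```

After normalization the largest singular value of the flattened kernel is approximately 1, which bounds how much a single layer can amplify its input. Practical implementations usually run only one power iteration per training step and reuse the vector `u` across steps, rather than iterating to convergence as done here.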

2.5 Image synthesis with Pix2Pix

The Pix2Pix model was implemented based on a published article (Isola et al., 2016). In contrast to CycleGAN, this method was composed of one pair of generator and discriminator, with the hyperparameters decided based on established practices for GANs and empirical tuning. Specifically, the model was trained for 200 epochs, with a batch size of 1, and hinge loss optimized by Adam. The initial learning rate was 0.0001. Data splits for training, validation, and testing were conducted in the same way as for CycleGAN with each dataset.
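For reference, the commonly used hinge formulation of the adversarial loss is sketched below. The text states only that a hinge loss was optimized with Adam, so the exact form here is an assumption, as is the omission of any L1 reconstruction term (the original Pix2Pix formulation combines an adversarial loss with an L1 term; whether that term was retained is not stated). Function names are hypothetical.

```python
import numpy as np

def discriminator_hinge_loss(real_logits, fake_logits):
    """Hinge loss for the discriminator: push real scores above +1, fake below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - np.asarray(real_logits)))
    fake_term = np.mean(np.maximum(0.0, 1.0 + np.asarray(fake_logits)))
    return float(real_term + fake_term)

def generator_hinge_loss(fake_logits):
    """Hinge-style generator loss: raise the discriminator score of fake images."""
    return float(-np.mean(np.asarray(fake_logits)))

# Toy discriminator outputs for two real and two synthesized patches.
real_scores = np.array([2.0, 0.5])
fake_scores = np.array([-2.0, 0.0])
d_loss = discriminator_hinge_loss(real_scores, fake_scores)  # 0.25 + 0.5 = 0.75
g_loss = generator_hinge_loss(fake_scores)                   # 1.0
```

Scores that are already on the correct side of the margin contribute nothing to the discriminator loss, which is the property usually credited for more stable training than an unbounded objective.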

2.6 Model evaluation

Based on the held-out testing datasets, CycleGAN models with or without spectral normalization were evaluated using three common metrics: PSNR, SSIM, and mean absolute error (MAE). These metrics were calculated per image slice and averaged per person per dataset for subsequent analyses.
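The per-slice computation and per-person averaging can be illustrated with PSNR and MAE as below. This is a sketch under stated assumptions: intensities are taken to lie in 0–1 (so the PSNR data range is 1), and the function names are hypothetical. SSIM is omitted because it requires a windowed computation; scikit-image offers one as `skimage.metrics.structural_similarity`, though the text does not say which implementation was used.

```python
import numpy as np

def psnr(reference, synthesized, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((reference - synthesized) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(data_range ** 2 / mse))

def mae(reference, synthesized):
    """Mean absolute error between two images."""
    return float(np.mean(np.abs(reference - synthesized)))

def subject_scores(reference_slices, synthesized_slices):
    """Compute metrics per slice, then average over the slices of one subject."""
    pairs = list(zip(reference_slices, synthesized_slices))
    return {
        "psnr": float(np.mean([psnr(r, s) for r, s in pairs])),
        "mae": float(np.mean([mae(r, s) for r, s in pairs])),
    }

# Toy subject: three 256 x 256 slices with a constant synthesis error of 0.05.
rng = np.random.default_rng(0)
reference = 0.9 * rng.random((3, 256, 256))
synthesized = reference + 0.05
scores = subject_scores(reference, synthesized)
```

With a constant error of 0.05 the mean squared error is 0.0025, giving a PSNR of 10·log10(1/0.0025) ≈ 26.02 dB and an MAE of 0.05, values in the same range as those reported in Table 1.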

2.7 Utility evaluation of the synthesized images

Based on the testing T1w brain MRI of MS, utility exploration involved three tasks. The first was lesion detection. Using the TrUE-Net method in FSL (Strain et al., 2024), MS lesions were segmented using T1w and the corresponding FLAIR images, where lesion probabilities below 0.3 were excluded. Segmentation accuracy was assessed using the Dice coefficient and 95th-percentile Hausdorff distance (HD95) for voxel-level overlap of lesions. Lesion count accuracy was measured using the lesion-wise F1 score and the lesion count agreement score. Furthermore, lesion size distribution and detection failure modes were presented for additional understanding of image synthesis methods. The latter included both the failure lesion number and the lesion count variant ratio between synthesized and source images.
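The voxel-level overlap measure can be sketched as follows: lesion probability maps are thresholded at 0.3 and the resulting masks are compared with the Dice coefficient. This is a minimal NumPy illustration with made-up probability values; it does not reproduce the TrUE-Net pipeline, and the function names are hypothetical.

```python
import numpy as np

def threshold_lesions(probability_map, threshold=0.3):
    """Keep voxels whose lesion probability is at least the threshold."""
    return np.asarray(probability_map) >= threshold

def dice_coefficient(mask_a, mask_b):
    """Dice overlap between two binary masks (1.0 means identical masks)."""
    mask_a = np.asarray(mask_a, dtype=bool)
    mask_b = np.asarray(mask_b, dtype=bool)
    total = mask_a.sum() + mask_b.sum()
    if total == 0:
        return 1.0  # both masks empty: treated here as perfect agreement
    return float(2.0 * np.logical_and(mask_a, mask_b).sum() / total)

# Toy probability maps from source-based and synthesized-based segmentation.
source_prob = np.array([[0.9, 0.8, 0.1], [0.4, 0.2, 0.0]])
synth_prob = np.array([[0.7, 0.2, 0.1], [0.5, 0.6, 0.0]])
score = dice_coefficient(threshold_lesions(source_prob), threshold_lesions(synth_prob))
```

In this toy case each mask contains three lesion voxels and two of them coincide, so the Dice coefficient is 2·2/(3 + 3) ≈ 0.667.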

The second task was brain volume measurement using T1w MRI following lesion filling as a standard practice, done using the SynthSeg method from FreeSurfer 8.0 (Harvard Medical School, USA). This assessment considered the following four measures: total intracranial volume (ICV), volumes of cortical gray matter, white matter, and cerebrospinal fluid (CSF).

The third was texture analysis around MS lesions using an optimized statistical approach, gray level co-occurrence matrix (GLCM) (Hosseinpour et al., 2022). Using its Scikit implementation in Python, two tested parameters were computed: texture contrast and texture dissimilarity. Both metrics detected subtle structural changes in MS using histology-informed brain MRI (Hosseinpour et al., 2022). Lesion texture was then averaged per subject.
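The two GLCM features can be written out directly. The sketch below builds a symmetric, normalized co-occurrence matrix for one pixel offset and computes contrast as Σ P(i, j)(i − j)² and dissimilarity as Σ P(i, j)|i − j|. It is a plain NumPy illustration; the offset, the number of gray levels, and the function names are assumptions, and the study's own implementation (possibly scikit-image's `graycomatrix` and `graycoprops`) and its optimized parameters are not specified here.

```python
import numpy as np

def glcm(image, levels, offset=(0, 1)):
    """Symmetric, normalized gray level co-occurrence matrix for one offset."""
    image = np.asarray(image)
    dr, dc = offset
    rows, cols = image.shape
    # Pair each pixel with its neighbor at the given (row, column) offset.
    first = image[max(0, -dr):rows - max(0, dr), max(0, -dc):cols - max(0, dc)]
    second = image[max(0, dr):rows - max(0, -dr), max(0, dc):cols - max(0, -dc)]
    counts = np.zeros((levels, levels), dtype=np.float64)
    np.add.at(counts, (first.ravel(), second.ravel()), 1.0)
    counts += counts.T  # count each pair in both directions
    return counts / counts.sum()

def texture_features(p):
    """Return (contrast, dissimilarity) of a normalized co-occurrence matrix."""
    i, j = np.indices(p.shape)
    contrast = float(np.sum(p * (i - j) ** 2))
    dissimilarity = float(np.sum(p * np.abs(i - j)))
    return contrast, dissimilarity

# Toy 4 x 4 patch with four gray levels (integer-valued, as GLCM requires).
patch = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 2, 2, 2],
                  [2, 2, 3, 3]])
contrast, dissimilarity = texture_features(glcm(patch, levels=4))
```

For this patch there are 12 horizontal neighbor pairs, giving a contrast of 7/12 ≈ 0.583 and a dissimilarity of 5/12 ≈ 0.417. Higher values of either feature correspond to larger gray-level differences between neighboring pixels, that is, a more heterogeneous texture.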

2.8 Statistical analysis

Comparisons of results across imaging sources were performed using one-way ANOVA when the normality assumption was satisfied, followed by Tukey correction for multiple comparisons; Friedman’s test was used otherwise. Statistical analyses used the GraphPad Prism package (version 10.4.1, Boston, Massachusetts, USA) and the SciPy package in Python, with p ≤ 0.05 set as the significance threshold. Comparison between CycleGAN modeling results across datasets was not performed, given their inherent differences.

3 Results

3.1 Comparable outcomes between synthesized and source images

Qualitatively (Figure 3), anatomical integrity, including ventricular morphology and contrast, was well preserved in images generated by both the CycleGAN-based and Pix2Pix models. Lesion size distributions in synthesized images from all models were largely similar to those in source images, with a minor exception for relatively small lesions (Supplementary Figure 2). Spectral normalization in CycleGAN appeared to decrease image sharpness in experiments with the HCP testing cohort, as indicated by lower Laplacian energy and higher gradient MAE (Supplementary Table 1). Quantitative results supported the overall fidelity of synthesized images from each model, with Pix2Pix performing slightly better than CycleGAN models in certain outputs.

Figure 3. Representative ground truth and synthesized brain MRI of MS based on different methods. CycleGAN models with or without spectral normalization (SN) both replicated lesion areas well in T1-weighted (hypointensity, red arrows) and T2-weighted (hyperintensity, yellow arrows) images, similar to ground truth. The Pix2Pix-generated images show similar findings with slightly higher contrast than the source in a few lesion areas.

3.1.1 HCP dataset

In synthesizing either T1w or T2w images, CycleGAN with or without spectral normalization achieved high PSNR and SSIM and low MAE; Pix2Pix improved on these values, though the effect sizes varied (Tables 1, 2; Figure 4). Specifically, the average values of CycleGAN images ranged from 26.190 to 28.290 versus 28.520 and 31.100 for Pix2Pix in PSNR, 0.879–0.901 versus 0.933 and 0.938 in SSIM, and 0.014–0.029 versus 0.010 and 0.024 in MAE. In addition, CycleGAN without spectral normalization performed better than CycleGAN with it in both image synthesis directions, with PSNR and SSIM higher by 2.64–3.19% and 1.63–2.09%, and MAE lower by 4.76–8.30% (all p ≤ 0.001).

| Dataset | Translation | Metric | CycleGAN | CycleGAN + SN | Pix2Pix |
|---|---|---|---|---|---|
| HCP (N = 112) | T2w to T1w | PSNR | 26.880 [26.40, 27.37] | 26.190 [25.75, 26.63] | 28.520 [27.85, 29.19] |
| | | SSIM | 0.901 [0.897, 0.904] | 0.886 [0.883, 0.890] | 0.933 [0.930, 0.937] |
| | | MAE | 0.028 [0.025, 0.030] | 0.029 [0.027, 0.031] | 0.024 [0.021, 0.026] |
| | T1w to T2w | PSNR | 28.290 [28.13, 28.44] | 27.410 [27.26, 27.56] | 31.100 [30.19, 31.29] |
| | | SSIM | 0.898 [0.895, 0.900] | 0.879 [0.877, 0.882] | 0.938 [0.936, 0.940] |
| | | MAE | 0.014 [0.014, 0.015] | 0.016 [0.015, 0.016] | 0.010 [0.010, 0.011] |
| PPMI (N = 32) | T2w to T1w | PSNR | 24.860 [24.37, 25.35] | 25.330 [24.80, 25.85] | 29.220 [28.41, 30.30] |
| | | SSIM | 0.838 [0.832, 0.845] | 0.860 [0.852, 0.868] | 0.924 [0.919, 0.930] |
| | | MAE | 0.028 [0.027, 0.030] | 0.029 [0.027, 0.031] | 0.019 [0.016, 0.021] |
| | T1w to T2w | PSNR | 26.190 [25.77, 26.61] | 27.130 [26.71, 27.54] | 31.280 [30.58, 31.98] |
| | | SSIM | 0.866 [0.860, 0.873] | 0.888 [0.881, 0.896] | 0.943 [0.937, 0.948] |
| | | MAE | 0.020 [0.019, 0.022] | 0.018 [0.017, 0.019] | 0.011 [0.010, 0.012] |
| MS (N = 11) | T2w to T1w | PSNR | 28.570 [26.69, 30.46] | 27.230 [26.00, 28.46] | 28.260 [26.99, 29.53] |
| | | SSIM | 0.899 [0.882, 0.915] | 0.892 [0.881, 0.903] | 0.917 [0.904, 0.929] |
| | | MAE | 0.025 [0.020, 0.030] | 0.029 [0.023, 0.030] | 0.025 [0.021, 0.029] |
| | T1w to T2w | PSNR | 26.99 [26.49, 27.49] | 26.510 [25.92, 27.11] | 28.010 [27.13, 28.89] |
| | | SSIM | 0.884 [0.878, 0.890] | 0.873 [0.859, 0.886] | 0.914 [0.906, 0.921] |
| | | MAE | 0.019 [0.017, 0.020] | 0.019 [0.018, 0.021] | 0.017 [0.014, 0.022] |

Table 1. Mean (95% confidence interval) testing metrics for the CycleGAN-based and Pix2Pix models on reciprocal synthesis between T1- and T2-weighted brain MRIs across datasets.

SN, spectral normalization; PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; MAE, mean absolute error.

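| Dataset | Modality | Metric | F | p (ANOVA) | Cohen’s d (GAN-SN) | Cohen’s d (SN-Pix2Pix) | Cohen’s d (GAN-Pix2Pix) |
|---|---|---|---|---|---|---|---|
| HCP | T1w | PSNR | 19.36 | <0.001 | 0.79 | −1.32 | −0.92 |
| | | SSIM | 187.42 | <0.001 | 2.37 | −4.2 | −3.3 |
| | | MAE | 6.30 | 0.0021 | −0.35 | 0.84 | 0.6 |
| | T2w | PSNR | 534.53 | <0.001 | 2.72 | 5.35 | 5.13 |
| | | SSIM | 631.56 | <0.001 | 3.73 | 6.97 | 6.62 |
| | | MAE | 241.31 | <0.001 | −1.69 | −3.83 | −3.5 |
| PPMI | T1w | PSNR | 61.18 | <0.001 | −0.63 | −1.97 | −2.39 |
| | | SSIM | 181.04 | <0.001 | −2.75 | −2.98 | −4.76 |
| | | MAE | 32.77 | <0.001 | −0.11 | 1.65 | 1.76 |
| | T2w | PSNR | 109.67 | <0.001 | −1.47 | 2.66 | 2.88 |
| | | SSIM | 148.35 | <0.001 | −2.68 | 3.36 | 4.7 |
| | | MAE | 79.89 | <0.001 | 1.09 | −2.7 | −2.99 |
| MS | T1w | PSNR | 1.10 | 0.35 | 0.67 | −0.57 | 0.15 |
| | | SSIM | 4.48 | 0.02 | 0.4 | −2.03 | −1.09 |
| | | MAE | 1.11 | 0.34 | −0.51 | 0.47 | −0.04 |
| | T2w | PSNR | 6.26 | 0.0054 | 1.02 | 1.77 | 1.41 |
| | | SSIM | 24.19 | <0.001 | 0.81 | 3.15 | 5.13 |
| | | MAE | 0.58 | 0.57 | −0.56 | −0.41 | −0.21 |

Table 2. Comparison of testing metrics between the two CycleGAN and Pix2Pix models in reciprocal synthesis between T1- and T2-weighted brain MRI across datasets.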

GAN, CycleGAN; SN, spectral normalization (along with CycleGAN); PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; MAE, mean absolute error.
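As a reminder of how an effect size of this kind is typically obtained, the sketch below computes Cohen's d with a pooled standard deviation. The variant used for Table 2 (pooled, paired, or otherwise) is not stated, so this is an assumption, and the sample values are invented for illustration only.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d (group_a minus group_b) using the pooled sample standard deviation."""
    a = np.asarray(group_a, dtype=np.float64)
    b = np.asarray(group_b, dtype=np.float64)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))

# Invented per-subject scores for two hypothetical models.
model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
model_b = [2.0, 3.0, 4.0, 5.0, 6.0]
d = cohens_d(model_a, model_b)  # (3 - 4) / sqrt(2.5), about -0.632
```

Under this convention a negative value means the first-named model scored lower on the metric than the second, so the sign must be read together with whether higher (PSNR, SSIM) or lower (MAE) is better.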

Figure 4. Comparison of CycleGAN models and Pix2Pix by evaluation metric and image synthesis direction. Error bars represent standard deviation. The stars represent significance values: *p ≤ 0.05, **p ≤ 0.01, ***p ≤ 0.001. PSNR, peak signal-to-noise ratio; SSIM, structural similarity index measure; MAE, mean absolute error; SN, spectral normalization.

3.1.2 PPMI dataset

All CycleGAN images showed competitively high PSNR and SSIM, and low MAE, which also showed improvement in Pix2Pix with different effect sizes (Tables 1, 2; Figure 4). The average values of CycleGANs ranged from 24.860 to 27.130 versus 29.200 and 31.280 of Pix2Pix in PSNR, 0.838–0.888 versus 0.924 and 0.943 for SSIM, and 0.018–0.029 versus 0.011 and 0.019 for MAE. Between models, CycleGAN with spectral normalization showed increased PSNR by 1.89–3.57% and SSIM by 2.56–3.63%, and decreased MAE by 11.02% compared with vanilla CycleGAN, with all p ≤ 0.001 except MAE, where p ≤ 0.05.

3.1.3 Local MS dataset

Similarly, both PSNR and SSIM were high, and MAE was low with CycleGAN models, with Pix2Pix improving on them in select comparisons (Tables 1, 2; Figure 4). The average values of CycleGAN images ranged from 26.510 to 28.570 versus 28.260 and 28.010 for Pix2Pix in PSNR, 0.873–0.899 versus 0.917 and 0.914 in SSIM, and 0.019–0.029 versus 0.017 and 0.025 in MAE. Adding spectral normalization showed no positive impact on CycleGAN outputs, except for a mild improvement in PSNR at T1w to T2w translation.

3.2 Equivalent utility between CycleGAN-synthesized and source T1w MRI in MS

In lesion detection, all CycleGAN and Pix2Pix T1w images achieved high Dice and low HD95 scores, with no significant differences between these models (mean Dice > 0.7 and mean HD95 < 10; Figure 5). In lesion count, the lesion-wise F1 ranged from 0.638 to 0.734 and the lesion count agreement score from 0.664 to 0.756 across the CycleGAN and Pix2Pix models (Table 3). The lesion failure rate was zero for all models (0/11), and the count variance ratio ranged from 0.76 to 1.65, indicating high fidelity.

Figure 5. Comparison of lesion detectability between imaging types in multiple sclerosis (MS). The plots show the mean and standard deviation of Dice and HD95 scores. *p < 0.05; **p < 0.01; and SN, spectral normalization.

| Metric | CycleGAN | CycleGAN + SN | Pix2Pix |
|---|---|---|---|
| Lesion-wise F1 | 0.698 | 0.638 | 0.734 |
| Lesion count agreement score | 0.664 | 0.743 | 0.756 |
| Failure lesion rate (n/N) | 0/11 | 0/11 | 0/11 |
| Count variance ratio | 1.24 | 1.65 | 0.76 |

Table 3. Lesion count accuracy and failure rates.

n refers to the number of lesions that failed to be detected, and N refers to the total number of lesions present in the source images.

In brain volume measurement, CycleGAN T1w from either version was source-equivalent in cortical gray matter and CSF volumes (p > 0.05), but lower than source in ICV and white matter (|Δ| ≈ 1.0–1.5%; p < 0.01). Pix2Pix T1w volumes showed a similar trend, except for a larger CSF volume than the source (p ≤ 0.01; Figure 6).

FRONTIERS IN NEUROINFORMATICS
