Generative AI for developing foundation models in radiology and imaging: engineering perspectives

Chen RJ, Lu MY, Chen TY, Williamson DF, Mahmood F. Synthetic data in machine learning for medicine and healthcare. Nat Biomed Eng. 2021;5(6):493–7.

Barr AA, Quan J, Guo E, Sezgin E. Large language models generating synthetic clinical datasets: a feasibility and comparative analysis with real-world perioperative data. Front Artif Intell. 2025;8:1533508.

Dehaene-Lambertz G, Spelke ES. The infancy of the human brain. Neuron. 2015;88(1):93–109.

Xu F, Tenenbaum JB. Word learning as Bayesian inference. Psychol Rev. 2007;114(2):245.

Rosenblatt F. The perceptron: a probabilistic model for information storage and organization in the brain. Psychol Rev. 1958;65(6):386.

He K, Chen X, Xie S, Li Y, Dollár P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. pp. 16000–16009.

Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.

Bommasani R, et al. On the opportunities and risks of foundation models. 2021. arXiv preprint arXiv:2108.07258.

Wang Z, Wu Z, Agarwal D, Sun J. MedCLIP: contrastive learning from unpaired medical images and text. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. 2022:3876–87. https://doi.org/10.18653/v1/2022.emnlp-main.256.

Zhou Y, et al. A foundation model for generalizable disease detection from retinal images. Nature. 2023;622(7981):156–63.

Bannur S, et al. Learning to exploit temporal structure for biomedical vision-language processing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2023. pp. 15016–15027.

Bai F, Du Y, Huang T, Meng MQ, Zhao B. M3D: advancing 3D medical image analysis with multi-modal large language models. 2024. arXiv preprint arXiv:2404.00578.

Google. Gemini Diffusion. https://deepmind.google/models/gemini-diffusion/. Accessed 29 June 2025.

Tian K, Jiang Y, Yuan Z, Peng B, Wang L. Visual autoregressive modeling: scalable image generation via next-scale prediction. Adv Neural Inf Process Syst. 2024;37:84839–65.

Khosravi B, et al. Synthetically enhanced: unveiling synthetic data’s potential in medical imaging research. EBioMedicine. 2024. https://doi.org/10.1016/j.ebiom.2024.105174.

NVIDIA. Addressing medical imaging limitations with synthetic data generation. https://developer.nvidia.com/blog/addressing-medical-imaging-limitations-with-synthetic-data-generation/. Accessed 29 June 2025.

Chen Z, Pekis A, Brown K. Advancing high resolution vision-language models in biomedicine. 2024. arXiv preprint arXiv:2406.09454.

Razzhigaev A, et al. Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion. 2023. arXiv preprint arXiv:2310.03502.

Goodfellow IJ, et al. Generative adversarial nets. Adv Neural Inf Process Syst. 2014;27:2672–80.

Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. 2015. arXiv preprint arXiv:1511.06434.

Van den Oord A, Kalchbrenner N, Kavukcuoglu K. Pixel recurrent neural networks. In: International conference on machine learning. 2016. pp. 1747–1756. PMLR.

Kingma DP, Welling M. Auto-encoding variational Bayes. 2013. arXiv preprint arXiv:1312.6114.

Rezende D, Mohamed S. Variational inference with normalizing flows. In: International conference on machine learning. 2015. pp. 1530–1538. PMLR.

Sohl-Dickstein J, Weiss E, Maheswaranathan N, Ganguli S. Deep unsupervised learning using nonequilibrium thermodynamics. In: International conference on machine learning. 2015. pp. 2256–2265. PMLR.

Ho J, Jain A, Abbeel P. Denoising diffusion probabilistic models. Adv Neural Inf Process Syst. 2020;33:6840–51.

Salimans T, Karpathy A, Chen X, Kingma DP. PixelCNN++: improving the PixelCNN with discretized logistic mixture likelihood and other modifications. 2017. arXiv preprint arXiv:1701.05517.

Razavi A, Van den Oord A, Vinyals O. Generating diverse high-fidelity images with VQ-VAE-2. Adv Neural Inf Process Syst. 2019;32.

Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022. pp. 10684–10695.

OpenAI. GPT Image 1. https://platform.openai.com/docs/models/gpt-image-1. Accessed 29 June 2025.

Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International conference on machine learning. 2020. pp. 1597–1607. PMLR.

He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020. pp. 9729–9738.

Dosovitskiy A, et al. An image is worth 16x16 words: transformers for image recognition at scale. 2020. arXiv preprint arXiv:2010.11929.

Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805.

Radford A, et al. Learning transferable visual models from natural language supervision. In: International conference on machine learning. 2021. pp. 8748–8763. PMLR.

Liang Z, et al. A survey of multimodal large language models. In: Proceedings of the 3rd international conference on computer, artificial intelligence and control engineering. 2024. pp. 405–409.

Li J, Li D, Savarese S, Hoi S. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: International conference on machine learning. 2023. pp. 19730–19742. PMLR.

Liu H, Li C, Wu Q, Lee YJ. Visual instruction tuning. Adv Neural Inf Process Syst. 2023;36:34892–916.

Alayrac J-B, et al. Flamingo: a visual language model for few-shot learning. Adv Neural Inf Process Syst. 2022;35:23716–36.

Chen X, et al. PaLI: a jointly-scaled multilingual language-image model. 2022. arXiv preprint arXiv:2209.06794.

Tiu E, Talius E, Patel P, Langlotz CP, Ng AY, Rajpurkar P. Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nat Biomed Eng. 2022;6(12):1399–406.

Azizi S, et al. Big self-supervised models advance medical image classification. In: Proceedings of the IEEE/CVF international conference on computer vision. 2021. pp. 3478–3488.

Zhou Z, Sodha V, Pang J, Gotway MB, Liang J. Models genesis. Med Image Anal. 2021;67:101840.

Kyung S, et al. Generative adversarial network with robust discriminator through multi-task learning for low-dose CT denoising. IEEE Trans Med Imaging. 2024.

Armanious K, et al. MedGAN: medical image translation using GANs. Comput Med Imaging Graph. 2020;79:101684.

Lee S, et al. Emergency triage of brain computed tomography via anomaly detection with a deep generative model. Nat Commun. 2022;13(1):4251.

Baur C, Denner S, Wiestler B, Navab N, Albarqouni S. Autoencoders for unsupervised anomaly segmentation in brain MR images: a comparative study. Med Image Anal. 2021;69:101952.

Schlegl T, Seeböck P, Waldstein SM, Langs G, Schmidt-Erfurth U. f-AnoGAN: fast unsupervised anomaly detection with generative adversarial networks. Med Image Anal. 2019;54:30–44.

Hamamci IE, et al. Developing generalist foundation models from a multimodal dataset for 3D computed tomography. 2024. arXiv preprint arXiv:2403.17834.

Ma J, He Y, Li F, Han L, You C, Wang B. Segment anything in medical images. Nat Commun. 2024;15(1):654.

Johnson AE, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6(1):317.

Lau JJ, Gayen S, Abacha AB, Demner-Fushman D. A dataset of clinically generated visual questions and answers about radiology images. Sci Data. 2018;5(1):1–10.

Zhang X, et al. PMC-VQA: visual instruction tuning for medical visual question answering. 2023. arXiv preprint arXiv:2305.10415.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016. pp. 770–778.

Dosovitskiy A, et al. An image is worth 16x16 words: transformers for image recognition at scale. 2020. arXiv preprint arXiv:2010.11929.

Chen Z, et al. CheXagent: towards a foundation model for chest X-ray interpretation. 2024. arXiv preprint arXiv:2401.12208.

Saab K, et al. Capabilities of Gemini models in medicine. 2024. arXiv preprint arXiv:2404.18416.

Xu S, et al. ELIXR: towards a general purpose X-ray artificial intelligence system through alignment of large language models and radiology vision encoders. 2023. arXiv preprint arXiv:2308.01317.

Li C, et al. LLaVA-Med: training a large language-and-vision assistant for biomedicine in one day. Adv Neural Inf Process Syst. 2023;36:28541–64.

Singhal K, et al. Toward expert-level medical question answering with large language models. Nat Med. 2025. https://doi.org/10.1038/s41591-024-03423-7.

Moor M, et al. Med-Flamingo: a multimodal medical few-shot learner. In: Machine learning for health (ML4H). 2023. pp. 353–367. PMLR.

Thawkar O, et al. XrayGPT: chest radiographs summarization using medical vision-language models. 2023. arXiv preprint arXiv:2306.07971.

Zhang Y, Jiang H, Miura Y, Manning CD, Langlotz CP. Contrastive learning of medical visual representations from paired images and text. In: Machine learning for healthcare conference. 2022. pp. 2–25. PMLR.

Zhang S, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. 2023. arXiv preprint arXiv:2303.00915.

Zhang X, et al. RadGenome-Chest CT: a grounded vision-language dataset for chest CT analysis. 2024. arXiv preprint arXiv:2404.16754.

Wu C, Zhang X, Zhang Y, Wang Y, Xie W. Towards generalist foundation model for radiology by leveraging web-scale 2D and 3D medical data. 2023. arXiv preprint arXiv:2308.02463.

Chen Q, Hong Y. MedBLIP: bootstrapping language-image pre-training from 3D medical images and texts. In: Proceedings of the Asian conference on computer vision. 2024. pp. 2404–2420.

Hamamci IE, Er S, Menze B. CT2Rep: automated radiology report generation for 3D medical imaging. In: International conference on medical image computing and computer-assisted intervention. Cham: Springer Nature Switzerland; 2024. pp. 476–486.

McCandless D, et al. Major large language models (LLMs) ranked by capabilities, sized by billion parameters used for training. https://informationisbeautiful.net/visualizations/the-rise-of-generative-ai-large-language-models-llms-like-chatgpt. Accessed 30 June 2025.

Huang W, et al. Enhancing representation in radiography-reports foundation model: a granular alignment algorithm using masked contrastive learning. Nat Commun. 2024;15(1):7620.

Dhariwal P, Nichol A. Diffusion models beat GANs on image synthesis. Adv Neural Inf Process Syst. 2021;34:8780–94.

Xiao Z, Kreis K, Vahdat A. Tackling the generative learning trilemma with denoising diffusion GANs. 2021. arXiv preprint arXiv:2112.07804.

U.S. Department of Health and Human Services. HIPAA for Professionals. https://www.hhs.gov/hipaa/index.html. Accessed 30 Jan 2025.

European Union. General data protection regulation (GDPR). https://gdpr-info.eu/. Accessed 30 Jan 2025.

Dayan I, et al. Federated learning for predicting clinical outcomes in patients with COVID-19. Nat Med. 2021;27(10):1735–43.

Lee EH, et al. An international study presenting a federated learning AI platform for pediatric brain tumors. Nat Commun. 2024;15(1):7615.

Almufareh MF, Tariq N, Humayun M, Almas B. A federated learning approach to breast cancer prediction in a collaborative learning framework. Healthcare. 2023;11(24):3185.

McMahan B, Moore E, Ramage D, Hampson S, y Arcas BA. Communication-efficient learning of deep networks from decentralized data. In: Artificial intelligence and statistics. 2017. pp. 1273–1282. PMLR.

Park S, Kim G, Kim J, Kim B, Ye JC. Federated split task-agnostic vision transformer for COVID-19 CXR diagnosis. Adv Neural Inf Process Syst. 2021;34:24617–30.

Park S, Ye JC. Multi-task distributed learning using vision transformer with random patch permutation. IEEE Trans Med Imaging. 2022;42(7):2091–105.

Koike Y, Nakagawa T, Waida H, Kanamori T. Scaling-based data augmentation for generative models and its theoretical extension. 2024. arXiv preprint arXiv:2410.20780.

Xu Z, Jain S, Kankanhalli M. Hallucination is inevitable: an innate limitation of large language models. 2024. arXiv preprint arXiv:2401.11817.

Chu YW, Zhang K, Malon C, Min MR. Reducing hallucinations of medical multimodal large language models with visual retrieval-augmented generation. 2025. arXiv preprint arXiv:2502.15040.

Acosta JN, Falcone GJ, Rajpurkar P, Topol EJ. Multimodal biomedical AI. Nat Med. 2022;28(9):1773–84.
