Generating synthetic data in digital pathology through diffusion models: a multifaceted approach to evaluation

Abstract

Synthetic data has recently emerged as a valuable addition to the computational pathologist’s toolbox, supporting tasks such as mitigating data scarcity and augmenting training sets in deep learning. Nonetheless, such novel resources require carefully planned construction and evaluation to avoid pitfalls such as the generation of clinically meaningless artifacts.

The main contribution of this manuscript is a novel full-stack pipeline for the generation and evaluation of synthetic pathology data powered by a diffusion model. The workflow is characterized by a new multifaceted evaluation strategy with an embedded explainability procedure, tackling two critical aspects of the use of synthetic data in health-related domains.

An ensemble-like strategy is adopted for the evaluation of the produced data, with the threefold aim of assessing the similarity of real and synthetic data through a set of well-established metrics, evaluating the practical usability of the generated images in deep learning models complemented by explainable AI methods, and validating their histopathological realism through a dedicated questionnaire answered by three professional pathologists.
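As an illustration of the first aim, the sketch below computes a Fréchet distance between two sets of feature vectors, a measure commonly used to compare real and synthetic image distributions. The abstract does not state which metrics or feature extractor the pipeline uses, so the choice of metric, the function name `frechet_distance`, and the random stand-in features are all assumptions made for this example.

```python
import numpy as np


def frechet_distance(real_feats, synth_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Both inputs are arrays of shape (n_samples, n_features); in practice the
    features would come from an image encoder (not specified in the source).
    """
    real_feats = np.asarray(real_feats, dtype=np.float64)
    synth_feats = np.asarray(synth_feats, dtype=np.float64)

    mu_r = real_feats.mean(axis=0)
    mu_s = synth_feats.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real_feats, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synth_feats, rowvar=False))

    # tr((C_r C_s)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of C_r C_s, which are real and non-negative for
    # covariance matrices (clipping removes small numerical noise).
    eigvals = np.linalg.eigvals(cov_r @ cov_s)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_s
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_s) - 2.0 * tr_sqrt)


if __name__ == "__main__":
    # Hypothetical stand-in features; real use would embed image tiles first.
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 8))
    synth = rng.normal(loc=0.1, size=(500, 8))
    print(f"Frechet distance: {frechet_distance(real, synth):.4f}")
```

A lower value indicates that the two feature distributions are closer. A single score of this kind says nothing about downstream usability or histopathological realism, which is why the strategy described above combines it with the other two aims.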

The pipeline is demonstrated on the public GTEx dataset of 650 whole slide images (WSIs) covering five different tissues, which condition the training step of the underlying diffusion model. An equal number of tiles is then generated for each of these five tissues. Finally, the reliability of the generated data is assessed using the proposed evaluation pipeline, with encouraging results. We show that each of these evaluation steps is necessary, as they provide complementary information on the quality of the generated data.

Overall, these features characterize the proposed workflow as a fully fledged solution for generative AI in digital pathology, representing a potentially useful tool for the digital pathology community in its transition towards digitalization and data-driven modeling.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

The study was partially funded under the National Plan for Complementary Investments to the NRRP, project D34H Digital Driven Diagnostics, prognostics and therapeutics for sustainable Health care (project code: PNC0000001), Spoke 2: Multilayer platform to support the generation of the Patients Digital Twin, CUP: B53C22006170001, funded by the Italian Ministry of University and Research.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The study used ONLY openly available human data that were originally located at: https://gtexportal.org/home/

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes
