Evaluation of Gender Bias in the Evaluation of Synthetic Cardiovascular Disease Cases with Open Source LLMs

Abstract

Objective: To systematically evaluate gender bias in open-source large language models (LLMs) for cardiovascular diagnostic decision-making using controlled synthetic case vignettes.

Methods: We generated 500 synthetic cardiovascular cases with randomly assigned gender (male/female) and age (45-80 years), keeping all other clinical variables identical. Two structured prompts simulated sequential stages of cardiovascular evaluation: initial presentation with chest discomfort and post-stress-test evaluation. Three open-source LLMs (Gemma-2b, Phi, and TinyLLaMA) were evaluated through a local Ollama API. Primary outcomes were coronary artery disease (CAD) likelihood ratings (low/intermediate/high), diagnostic certainty (low/intermediate/high), and test usefulness scores (1-10 scale).

Results: Evaluation of 1,500 model responses (500 cases x 3 models) revealed minimal gender-related differences. Only one comparison was nominally significant: Gemma-2b assigned higher diagnostic certainty to female patients in initial presentations (58% vs. 48%, unadjusted p=0.031), and this difference did not remain significant after multiple-comparison adjustment (adjusted p=0.092). No gender-based difference reached significance after adjustment. Effect sizes were consistently small across all comparisons (Cohen's h: 0.01-0.18; Cliff's delta: -0.11 to 0.12). Inter-model variability was substantial: Gemma-2b and Phi showed assertive diagnostic patterns, while TinyLLaMA showed conservative tendencies. Parsing quality exceeded 95% for all models.

Conclusions: Open-source LLMs produced largely gender-neutral outputs in controlled cardiovascular scenarios, in contrast with documented biases in human clinicians and commercial LLMs. The isolated gender effect in Gemma-2b was modest and not clinically significant. More concerning was the substantial inter-model variability in diagnostic confidence and test recommendations, which underscores the importance of rigorous model benchmarking before clinical deployment. These preliminary findings suggest that open-source LLMs may offer advantages for equitable healthcare applications, but broader validation across diverse clinical contexts and real-world constraints remains essential.
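For readers unfamiliar with the two effect-size measures named in the Results, the following is a minimal sketch of their standard textbook definitions, not the authors' analysis code. Cohen's h compares two proportions after an arcsine transform, and Cliff's delta compares two groups of ordinal scores by counting cross-group pairs. The function names and the toy score lists are hypothetical; only the 58% and 48% proportions come from the abstract, and they are rounded.

```python
import math
from itertools import product


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def cliffs_delta(xs: list, ys: list) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for x, y in product(xs, ys) if x > y)
    less = sum(1 for x, y in product(xs, ys) if x < y)
    return (greater - less) / (len(xs) * len(ys))


# Rounded proportions reported in the abstract (female vs. male).
h = cohens_h(0.58, 0.48)

# Hypothetical 1-10 test-usefulness scores, for illustration only.
female_scores = [7, 8, 6, 9]
male_scores = [7, 7, 6, 8]
delta = cliffs_delta(female_scores, male_scores)

print(f"Cohen's h = {h:.3f}, Cliff's delta = {delta:.3f}")
```

Both measures are bounded and unit-free, so they can be compared across models and outcomes: Cliff's delta ranges from -1 to 1, and Cohen's h from -pi to pi, with 0 indicating no difference in either case.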

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

All data produced in the present study are available upon reasonable request to the authors.

Medrxiv - Health Informatics
