How Good Are Large Language Models at Supporting Frontline Healthcare Workers in Low-Resource Settings: A Benchmarking Study & Dataset

Abstract

Large language models (LLMs) have demonstrated strong performance in medical contexts; however, existing benchmarks often fail to reflect the real-world complexity of low-resource health systems. This study developed a dataset of 5,609 clinical questions contributed by 101 community health workers (CHWs) across four Rwandan districts and compared responses generated by five LLMs (Gemini-2, GPT-4o, o3-mini, DeepSeek R1, and Meditron-70B) with those from local clinicians. A subset of 524 question-answer pairs was evaluated using a rubric of 11 expert-rated metrics, each scored on a five-point Likert scale. Gemini-2 and GPT-4o performed best, with mean scores of 4.49 and 4.48 out of 5, respectively, across all 11 metrics. All LLMs significantly outperformed local clinicians on every metric (all p < 0.001); Gemini-2, for example, surpassed local GPs by an average of 0.83 points per metric (range: 0.38 to 1.10). Although performance degraded slightly when the LLMs communicated in Kinyarwanda, they remained superior to clinicians and were over 500 times cheaper per response. These findings support the potential of LLMs to strengthen frontline care quality in low-resource, multilingual health systems.
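The abstract summarizes the comparison as an average per-metric gap together with its range. As a rough illustration of that arithmetic only, the following Python sketch computes such a summary from Likert ratings. It is not the authors' analysis code: the metric names, the ratings, and the helper functions `per_metric_gap` and `summarize` are invented for the example, and the significance testing reported in the study is not reproduced here.

```python
from statistics import mean

# Hypothetical toy ratings (NOT study data): metric -> list of 1-5 Likert scores.
llm_ratings = {
    "accuracy": [5, 4, 5, 4],
    "clarity": [4, 4, 5, 5],
}
clinician_ratings = {
    "accuracy": [4, 3, 4, 3],
    "clarity": [4, 3, 4, 4],
}


def per_metric_gap(a, b):
    """Return {metric: mean(a[metric]) - mean(b[metric])} for metrics in both."""
    return {m: mean(a[m]) - mean(b[m]) for m in a if m in b}


def summarize(gaps):
    """Return the average gap across metrics and the smallest and largest gap."""
    values = list(gaps.values())
    return mean(values), min(values), max(values)


gaps = per_metric_gap(llm_ratings, clinician_ratings)
avg_gap, min_gap, max_gap = summarize(gaps)
print(f"average gap {avg_gap:.2f} (range: {min_gap:.2f} to {max_gap:.2f})")
```

With real data, each list would hold the expert ratings for one metric and one responder type, and the same two steps (mean per metric, then mean and range across metrics) would yield figures of the kind quoted in the abstract.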

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This research was supported by the Gates Foundation (INV-068056). The funders had no role in the study design, data collection and analysis, the decision to publish, or the preparation of the manuscript.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The IRB of the Rwanda National Ethics Committee waived ethical approval for this work.

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability Statement

The subset of 524 questions, answers, and individual evaluation results that comprise this benchmarking study is available upon reasonable request and will otherwise be made available upon publication of this work in a peer-reviewed journal. The full dataset has been donated to the Rwanda Biomedical Centre (RBC), the parastatal delivery arm of the Rwandan Ministry of Health, and is hosted in a secure data environment. It will be made available to researchers on request and based on an assessment of ‘fair value exchange’ by stakeholders, to ensure that the indigenous population that generated the information benefits from its exploitation. This arrangement was specifically designed to ensure adherence to the CARE principles. The Centre for the Fourth Industrial Revolution, as the innovation lab for the Rwandan Government, serves as the primary point of contact for researchers seeking to access this data. Prospective users should contact ‘infoc4ir.rw’ to request access.

Medrxiv - Primary Care Research
