LLM-based reconstruction of longitudinal clinical trajectories in chronic liver disease.

Abstract

Background & Aims Liver cancer primarily develops in patients with chronic liver disease (CLD), yet most cases are diagnosed at an advanced stage with poor prognosis. While clinical surveillance of patients with CLD generates extensive longitudinal data, its unstructured free-text nature hinders large-scale research. To unlock this real-world evidence, we developed a scalable framework using open-source Large Language Models (LLMs) to transform unstructured clinical text into structured data.

Methods We conducted a multi-stage evaluation of LLM-based extraction from multi-source clinical documentation of liver transplant recipients. A calibration set comprising 507 reports (414 radiology, 65 pathology, and 28 liver transplant assessment reports) from 30 patients was manually annotated to benchmark four open-source LLMs (Llama 3.1 8B, Llama 3.3 70B, OpenBioLLM 70B, DeepSeek R1 8B) against a regular expression baseline across 73 tasks. To ensure structured outputs, we compared constrained decoding (Guidance and Ollama packages) against unconstrained prompting across 5,590 prompt–output pairs. The finalised pipeline was then applied to the full cohort of 835 patients transplanted in our centre over the past decade.

Results Among the models tested, Llama 3.3 70B performed best, exceeding 90% accuracy on 59/73 tasks, outperforming both a medically fine-tuned model (OpenBioLLM 70B) and a smaller variant (Llama 3.1 8B). Constrained decoding achieved >99.9% format adherence, far surpassing unconstrained prompting (87.4%). Applied to the full cohort, the pipeline successfully analysed 22,493 reports to generate 37,125 datapoints (45 variables, 835 patients) without manual annotation. Further analysis confirmed known liver cancer risk factors (male sex, viral hepatitis, smoking, diabetes), and allowed for reconstruction of longitudinal disease timelines.

Conclusions This work provides a scalable blueprint for transforming real-world clinical free-text into structured formats, paving the way for accelerated, data-driven research into complex pre-cancerous diseases like CLD.

Competing Interest Statement

H.P. has received research funding from AstraZeneca. Z.G. has received research funding from GE HealthCare. M.A.F. is a current employee and stockholder of AstraZeneca. M.H. has received speaker fees from Sirtex Medical, consultancy fees from Quotient Therapeutics, Ensocell, Boston Scientific and Spliceor, in addition to unrestricted grant support from AstraZeneca and Pfizer. M.C.O. is a co-founder and employee of 52 North Health Ltd, has received research funding from GE HealthCare, and speaking fees from GSK. The other authors declare no competing interests.

Funding Statement

This study was funded by AstraZeneca UK Limited and the NIHR BioResource (G127831).

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.

Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below:

The Health and Social Care Research Ethics Committee A of the Health and Social Care Business Services Organisation gave ethical approval for this work (REC reference 20/NI/0109, IRAS 285521).

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.

Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).

Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.

Yes

Data Availability

The data underlying this article cannot be published due to confidentiality considerations.

Abbreviations

A1AT: Alpha-1 Antitrypsin
AIH: Autoimmune Hepatitis
ArLD: Alcohol-related Liver Disease
BMI: Body Mass Index
CI: Confidence Interval
CLD: Chronic Liver Disease
CT: Computed Tomography
DM: Diabetes Mellitus
EHR: Electronic Healthcare Records
FNH: Focal Nodular Hyperplasia
GPT: Generative Pre-trained Transformer
HCC: Hepatocellular Carcinoma
HCV: Hepatitis C Virus
LIRADS: Liver Imaging Reporting and Data System
LLM: Large Language Model
LR: Liver Imaging Reporting and Data System (LIRADS) Grade
LTA: Liver Transplant Assessment
LVI: Large Vessel Invasion
MASH: Metabolic-associated Steatohepatitis
MASLD: Metabolic-associated Steatotic Liver Disease
MRI: Magnetic Resonance Imaging
MVI: Microvascular Invasion
NAFLD: Non-alcoholic Fatty Liver Disease
NASH: Non-alcoholic Steatohepatitis
NLP: Natural Language Processing
OLT: Orthotopic Liver Transplant
PATH: Pathological examination of the explanted liver
PBC: Primary Biliary Cholangitis
PET: Positron Emission Tomography
PSC: Primary Sclerosing Cholangitis
Regex: Regular Expression
RFA: Radiofrequency Ablation
SD: Standard Deviation
T1DM: Type 1 Diabetes Mellitus
T2DM: Type 2 Diabetes Mellitus
TACE: Transarterial Chemoembolisation
TAE: Transarterial Embolisation
TIV: Tumour in Vein
TNM: Tumour, Node, Metastasis (Staging system)
UKELD: United Kingdom Model for End-Stage Liver Disease
US: Ultrasound
VTT: Venous Tumour Thrombus
