Our study demonstrates that the new Llama models can extract structured lymphoma disease data from radiology reports with high accuracy in a local, privacy-preserving hospital setting, adhering to the provided template structure in all cases. While both models were highly accurate in extracting nodal and extranodal disease sites, the newer Llama-4-Scout-17B-16E-Instruct model significantly outperformed Llama-3.3-70B-Instruct. This performance gap widened for the reasoning tasks of assigning Lugano staging and treatment response classes, with Llama-4-Scout achieving 85%/88% accuracy compared to 60%/65% for Llama-3.3-70B-Instruct. Our systematic error analysis revealed that the highest relative error rates for both models occurred when interpreting the level of disease after treatment (specifically, missing residual disease or misjudging progression), whereas for the generation of Lugano stages, incorrectly assigning a more advanced disease stage was the most common mistake. Notably, neither model produced hallucinations of newly involved nodal or extranodal sites. These findings suggest that while LLMs excel at extracting and structuring data, their performance declines when they are required to generate new clinical inferences. Our results are consistent with previous studies that have highlighted the potential of LLMs to transform unstructured radiology data into structured formats [13,14,15,16,17,18,19,20,21]. For instance, Adams et al. [15] demonstrated that GPT-4 achieved 100% accuracy in automatically aligning MRI and CT reports from various anatomical regions with their designated templates and structuring them without any errors or the introduction of extraneous findings. Similarly, open-source models such as Vicuna have previously achieved high sensitivity and specificity for extracting key radiological features from brain MRI reports [27]. In line with these findings, our study demonstrates the strength of LLMs in clinical information retrieval tasks, showing consistent and accurate extraction of nodal and extranodal lymphoma disease data. This capability holds significant potential for optimizing workflows in the evaluation and assessment of often lengthy oncologic disease processes, thereby improving time efficiency and reducing the burden on radiologists [28].
In contrast, the generation of complex clinical outputs from radiological data, exemplified in our study by the assessment of the Lugano stage and treatment response, seems to remain a challenge for LLMs. These findings align with previous observations that, although LLMs are adept at extracting factual data, they struggle with tasks that require the integration of multifaceted clinical criteria and clinical reasoning [29, 30]. For example, assigning the correct Lugano stage requires not only the correct extraction of all nodal and extranodal sites but also the correct application of these findings to the staging criteria and their modifiers [24]. Similarly, assessing treatment response demands not only evaluating current disease involvement against the staging criteria but also comparing it with previous radiological data to determine progression. Prompt engineering can play a critical role in reducing clinical errors made by LLMs. Although we provided the Lugano staging and treatment response criteria in the prompts as guidance and excluded modifiers for B-symptoms (which may not be available in the radiological data), multiple steps remain at which the LLM can fail, including the incorrect extraction of disease involvement, the incorrect assessment of its spatial or temporal evolution, and the misinterpretation of the guidelines. Nevertheless, the CoT prompting strategy in our study achieved overall high performance, with the highest relative error rates found for hallucinated residual disease, missed residual disease, and misjudged progression. From a clinical point of view, prompting the model to explicitly list baseline findings, then post-treatment findings, and only then to compare them according to the Lugano criteria also makes its reasoning process more transparent and allows for easier verification. Incorporating probabilistic scoring or self-critique steps could further down-weight borderline over-staging decisions and flag cases that require radiologist review [31]. For extraction omissions, incorporating retrieval-augmented generation (RAG) could be one solution, where LLMs retrieve information from dynamically updated knowledge bases [32, 33]. For example, a RAG system could first retrieve all document excerpts relevant to disease sites before passing them to the LLM for summarization, which may minimize the risk that relevant findings scattered throughout a long report are overlooked. Given that contemporary lymphoma guidelines, such as those from the European Society for Medical Oncology or the National Comprehensive Cancer Network, are regularly updated, RAG systems could also play an important role in keeping model responses up-to-date [34, 35]. Without comprehensive, real-time linkage to such guideline repositories, however, LLM outputs risk becoming outdated or generic, potentially leading to suboptimal or incorrect clinical assessments [9].
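To make the step-wise prompting described above concrete, the following minimal sketch shows how such a chain-of-thought instruction could be composed; the function name, the abridged criteria placeholder, and the JSON-template instruction are illustrative assumptions and do not reproduce the exact prompts used in our pipeline.

```python
# Minimal sketch of the step-wise (chain-of-thought) prompt structure described
# above. The function name, the abridged criteria placeholder, and the JSON
# template instruction are illustrative assumptions, not the study's verbatim prompts.

LUGANO_CRITERIA = "<abridged Lugano staging and response criteria, B-symptom modifiers excluded>"

def build_response_prompt(baseline_report: str, followup_report: str) -> str:
    """Ask the model to list baseline sites, then post-treatment sites,
    and only then compare them against the Lugano response criteria."""
    return (
        "You are assisting with lymphoma treatment response assessment.\n"
        f"Lugano criteria:\n{LUGANO_CRITERIA}\n\n"
        f"Baseline report:\n{baseline_report}\n\n"
        f"Post-treatment report:\n{followup_report}\n\n"
        "Step 1: List all nodal and extranodal disease sites at baseline.\n"
        "Step 2: List all nodal and extranodal disease sites after treatment.\n"
        "Step 3: Compare the two lists against the Lugano criteria and assign a "
        "response class (complete response, partial response, stable disease, or "
        "progressive disease), stating which finding drives the classification.\n"
        "Return the final answer using the provided JSON template."
    )
```

Because the intermediate lists of baseline and post-treatment sites are part of the requested output, a radiologist can verify each step rather than only the final response class.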
Furthermore, research indicates that LLMs may struggle with preserving context over extended narratives, leading to potential inaccuracies in complex clinical documentation [36]. In our study, we employed Meta’s Llama-3.3-70B-Instruct model, which features a context window of 128,000 tokens (i.e., around 96,000 English words), and compared it with the recently released Llama-4-Scout-17B-16E-Instruct, which has a reported context window of up to 10,000,000 tokens [22]. The fact that the performance gap widened on the reasoning tasks, while both models performed the data extraction tasks well, suggests that the interplay between clinical content complexity, the model’s reasoning capabilities, and the contextual dependencies of medical documentation, as described by Jin et al. [37], may also play a role. Thus, newer, and even smaller, LLMs appear better able to handle the cognitive demands of complex clinical scenarios involving multiple concepts, interdependencies, and precise terminology.
While we included two of the newest Llama models, open-source LLMs with reasoning capabilities, such as DeepSeek’s R1, which achieved performance comparable to OpenAI’s proprietary o1 on math, coding, and general reasoning benchmarks, may perform better on these clinical tasks [38]. However, their adoption in real-world clinical settings is limited by their need for computational resources, including high-performance graphics processing units, large storage capacities, and ongoing maintenance, all of which can strain hospital IT infrastructure [39]. On the other hand, while cloud-based solutions can reduce the need for local computing resources, they can also be costly. Cloud platforms such as Amazon Web Services charge based on usage, including model inference time, data storage, and bandwidth, making them expensive for continuous clinical use [40]. In addition, transferring sensitive patient data to cloud servers raises concerns about privacy, security, regulatory compliance (e.g., HIPAA, GDPR), and data sovereignty, adding another layer of complexity to their adoption in clinical environments [41]. Thus, the trade-off between the computational demands of local deployment and the financial and regulatory burdens of cloud-based solutions remains a critical barrier to the widespread clinical integration of advanced LLMs. Ultimately, while domain-specific fine-tuning of open-source LLMs was introduced with the goal of navigating medical documentation more effectively, given the intricate terminology and multifaceted interdependencies inherent in clinical data, recent evidence suggests that it does not lead to a performance increase [42]. Specifically, Dorfner et al. [43] showed that biomedically fine-tuned LLMs performed worse than general-purpose models on clinical tasks such as information extraction, document summarization, and clinical coding.
Our study has limitations. It is a single-center, cross-sectional study based on data from a short time period and restricted to adult lymphoma patients, which reduces its generalizability. Considering that about a quarter of the patients had stage III(2) disease and diffuse large B-cell lymphoma, with an overall high prevalence of mediastinal involvement, the performance of LLMs in rarer subtypes remains an open question for future cohort studies. Additionally, we evaluated only two newer, open-source Llama models because of their cost-effectiveness and feasibility for local implementation. Therefore, future studies may evaluate and compare different LLMs on real-world clinical data from more diverse patient populations using privacy-preserving approaches. We also did not explore alternative prompting strategies such as few-shot learning or breaking down tasks into multiple, sequential prompts. A comparative analysis of different prompting techniques could yield further insights into optimizing LLM performance for this clinical application. In addition, future studies should investigate the incorporation of RAG-based approaches for data extraction and the generation of disease progression reports, as well as the incorporation of multimodal data, such as imaging, laboratory values, and clinical history parameters, to further contextualize the feasibility of using LLMs in clinical decision-making.