Baseline knowledge in the field of pediatric nephrology and its enhancement following specific training of ChatGPT-4 “omni” and Gemini 1.5 Flash

To evaluate the performance of ChatGPT-4o and Gemini 1.5 in answering pediatric nephrology multiple-choice questions, we adopted a rigorous and structured approach. The multiple-choice questions were collected from the “Educational Review” articles published by the journal Pediatric Nephrology between January 2014 and April 2024 [9]. These articles contain multiple-choice questions at the end of the texts, with the correct answers provided after the bibliography section.

We limited the benchmark to questions published from 2014 onward to avoid biases arising from changes in medical knowledge over time: older questions might have answers that were considered correct at the time of publication but are now deemed incorrect. Restricting the time window therefore allows a fairer assessment of the models’ capabilities.

We tested the models with both Portable Document Format (PDF) and plain-text (TXT) files to understand how extraneous metadata and formatting complexities affect LLM performance. PDFs preserve visual integrity but have complex layouts that make structured data extraction challenging; issues such as non-linear text storage and variable font encoding complicate extraction further [10].

In contrast, TXT files are simpler and lack these formatting complexities, making them easier for LLMs to process. Research indicates LLM performance is better with simpler, less noisy data formats, allowing for more accurate information retrieval and processing [11]. This hypothesis motivated our decision to compare the models’ performance across these two formats.

The questions, along with the multiple-choice answers, were organized into several Excel files, divided by year, and then presented to the ChatGPT-4o and Gemini 1.5 models. We have made the question dataset available on HuggingFace [12], and the code used is available on GitHub [13].
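
As an illustration only, the yearly spreadsheets can be loaded and concatenated with pandas as in the minimal sketch below; the file-name pattern and column layout shown here are assumptions, and the actual dataset and loading code are available on HuggingFace [12] and GitHub [13].

```python
import pandas as pd

# Hypothetical per-year spreadsheet names; the real files are in the HuggingFace dataset [12].
years = list(range(2014, 2024))
frames = []
for year in years:
    df = pd.read_excel(f"questions_{year}.xlsx")   # one Excel file per year (assumed name)
    df["year"] = year
    frames.append(df)

questions = pd.concat(frames, ignore_index=True)
print(questions.head())
```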

The prompt used to present the questions to the models was as follows: “I will provide you with multiple-choice questions. You need to read and answer the questions with the letter of the answer you think is correct. Some questions may have more than one correct answer. Print two columns: the first with the question numbers, the second with the corresponding answers. Clear?” The answers provided by the models were then compared with those indicated as correct in the articles, assigning a value of 1 to the correct answers and 0 to the wrong ones. For questions with multiple correct answers, any responses that were only partially correct according to the solution key were considered wrong. We calculated the percentage of correct answers for each year and for the entire ten-year period. Due to the small number of questions in the 2022–2023 biennium, we decided to combine them and treat them as a single year.
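
The scoring rule can be summarized with the short sketch below, in which the column names ("model_answer", "answer_key") and file name are hypothetical: a response counts as correct (1) only when it matches the answer key exactly, so partially correct multi-answer responses score 0.

```python
import pandas as pd

def normalize(answer: str) -> frozenset:
    """Reduce an answer string such as 'a, c' or 'AC' to a set of letters."""
    return frozenset(ch for ch in str(answer).lower() if ch.isalpha())

def score_row(row) -> int:
    # 1 only if the model's answer set matches the key exactly;
    # partially correct multi-answer responses therefore score 0.
    return int(normalize(row["model_answer"]) == normalize(row["answer_key"]))

results = pd.read_excel("results_2014.xlsx")        # hypothetical file name
results["score"] = results.apply(score_row, axis=1)

percent_correct = results["score"].mean() * 100     # % correct for this year
print(f"Correct answers: {percent_correct:.1f}%")
```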

Subsequently, we prepared the reference articles by removing the last page, which contains the correct answers, using a Python script. The modified articles were merged into PDF files named “[year] FUSED.pdf” and presented to the models with the prompt: “Read and memorize the content of this file. I will ask you multiple-choice questions which you will need to answer based on the knowledge contained in this file.” The questions were then re-presented and analyzed using the same methodology described earlier.
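
The following is a minimal sketch of this answer-stripping and merging step, assuming the pypdf library; the file names are illustrative, and the script actually used is available on GitHub [13].

```python
from pypdf import PdfReader, PdfWriter

def merge_without_answer_pages(article_paths, output_path):
    """Drop the last page of each article (the answer key) and merge the rest."""
    writer = PdfWriter()
    for path in article_paths:
        reader = PdfReader(path)
        for i in range(len(reader.pages) - 1):   # skip the final page
            writer.add_page(reader.pages[i])
    with open(output_path, "wb") as fh:
        writer.write(fh)

merge_without_answer_pages(
    ["review_2014_01.pdf", "review_2014_02.pdf"],   # hypothetical input names
    "2014 FUSED.pdf",
)
```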

To further evaluate the models’ effectiveness, the articles were also converted to TXT format with another Python script we wrote. The models were then tested again on the multiple-choice questions using the same methodology.
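
A minimal sketch of this PDF-to-TXT conversion, again assuming the pypdf library, is shown below; the actual script is available on GitHub [13].

```python
from pypdf import PdfReader

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    """Extract plain text from every page of a PDF and save it as a TXT file."""
    reader = PdfReader(pdf_path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    with open(txt_path, "w", encoding="utf-8") as fh:
        fh.write(text)

pdf_to_txt("2014 FUSED.pdf", "2014 FUSED.txt")
```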

To avoid interference between trials, we created a new chat session for each interaction with the models, minimizing any carry-over from previous exchanges. Moreover, no correct answers were provided to the models during the experiments, so they received no corrective feedback from which they could have learned.

The results obtained before and after presenting the PDF and TXT files were compared to evaluate the impact of the models’ memorization of the articles on the correct response rates.

Statistical analysis

P values < 0.05 were considered statistically significant. Qualitative variables were compared using the chi-square test, which was conducted with the Stat-Graph XVII software for Windows. The confusion matrix analysis was performed in Python (version 3.12.3): pandas (version 2.0.3) was used for data manipulation, NumPy (version 1.24.3) for array operations, scikit-learn (version 1.3.0) for creating the confusion matrix, and matplotlib (version 3.7.2) together with seaborn (version 0.12.2) for visualizing it.
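
For illustration, a confusion matrix comparing before/after outcomes can be produced with the libraries and versions listed above roughly as follows; the data in this sketch are invented placeholders, and the analysis code actually used is available on GitHub [13].

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical binary outcomes per question: 1 = correct, 0 = wrong,
# before (baseline) and after uploading the reference articles.
baseline = np.array([1, 0, 1, 1, 0, 0, 1, 0])
after_upload = np.array([1, 1, 1, 1, 0, 1, 1, 0])

cm = confusion_matrix(baseline, after_upload, labels=[1, 0])
cm_df = pd.DataFrame(
    cm,
    index=["correct before", "wrong before"],
    columns=["correct after", "wrong after"],
)

sns.heatmap(cm_df, annot=True, fmt="d", cmap="Blues")
plt.title("Answers before vs. after file upload")
plt.tight_layout()
plt.show()
```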
