Recent advances in large language models (LLMs) have demonstrated remarkable capabilities across various domains [], thereby increasing the focus on developing multilingual models to serve diverse linguistic communities []. This trend is exemplified by the release of specialized language models, such as Gemma-2-JPN, a Japanese-specific variant of Google’s open LLM Gemma [,]. However, the development of such specialized models critically depends on the availability of high-quality domain-specific datasets in the target language. This requirement becomes particularly crucial in specialized fields, such as medical imaging [-], where the interpretation of diagnostic findings demands both technical precision and linguistic accuracy.
Computed tomography (CT) is indispensable in modern medical diagnostics, facilitating disease staging, lesion evaluation, and early detection. Japan has the highest number of CT scanners per capita and an annual scan volume surpassing that of most developed nations, presenting a vast reservoir of medical imaging data [,]. The extensive use of CT technology has positioned Japan as a pivotal contributor to global medical imaging resources. However, despite the proliferation of multilingual models and growing emphasis on language-specific adaptations, there remains a notable absence of large-scale Japanese radiology report datasets [], which is a critical gap hindering the development of Japanese-specific medical language models.
To address this challenge, we constructed “CT-RATE-JPN,” a Japanese version of the extensive “CT-RATE” dataset [], which consists of CT scans and interpretation reports collected from 21,304 patients in Turkey. General academic knowledge benchmarks have been successfully adapted for Japanese, as evidenced by JMMLU (Japanese Massive Multitask Language Understanding) [] and JMMMU (Japanese Massive Multi-discipline Multimodal Understanding) [], the Japanese versions of MMLU (Massive Multitask Language Understanding) [,] and MMMU (Massive Multi-discipline Multimodal Understanding) [], respectively, and medical benchmarks such as JMedBench [] have emerged by combining translated English resources with existing Japanese medical datasets. Nevertheless, a large-scale Japanese dataset specifically focused on radiology reports remains notably absent.
CT-RATE-JPN uses an innovative dataset construction approach by leveraging LLM-based machine translation to efficiently generate a large volume of training data. This addresses the fundamental challenge of the dataset scale in medical artificial intelligence (AI) development while maintaining quality through a strategic validation approach. A subset of the data undergoes careful revision by radiologists to create a rigorously verified validation dataset. This dual-track methodology, which combines machine-translated training data with specialist-validated evaluation data, establishes a robust pipeline for both training data acquisition and performance evaluation.
Both CT-RATE and CT-RATE-JPN retain licenses that allow free use for research purposes, thereby supporting broader research initiatives in medical imaging and language processing. To demonstrate the practical use of the CT-RATE-JPN dataset, we developed CT-BERT-JPN, a deep learning–based language model specifically designed for extracting structured labels from Japanese radiology reports. By converting unstructured Japanese medical text into standardized, language-agnostic structured labels, CT-BERT-JPN provides a scalable framework for integrating Japanese radiology data into global medical AI development, thereby addressing a critical need in the rapidly evolving landscape of multilingual medical AI.
CT-RATE is a comprehensive dataset comprising 25,692 noncontrast chest CT volumes of 21,304 unique patients from the Istanbul Medipol University Mega Hospital []. We selected this dataset because it is uniquely positioned as the only publicly available large-scale dataset that pairs CT volumes with radiology reports and permits the redistribution of derivative works. This dataset includes corresponding radiology text reports (consisting of a detailed findings section documenting observations and a concise impression section summarizing key information), multi-abnormality labels, and metadata. The dataset is divided into 2 cohorts: 20,000 patients for the training set and 1304 patients for the validation set, allowing robust model training and evaluation across diverse patient cases [,]. The training dataset, comprising 22,778 unique reports, was used in constructing CT-RATE-JPN, a Japanese-translated version of the dataset created using machine translation. Independently, we randomly sampled 150 reports from the validation cohort (n=1304); this validation subset was never used during model training and was reserved solely for inference-time evaluation. These 150 reports were first machine translated and then manually revised and refined by radiologists. The data selection process is illustrated in Figure 1.
Figure 1. Data selection process for CT-RATE-JPN training and validation datasets. The training dataset included all 22,778 radiology reports from CT-RATE without exclusion. For the validation dataset, 150 reports were randomly sampled from 1505 available reports, with 1355 reports excluded to avoid patient overlap between training and validation sets.

The CT-RATE dataset was annotated using 18 structured labels covering key findings relevant to chest CT analysis. These labels included “Medical material,” “Arterial wall calcification,” “Cardiomegaly,” “Pericardial effusion,” “Coronary artery wall calcification,” “Hiatal hernia,” “Lymphadenopathy,” “Emphysema,” “Atelectasis,” “Lung nodule,” “Lung opacity,” “Pulmonary fibrotic sequela,” “Pleural effusion,” “Mosaic attenuation pattern,” “Peribronchial thickening,” “Consolidation,” “Bronchiectasis,” and “Interlobular septal thickening.” The creators of the CT-RATE dataset developed a structured findings model based on the RadBERT architecture [,], trained on a manually labeled subset, to label the remaining cases. This model achieved an F1-score ranging from 0.95 to 1.00, demonstrating its efficacy in accurately structuring radiological findings from CT reports. This approach underscores the reliability of CT-RATE’s structured annotations for developing high-performance diagnostic models. These structured labels were also used in the development of a Japanese structured findings model for CT-RATE-JPN, enabling the accurate structuring of radiological findings in Japanese CT reports.
Ethical Considerations

Given that CT-RATE is a publicly available dataset with deidentified patient information, and that our study focused on the translation and linguistic analysis of the existing dataset without accessing any additional patient data, institutional review board approval was not required for this research.
Translation for CT-RATE-JPN

For CT-RATE-JPN, machine translation was conducted using GPT-4o mini (API version, “gpt-4o-mini-2024-07-18”) [], a lightweight, fast version of OpenAI’s GPT-4o model []. GPT-4o mini produces high-accuracy translations at an affordable rate, making it suitable for large-scale dataset translation. Each radiology report was processed by translating both the findings and the impression sections. The complete translation prompts used for GPT-4o mini are provided in Figures S1 (original Japanese prompt) and S2 (English translation) in . Figure 2 shows the study workflow.
Figure 2. Workflow for translation and validation in constructing CT-RATE-JPN. The figure outlines machine translation application using GPT-4o mini for the training dataset and 2-phase manual correction for the 150 validation reports. Phase 1 involved initial revisions by radiology residents. Phase 2 consisted of expert review and refinement by board-certified radiologists.

To ensure the accuracy and reliability of the evaluation data, we conducted a comprehensive manual correction process on the 150 reports in the validation dataset. This process consisted of 2 distinct phases. In the first phase, we assembled a team of 5 radiology residents, all between their fourth and sixth postgraduate years, to conduct an initial review and revision of the machine translations. The reports were systematically distributed among team members to optimize workflow efficiency. We intentionally chose a larger team to incorporate diverse clinical perspectives and minimize potential translation bias during the review process. The second phase involved a thorough expert review by 2 board-certified radiologists with extensive experience (10 and 11 postgraduate years). The senior radiologists divided the revised translations between them for final confirmation and refinement. This structured approach to task allocation, combined with the rigorous 2-step review process, ensured that the validation dataset for CT-RATE-JPN met the high-quality standards necessary for robust model assessment.
For the training dataset, all translations were generated using GPT-4o mini without manual correction. This dataset was specifically designed for machine learning model training. The decision to rely exclusively on machine-translated data for the training set balanced dataset scale against the practical constraints of manual annotation.
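As a concrete illustration of this translation step, the following is a minimal sketch of how a single report could be translated with the OpenAI Python client. The prompt wording, system message, and function name are illustrative placeholders; the actual prompts are those shown in Figures S1 and S2.

```python
# Minimal sketch of a GPT-4o mini translation call for one report; the prompt text and
# function name are illustrative placeholders (the actual prompts are in Figures S1 and S2).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_report(findings_en: str, impression_en: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        temperature=0,  # deterministic output is preferable for dataset construction
        messages=[
            {"role": "system",
             "content": "You are a medical translator. Translate the following chest CT "
                        "report from English to Japanese, preserving radiological "
                        "terminology."},
            {"role": "user",
             "content": f"Findings:\n{findings_en}\n\nImpression:\n{impression_en}"},
        ],
    )
    return response.choices[0].message.content
```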
Both CT-RATE and CT-RATE-JPN were released under a Creative Commons Attribution-NonCommercial-ShareAlike (CC BY-NC-SA) license, allowing free use for noncommercial research purposes with proper citation and share-alike terms for derivative works.
Development of CT-BERT-JPN for Structured Finding Classification

For model training, we randomly split the dataset in a 4:1 ratio, with 80% used for training and 20% for internal evaluation. The pretrained “tohoku-nlp/bert-base-japanese-v3” model from Hugging Face was used []. This model follows the architecture of the original BERT base model, with 12 layers, 768 hidden dimensions, and 12 attention heads. It was pretrained on extensive Japanese corpora, including the Japanese portion of the CC-100 corpus [,] and the Japanese version of Wikipedia []. BERT-based models have demonstrated significant success in downstream tasks in the medical domain [,], making them a promising choice for our research.
Training was conducted on Google Colaboratory equipped with an NVIDIA L4 GPU using the transformers library (version 4.46.2) with a learning rate of 2×10⁻⁵, a batch size of 8 for both training and evaluation, and a weight decay of 0.01. Binary cross-entropy loss was applied to optimize the model for multilabel classification. Given the domain shift between the validation dataset (machine-translated) and test dataset (radiologist-revised), hyperparameter tuning was deliberately omitted to avoid overfitting, as such optimization may not contribute to performance improvement under domain shift conditions [,]. The model was trained over 4 epochs with internal evaluation and checkpoint saving at each epoch. The total training time was approximately 51.5 minutes (3090.54 s) across 4 epochs, with an average of 12.9 minutes (772.64 s) per epoch. The best-performing model on the internal evaluation data was selected and subsequently used for testing on the validation dataset of CT-RATE-JPN to ensure a reliable performance assessment. Figure 3 shows the overall workflow for developing CT-BERT-JPN.
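The following is a condensed sketch of this fine-tuning setup using the Hugging Face Trainer with the hyperparameters stated above (learning rate 2×10⁻⁵, batch size 8, weight decay 0.01, 4 epochs, binary cross-entropy via multilabel classification). The dataset variables and label preparation are placeholders, not the exact training script.

```python
# Sketch of the CT-BERT-JPN fine-tuning setup with the hyperparameters stated above.
# train_ds / eval_ds are assumed to be Hugging Face Datasets with a "text" column and
# a "labels" column holding 18 floats (one per finding); the Japanese tokenizer also
# requires fugashi and unidic-lite.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "tohoku-nlp/bert-base-japanese-v3"
NUM_LABELS = 18  # the 18 structured CT findings

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # uses binary cross-entropy (BCE-with-logits) loss
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=512)

args = TrainingArguments(
    output_dir="ct-bert-jpn",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    num_train_epochs=4,
    eval_strategy="epoch",   # internal evaluation at each epoch
    save_strategy="epoch",   # checkpoint saving at each epoch
    load_best_model_at_end=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds.map(tokenize, batched=True),
    eval_dataset=eval_ds.map(tokenize, batched=True),
)
trainer.train()
```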
Figure 3. CT-BERT-JPN development workflow showing data preprocessing and model fine-tuning for structured computed tomography (CT) classification of findings.

Translated Radiology Reports Evaluation by Automatic Metrics

For basic text analysis, we examined the structural characteristics of the translated reports, including character, word, and sentence counts, as well as lexical diversity. Unlike English, Japanese does not use spaces to delimit words. Therefore, we used MeCab (version 1.0.10) [], one of the most widely used morphological analyzers for Japanese text processing, to accurately segment and count words. These metrics were calculated for both machine-translated and radiologist-revised texts to assess the consistency of textual characteristics across different stages of dataset creation.
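As an illustration of these text statistics, the sketch below segments a Japanese report with MeCab and counts characters, words, sentences, and unique words; the sample report and the sentence-splitting rule (splitting on “。”) are simplifying assumptions.

```python
# Sketch of the basic text statistics: character, word, sentence, and unique-word counts
# for a Japanese report, with MeCab used for word segmentation. The sample report and the
# sentence-splitting rule (on "。") are simplifying assumptions.
import MeCab

tagger = MeCab.Tagger("-Owakati")  # wakati-gaki mode: space-separated surface forms

def text_stats(report: str) -> dict:
    words = tagger.parse(report).split()
    sentences = [s for s in report.split("。") if s.strip()]
    return {
        "characters": len(report),
        "words": len(words),
        "sentences": len(sentences),
        "unique_words": len(set(words)),
    }

print(text_stats("両肺に結節影を認める。胸水は認めない。"))
```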
For translation quality assessment, we computed the bilingual evaluation understudy (BLEU) [] and recall-oriented understudy for gisting evaluation (ROUGE)-1, ROUGE-2, and ROUGE-L scores [] using NLTK (version 3.9.1) and rouge-score (version 0.1.2) libraries.
The BLEU metric evaluates the accuracy of machine-translated text by comparing it to a reference translation. It measures how many words and short phrases in the machine translation match those in the reference translation, to assess the degree of similarity in wording and phrasing between the two texts.
The ROUGE metric assesses the quality of summaries or translations by measuring the overlap between machine-generated and reference texts. ROUGE-1 considers the overlap of individual words; ROUGE-2 examines the overlap of pairs of consecutive words; and ROUGE-L focuses on the longest matching sequence of words between two texts. These metrics emphasize recall by evaluating how much of the important content from the reference text is captured in the machine-generated text. These metrics were calculated by comparing machine-translated texts with radiologist-revised reference translations in the validation dataset.
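The sketch below shows one plausible way to compute these scores for a single report pair with NLTK and rouge-score. Because the default rouge-score tokenizer targets English, both texts are pre-segmented with MeCab and a simple whitespace tokenizer is passed to the scorer; this configuration and the toy sentences are assumptions, not necessarily the exact setup used.

```python
# Sketch of BLEU and ROUGE computation for one machine-translated (MT) report and its
# radiologist-revised reference. Texts are pre-segmented with MeCab because the default
# rouge-score tokenizer is designed for English; toy sentences are placeholders.
import MeCab
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

tagger = MeCab.Tagger("-Owakati")

def segment(text: str) -> list:
    return tagger.parse(text).split()

class WhitespaceTokenizer:
    """Pass-through tokenizer so rouge-score keeps the MeCab segmentation."""
    def tokenize(self, text):
        return text.split()

ref_text = "両側に少量の胸水を認める。"   # radiologist-revised reference (toy example)
mt_text = "両側の少量胸水が認められる。"  # GPT-4o mini translation (toy example)

ref_tokens, mt_tokens = segment(ref_text), segment(mt_text)

bleu = sentence_bleu([ref_tokens], mt_tokens,
                     smoothing_function=SmoothingFunction().method1)

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"],
                                  tokenizer=WhitespaceTokenizer())
rouge = scorer.score(" ".join(ref_tokens), " ".join(mt_tokens))

print(f"BLEU={bleu:.3f}, ROUGE-1={rouge['rouge1'].fmeasure:.3f}, "
      f"ROUGE-2={rouge['rouge2'].fmeasure:.3f}, ROUGE-L={rouge['rougeL'].fmeasure:.3f}")
```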
Translated Radiology Reports Evaluation by Radiologists

To assess the quality of machine-translated radiology reports, we implemented a systematic evaluation by radiologists. A subset of reports was randomly assigned to radiologists for review, with each report evaluated by 2 reviewers: 1 senior radiologist (designated as rater 1, postgraduate year [PGY] 6-11) and 1 junior radiologist (designated as rater 2, PGY 4-5). To maintain objectivity, radiologists were never assigned to evaluate reports they had previously reviewed or corrected.
The evaluation used a 5-point Likert scale (1=poor to 5=excellent) across 3 distinct dimensions:
1. Grammatical correctness: assessment of linguistic accuracy, including proper Japanese grammar and natural language usage. The presence of nonexistent words, unnatural English terms, or grammatical errors resulted in lower scores. Specialized terminology errors were evaluated separately.
2. Medical terminology: evaluation of the accuracy and appropriateness of medical and radiological terminology in Japanese. This dimension specifically focused on technical terms and domain-specific language.
3. Overall readability: subjective assessment of how comprehensible and fluent the translated text appeared to radiologists in clinical practice. While this dimension relates to the previous two categories, it captured the holistic impression of the text’s utility in a clinical setting.

Inter-rater reliability was calculated using the quadratic weighted kappa (QWK) to ensure consistency in evaluations across different radiologists, as this metric is particularly suitable for ordinal data such as Likert scales. This multi-dimensional evaluation approach provided comprehensive insights into both the technical accuracy and practical usability of machine-translated radiology reports in a Japanese clinical context.
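A minimal sketch of this agreement calculation with scikit-learn is shown below; the score lists are toy placeholders.

```python
# Sketch of inter-rater agreement via quadratic weighted kappa (QWK) on 5-point
# Likert scores; the score lists are toy placeholders.
from sklearn.metrics import cohen_kappa_score

rater1_scores = [3, 4, 5, 3, 4, 2, 5]  # senior radiologist (rater 1), placeholder
rater2_scores = [3, 3, 4, 3, 5, 2, 4]  # junior radiologist (rater 2), placeholder

qwk = cohen_kappa_score(rater1_scores, rater2_scores, weights="quadratic")
print(f"QWK: {qwk:.3f}")
```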
CT-BERT-JPN Performance Evaluation

To evaluate the performance of the classification model in CT findings extraction, we used a test dataset comprising 150 radiology reports revised by radiologists to ensure accuracy. The key metrics calculated for model assessment included accuracy, precision, recall, F1-score, area under the receiver operating characteristic curve (AUC-ROC), and average precision (AP). The 95% CIs for these metrics were calculated using 1000 bootstrap resampling iterations. The scikit-learn library (version 1.5.2) was used for the analyses.
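The sketch below illustrates this bootstrap procedure for the F1-score of a single finding (other metrics are handled analogously); the label and prediction arrays are toy placeholders.

```python
# Sketch of the 95% CI estimation: 1000 bootstrap resamples of the test reports,
# recomputing the F1-score of one finding on each resample (other metrics analogous).
# The label/prediction arrays are toy placeholders.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def bootstrap_f1_ci(y_true, y_pred, n_boot=1000, alpha=0.05):
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample reports with replacement
        scores.append(f1_score(y_true[idx], y_pred[idx], zero_division=0))
    return np.quantile(scores, [alpha / 2, 1 - alpha / 2])

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1])
print(bootstrap_f1_ci(y_true, y_pred))
```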
Baseline Comparison With GPT-4o

To establish a baseline, we performed structured labeling using GPT-4o (API version, “gpt-4o-2024-11-20”), as it has been widely adopted in various radiology tasks as a representative closed-source commercial LLM. We specifically chose GPT-4o because, to our knowledge, no existing open-source models were available for this particular task at the time of the study. We used a zero-shot prompting strategy that instructed GPT-4o to extract binary labels (0/1) for each of the 18 target findings from Japanese radiology reports and to output the results in JSON format, without additional prompt engineering or few-shot examples. The input prompts used for GPT-4o are presented in Figures S3 (original Japanese version) and S4 (English translation) in .
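A minimal sketch of this zero-shot labeling call with the OpenAI Python client is shown below. The system prompt, truncated finding list, and function name are illustrative; the actual prompts are those in Figures S3 and S4.

```python
# Sketch of the zero-shot GPT-4o baseline: the model is asked to return a JSON object
# with a 0/1 value for each of the 18 findings. The prompt wording is illustrative;
# the actual prompts are provided in Figures S3 and S4.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
FINDINGS = ["Medical material", "Arterial wall calcification", "Cardiomegaly"]  # ...all 18

def label_report(report_jpn: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-2024-11-20",
        temperature=0,
        response_format={"type": "json_object"},  # request structured JSON output
        messages=[
            {"role": "system",
             "content": "Read the Japanese chest CT report and output a JSON object "
                        "mapping each of the following findings to 1 (present) or 0 "
                        "(absent): " + ", ".join(FINDINGS)},
            {"role": "user", "content": report_jpn},
        ],
    )
    return json.loads(response.choices[0].message.content)
```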
Statistical Analysis

Statistical comparisons were performed using the Wilcoxon signed-rank test. This nonparametric test was applied in two contexts: (1) comparing paired predictions from CT-BERT-JPN and GPT-4o on identical validation cases, and (2) comparing CT-BERT-JPN performance with radiologist-revised translated reports versus raw machine-translated reports as input, where each pair consisted of model outputs on the same test sample under different conditions. The Wilcoxon signed-rank test was chosen because it does not require the assumption of a normal distribution and is appropriate for comparing paired model outputs. Statistical analyses were conducted using the scipy.stats library (version 1.11.3).
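As a brief illustration, the paired comparison can be run with scipy.stats as sketched below; the prediction arrays are toy placeholders.

```python
# Sketch of the paired comparison with the Wilcoxon signed-rank test; here on per-case
# predictions from CT-BERT-JPN and GPT-4o for one finding (toy placeholders).
from scipy.stats import wilcoxon

ctbert_preds = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]  # CT-BERT-JPN predictions (placeholder)
gpt4o_preds  = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]  # GPT-4o predictions on the same cases

stat, p_value = wilcoxon(ctbert_preds, gpt4o_preds, zero_method="wilcox")
print(f"Wilcoxon statistic={stat}, P={p_value:.3f}")
```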
The basic text statistics of the translated reports are summarized in Table 1, with separate sections for Findings and Impression. The training dataset (n=22,778) and validation datasets (consisting of 150 machine-translated reports and 150 radiologist-revised reports) showed a consistent text structure across all metrics.
Table 1. Text statistics on CT-RATE-JPN across different datasets, including “Findings” and “Impression.”

| Section | n | Characters, mean (SD) | Words, mean (SD) | Sentences, mean (SD) | Unique words, mean (SD) |
|---|---|---|---|---|---|
| Findings | | | | | |
| Training (MT) | 22,778 | 467.0 (148.0) | 303.3 (95.4) | 15.5 (4.6) | 126.8 (29.5) |
| Validation (MT) | 150 | 475.0 (130.1) | 307.0 (83.6) | 15.7 (3.9) | 128.9 (27.8) |
| Validation (Refined) | 150 | 455.6 (122.6) | 297.7 (80.8) | 15.7 (4.0) | 126.2 (27.4) |
| Impression | | | | | |
| Training (MT) | 22,778 | 89.1 (68.9) | 55.7 (43.6) | 3.1 (2.2) | 38.1 (23.4) |
| Validation (MT) | 150 | 101.3 (76.0) | 63.2 (48.1) | 3.5 (2.6) | 41.7 (24.3) |
| Validation (Refined) | 150 | 97.8 (72.8) | 61.2 (46.0) | 3.6 (2.6) | 40.9 (24.0) |

aMT: machine-translated text using GPT-4o mini.
bRefined indicates text subjected to radiologist review and refinement.
The Findings section had character counts averaging around 455.6‐475.0 characters, with slightly lower counts in radiologist-revised texts compared to machine translations. Word count followed a similar pattern, averaging approximately 300 words per report across all datasets. The Impression section was notably more concise, as expected from summary statements. Character counts averaged around 89.1‐101.3 characters, with word counts of approximately 55.7‐63.2 words per report. The sentence structure was also more condensed, with approximately 3.1‐3.6 sentences per report.
Notably, in both sections, the overall text structure remained consistent between machine-translated and radiologist-revised versions, with similar patterns in sentence length and organization. Although the refined versions had slightly lower character and word counts than their machine-translated counterparts, the basic structural characteristics of the reports were preserved throughout the translation and refinement processes.
Figure 4 shows that the analysis of the label distributions revealed a marked class imbalance in both the training and validation datasets. In the training set, “Lung nodule” appears most frequently with 10,856 instances, whereas “Interlobular septal thickening” occurs least frequently with only 1702 instances, representing a ratio of approximately 6.4:1. This imbalance is even more pronounced in the validation dataset, where the ratio between the most frequent (82) and least frequent (7) class instances is approximately 11.7:1.
Figure 4. Bar plots showing the data distribution across different findings, sorted in descending order. Top: training dataset distribution; bottom: validation dataset distribution. The number above each bar represents the number of positive samples for each condition.

Translated Radiology Reports Evaluation

We evaluated the quality of machine-translated reports in CT-RATE-JPN using both automated metrics and expert assessments. For automated evaluation, we compared GPT-4o mini translations with radiologist-revised references in the validation dataset. The evaluation metrics for both sections are summarized in Table 2. These scores are high, indicating that the machine translation maintained the fundamental structure and meaning of the original reports.
Table 2. Automated evaluation metrics comparing GPT-4o mini translations with radiologist-revised references in the validation dataset.

| Section | BLEU, mean (SD) | ROUGE-1, mean (SD) | ROUGE-2, mean (SD) | ROUGE-L, mean (SD) |
|---|---|---|---|---|
| Findings | 0.731 (0.104) | 0.876 (0.050) | 0.770 (0.091) | 0.854 (0.064) |
| Impression | 0.690 (0.196) | 0.857 (0.104) | 0.748 (0.161) | 0.837 (0.120) |

aBLEU: bilingual evaluation understudy.
bROUGE: recall-oriented understudy for gisting evaluation.
The evaluation of machine-translated radiology reports showed moderate inter-rater agreement, with QWK values of 0.410 for grammar, 0.522 for terminology, and 0.408 for readability. These moderate agreement levels may reflect the inherent subjectivity of Likert-scale-based quality assessments. However, radiologist refinements resulted in substantial score improvements across all evaluation dimensions. The highest agreement between the senior radiologist (Rater 1) and junior radiologist (Rater 2) was observed in terminology assessment. Both raters consistently scored radiologist-revised translations significantly higher than machine translations across all dimensions, and Rater 1’s evaluations were consistently higher than Rater 2’s for both machine-translated and radiologist-revised translations. Table 3 demonstrates that all improvements between machine translations and radiologist revisions were statistically significant (P<.001) across all evaluation criteria. While GPT-4o mini translations achieved moderate scores, radiologist revisions provided significant enhancements, particularly in terminology accuracy, highlighting the value of domain expertise in medical translation even when using advanced language models. In addition, Table 4 shows the distribution of quality changes: the majority of radiologist revisions (65%-91%, depending on rater and dimension) resulted in improved ratings compared with machine translation, whereas deterioration was minimal (0%-6% across all categories).
Table 3. Evaluation of translated texts comparing mean Likert scale scores of GPT-4o mini translations (MT) and radiologist-revised translations.

| Evaluation criteria | MT, mean (SD) | Radiologist, mean (SD) | P value |
|---|---|---|---|
| Rater 1 | | | |
| Grammar | 3.44 (0.91) | 4.64 (0.58) | <.001 |
| Terminology | 3.15 (0.81) | 4.64 (0.65) | <.001 |
| Readability | 3.35 (0.91) | 4.59 (0.69) | <.001 |
| Rater 2 | | | |
| Grammar | 3.35 (0.71) | 4.21 (0.64) | <.001 |
| Terminology | 2.73 (0.82) | 4.06 (0.65) | <.001 |
| Readability | 2.73 (0.75) | 3.91 (0.76) | <.001 |

aMT: machine-translated text using GPT-4o mini.
bP values were calculated using the Wilcoxon signed-rank test.
Figure 5. Distribution of evaluation scores for machine-translated (MT) and radiologist-revised translations across 3 dimensions (grammar, terminology, and readability) by 2 rater groups. The upper panel shows evaluations by the senior radiologists (Rater 1, postgraduate year 6-11) and the lower panel shows evaluations by the junior radiologists (Rater 2, postgraduate year 4-5). Scores were assessed using a 5-point Likert scale (1=poor to 5=excellent).

Table 4. Distribution of quality changes between machine translation and radiologist-revised translations across grammar, terminology, and readability dimensions.

| Item | Rater 1: worse, n (%) | Rater 1: no change, n (%) | Rater 1: better, n (%) | Rater 2: worse, n (%) | Rater 2: no change, n (%) | Rater 2: better, n (%) |
|---|---|---|---|---|---|---|
| Grammar | 4 (2.7) | 28 (18.7) | 118 (78.7) | 9 (6) | 43 (28.7) | 98 (65.3) |
| Terminology | 0 (0) | 14 (9.3) | 136 (90.7) | 7 (4.7) | 34 (22.7) | 109 (72.7) |
| Readability | 1 (0.7) | 29 (19.3) | 120 (80) | 9 (6) | 30 (20) | 111 (74) |

Despite the high automated metric scores (see Table 2), expert scoring by radiologists revealed that substantial revisions were necessary for medical terminology, with significant improvements observed between machine translations and radiologist-revised versions. Radiologists’ assessments identified representative patterns of necessary modifications. This expert review revealed 3 major categories of improvements (see Figure 6; the original English reports used are shown in Figure S5 in ), which reflect typical challenges in medical translation: (1) contextual refinement of technically correct but unnatural terms, (2) completion of missing or incomplete translations, and (3) Japanese localization of untranslated English terms. Although not exhaustive, these patterns represent key areas where human expertise complements machine translation in medical contexts.
Figure 6. Representative examples of computed tomography (CT) report translation improvements in 3 categories: contextual term refinement, completion of missing details, and localization of untranslated terms from English to Japanese radiological equivalents.

The first category, contextual refinement, involved replacing technically accurate but clinically uncommon expressions with more natural medical terminology. For instance, direct translations of anatomical conditions were often revised to their proper radiological equivalents when describing vessel status, reflecting standard terminology in Japanese clinical practice. The second category addressed cases in which certain medical terms were either missing or incompletely translated, thereby requiring additional context-specific information. A typical example would be anatomical descriptions lacking specific diagnostic terminology that is common in radiological reporting. The third category focused on proper localization of English medical terms that were initially left untranslated, such as converting technical descriptors of pathological findings into appropriate Japanese radiological counterparts.
CT-BERT-JPN Performance Evaluation

The CT-BERT-JPN model achieved a micro F1-score of 0.9607 on the internal evaluation data, which served as the model selection criterion. When evaluated on the validation dataset, the model showed a micro F1-score of 0.9695 with machine-translated inputs and 0.9520 with radiologist-revised inputs, representing a performance decrease of 0.0175. Table 5 presents the performance evaluation results of CT-BERT-JPN across 18 different findings from the 150 chest CT radiology reports in the validation dataset. The model achieved perfect scores (1.000) across all evaluation metrics (accuracy, precision, recall, F1-score, AUC-ROC, and AP) for pericardial effusion, hiatal hernia, and mosaic attenuation pattern. It demonstrated accuracy exceeding 0.950 in 17 of 18 findings, with AUC-ROC values surpassing 0.98 for all findings. Furthermore, AP remained consistently strong, ranging from 0.873 to 1.000, despite the imbalanced distribution of positive samples (from 7 cases for interlobular septal thickening to 82 cases for lung nodules), underscoring the model’s robust discriminative ability across all conditions.
Table 5. Performance evaluation of CT-BERT-JPN across 18 different findings.

| Findings | Accuracy (95% CI) | Precision (95% CI) | Recall (95% CI) | F1-score (95% CI) | AUC-ROC (95% CI) | AP (95% CI) |
|---|---|---|---|---|---|---|
| Medical material | 0.973 (0.940-0.993) | 0.778 (0.538-0.952) | 1.000 (1.000-1.000) | 0.875 (0.700-0.976) | 0.999 (0.995-1.000) | 0.990 (0.950-1.000) |
| Arterial wall calcification | 0.987 (0.967-1.000) | 0.961 (0.898-1.000) | 1.000 (1.000-1.000) | 0.980 (0.946-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |
| Cardiomegaly | 0.987 (0.967-1.000) | 1.000 (1.000-1.000) | 0.920 (0.800-1.000) | 0.958 (0.889-1.000) | 0.999 (0.996-1.000) | 0.996 (0.981-1.000) |
| Pericardial effusion | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |
| Coronary artery wall calcification | 0.987 (0.967-1.000) | 0.978 (0.927-1.000) | 0.978 (0.925-1.000) | 0.978 (0.943-1.000) | 1.000 (0.999-1.000) | 1.000 (0.997-1.000) |
| Hiatal hernia | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |
| Lymphadenopathy | 0.987 (0.967-1.000) | 0.973 (0.907-1.000) | 0.973 (0.912-1.000) | 0.973 (0.929-1.000) | 0.994 (0.983-1.000) | 0.987 (0.963-1.000) |
| Emphysema | 0.980 (0.953-1.000) | 0.938 (0.844-1.000) | 0.968 (0.889-1.000) | 0.952 (0.881-1.000) | 0.989 (0.970-1.000) | 0.960 (0.888-1.000) |
| Atelectasis | 0.993 (0.973-1.000) | 0.980 (0.929-1.000) | 1.000 (1.000-1.000) | 0.990 (0.963-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |
| Lung nodule | 0.967 (0.940-0.993) | 0.975 (0.937-1.000) | 0.963 (0.921-1.000) | 0.969 (0.942-0.994) | 0.991 (0.978-1.000) | 0.994 (0.985-1.000) |
| Lung opacity | 0.953 (0.913-0.987) | 0.929 (0.857-0.983) | 0.945 (0.880-1.000) | 0.937 (0.885-0.977) | 0.991 (0.980-0.999) | 0.985 (0.963-0.999) |
| Pulmonary fibrotic sequela | 0.953 (0.920-0.987) | 0.935 (0.860-1.000) | 0.915 (0.833-0.980) | 0.925 (0.867-0.974) | 0.981 (0.960-0.997) | 0.973 (0.942-0.995) |
| Pleural effusion | 0.987 (0.967-1.000) | 0.905 (0.762-1.000) | 1.000 (1.000-1.000) | 0.950 (0.865-1.000) | 1.000 (0.997-1.000) | 0.997 (0.983-1.000) |
| Mosaic attenuation pattern | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |
| Peribronchial thickening | 0.960 (0.927-0.987) | 1.000 (1.000-1.000) | 0.714 (0.529-0.905) | 0.833 (0.692-0.950) | 0.985 (0.960-1.000) | 0.948 (0.869-0.997) |
| Consolidation | 0.933 (0.893-0.973) | 0.706 (0.533-0.862) | 1.000 (1.000-1.000) | 0.828 (0.696-0.926) | 0.996 (0.988-1.000) | 0.985 (0.951-1.000) |
| Bronchiectasis | 0.980 (0.953-1.000) | 0.870 (0.714-1.000) | 1.000 (1.000-1.000) | 0.930 (0.833-1.000) | 0.990 (0.970-1.000) | 0.873 (0.686-1.000) |
| Interlobular septal thickening | 0.993 (0.980-1.000) | 0.875 (0.600-1.000) | 1.000 (1.000-1.000) | 0.933 (0.750-1.000) | 1.000 (1.000-1.000) | 1.000 (1.000-1.000) |

aAUC-ROC: area under the receiver operating characteristic curve.
bAP: average precision.
 compares the F1-scores of CT-BERT-JPN and GPT-4o across the 18 conditions, displayed as bar charts, and Table S1 in presents the detailed results for GPT-4o across the 18 conditions using valid data. CT-BERT-JPN achieved higher F1-scores than GPT-4o in 11 findings and equivalent performance in 2 findings, specifically hiatal hernia and mosaic attenuation pattern, both of which attained perfect F1-scores. Notably, CT-BERT-JPN showed higher performance for lymphadenopathy (0.142 higher) and atelectasis (0.074 higher), with statistically significant differences confirmed by the Wilcoxon signed-rank test (P=.003 and P=.005, respectively). However, the model performed worse than GPT-4o in 5 findings, most notably peribronchial thickening (−0.051) and consolidation (−0.035), although the Wilcoxon signed-rank tests did not detect statistically significant differences for these findings.
A detailed analysis of CT-BERT-JPN performance was conducted by comparing 2 scenarios: radiologist-revised translated reports versus raw machine-translated reports as input. The differences in the metrics are detailed in Table 6 (the performance metrics for the machine-translated inputs can be found in Table S2 in ). The analysis showed that, for most findings, performance differences were less than 5%, with no significant variations, confirming the robustness of the model. However, peribronchial thickening showed notable decreases, with a recall decrease of 0.238 and an F1-score reduction of 0.143; statistical analysis using the Wilcoxon signed-rank test revealed a significant difference (P=.025). Similarly, consolidation showed notable performance declines, with precision decreasing by 0.179 and the F1-score by 0.092, and the Wilcoxon signed-rank test again indicated a significant difference (P=.034).
Table 6. Performance differences for CT-BERT-JPN between radiologist-revised and machine-translated report inputs across multiple metrics.

| Findings | Accuracy | Precision | Recall | F1-score | AUC-ROC | AP | P value |
|---|---|---|---|---|---|---|---|
| Medical material | 0.000 | −0.034 | 0.071 | 0.008 | 0.002 | 0.013 | 1.00 |
| Arterial wall calcification | −0.006 | −0.019 | 0.000 | −0.010 | 0.000 | 0.000 | .32 |
| Cardiomegaly | −0.013 | 0.000 | −0.080 | −0.042 | −0.001 | −0.004 | .16 |
| Pericardial effusion | 0.007 | 0.000 | 0.083 | 0.043 | 0.000 | 0.000 | .32 |
| Coronary artery wall calcification | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | N/A |
| Hiatal hernia | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | N/A |
| Lymphadenopathy | −0.006 | −0.001 | −0.027 | −0.014 | −0.006 | −0.012 | .32 |
| Emphysema | −0.007 | −0.030 | 0.000 | −0.016 | −0.011 | −0.039 | .56 |
| Atelectasis | −0.007 | −0.020 | 0.000 | −0.010 | 0.000 | 0.000 | .32 |
| Lung nodule | −0.013 | −0.025 | 0.000 | −0.012 | 0.000 | 0.000 | .16 |
| Lung opacity | −0.007 | −0.016 | 0.000 | −0.008 | −0.002 | −0.004 | .71 |
| Pulmonary fibrotic sequela | −0.007 | 0.017 | −0.042 | −0.013 | −0.005 | −0.007 | .71 |
| Pleural effusion | −0.006 | −0.045 | 0.000 | −0.024 | 0.002 | 0.008 | .32 |
| Mosaic attenuation pattern | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | N/A |
| Peribronchial thickening | −0.033 | 0.000 | −0.238 | −0.143 | −0.010 | −0.033 | .03 |
| Consolidation | −0.040 | −0.179 | 0.042 | −0.092 | 0.004 | 0.020 | .03 |
| Bronchiectasis | 0.007 | 0.037 | 0.000 | 0.021 | −0.006 | −0.099 | .32 |
| Interlobular septal thickening | −0.007 | −0.125 | 0.000 | −0.067 | 0.000 | 0.000 | .32 |

aAUC-ROC: area under the receiver operating characteristic curve.
bAP: average precision.
cP values were calculated using the Wilcoxon signed-rank test.
dPositive values indicate higher performance with radiologist-revised translations, whereas negative values indicate higher performance with raw machine translations.
eN/A indicates that the test was not performed because the predicted values were identical.
Our qualitative analysis of CT-BERT-JPN’s performance revealed several illustrative examples of how translation quality affects model predictions. For peribronchial thickening, we observed cases in which CT-BERT-JPN incorrectly predicted negative results for truly positive cases when using radiologist-revised translations. A notable example involved a subtle terminological change from “bilateral peribronchial thickening is observed” in the machine translation to “bilateral bronchial wall thickening is observed” in the radiologist-revised version; this minor linguistic variation caused the model to misclassify a positive case as negative, demonstrating how nuanced differences in expression can significantly affect detection performance. Conversely, for pericardial effusion, we found instances where radiologist refinement improved model accuracy. In one case, “pericardial effusion” was incorrectly translated as “pleural effusion” in the machine-translated report, whereas the radiologist refinement correctly rendered it as “fluid accumulation in the pericardial recess,” leading to an accurate CT-BERT-JPN prediction that would have been incorrect with the raw machine translation. In addition, for medical material, the condition with the third-lowest F1-score, we identified systematic prediction errors that persisted regardless of translation quality, particularly in cases containing expressions such as “changes associated with tracheostomy are observed.” In these cases, both the machine-translated and radiologist-revised versions retained identical phrasing, yet CT-BERT-JPN consistently predicted positive when the correct label was negative. Although the presence of an endotracheal tube following tracheostomy would warrant a positive prediction for medical material, the description alone does not explicitly indicate the current presence of a medical device, suggesting that the model may be confounded by artificial interventions mentioned in the text, leading to false-positive predictions based on procedural context rather than actual device presence.
The 3 key findings of our study regarding the development of Japanese medical imaging resources and analytical capabilities are:
First, an efficient workflow that combines machine translation with expert validation was established to successfully create a large-scale Japanese radiology dataset, maintaining high-quality standards through focused review by radiologists.
Second, despite its relatively compact architecture, our specialized CT-BERT-JPN model demonstrated superior performance to GPT-4o in most structured finding extraction tasks, highlighting the effectiveness of domain-specific optimization.
Third, the model maintained robust performance across both machine-translated and radiologist-revised reports, suggesting the viability of machine translation for training data creation in specialized medical domains.
Dataset Development and Quality Assessment

In the field of medical imaging datasets, several English-language resources have been previously established, including the OpenI dataset [] and MIMIC-CXR [] for chest radiographs and AMOS-MM [] for chest-to-pelvic CT scans; however, these datasets are available exclusively in English, with limited multilingual adaptations. Although many LLMs and vision-language models (VLMs) have multilingual capabilities [,,], their performance consistently degrades when handling non-English languages, including Japanese []. A significant factor contributing to this performance gap is the scarcity of non-English datasets, making the development of such resources an urgent priority.
Machine-translated reports showed minimal differences in basic structural elements compared with the radiologist-revised texts, with word counts and sentence lengths varying by less than 5%. Evaluation of translation quality using the BLEU and ROUGE metrics also demonstrated high performance, indicating that the overall structure and content of the reports were well preserved. However, the Likert-scale evaluation by radiologists on grammar, medical terminology, and readability revealed that raw GPT-4o mini outputs still required substantial refinement in all 3 areas. This limitation necessitated a rigorous refinement process involving expert radiologists, and the radiologist-led corrections efficiently addressed these shortcomings, significantly improving translation quality and proving crucial for establishing a reliable validation dataset.
Our efficient workflow enabled the review and correction of approximately 70,000 characters of machine-translated text, significantly reducing the time and effort required for extensive translation verification. This efficiency was achieved through a structured review process in which radiologists would focus primarily on medical terminology and critical semantic elements rather than reviewing every aspect of the translation. This systematic approach, which combines large language models with domain expert validation, presents a scalable methodology for dataset creation between distinctly different languages, such as English and Japanese. This hybrid process offers a promising framework that can be adapted not only to other languages but also to specialized domains beyond radiology, where precise terminology and domain expertise are critical.
Model Performance and Evaluation

When combined with CT volumes, CT-BERT-JPN demonstrated significant potential for various text-based applications in medical imaging analysis. Our comprehensive evaluation revealed exceptional performance in structured extraction of findings across diverse radiological conditions, with F1-scores consistently above 0.95 and perfect scores for several findings. More notably, CT-BERT-JPN outperformed GPT-4o in 11 of 18 findings (61%), achieving higher F1-scores for clinically significant conditions such as lymphadenopathy (+0.142) and atelectasis (+0.074). This superior performance is particularly remarkable considering that CT-BERT-JPN has approximately 110 million parameters, whereas recent large language models often employ hundreds of billions of parameters []. Our results demonstrate that a specialized, compact language model can achieve state-of-the-art performance in domain-specific tasks, even when compared with more sophisticated general-purpose models. While promising, our results should be interpreted cautiously given the limited validation dataset of 150 reports and the notable class imbalance for certain findings (eg, 7 cases for interlobular septal thickening and 12 cases for pericardial effusion). Although we used both F1-scores and AP metrics to address data imbalance, the 95% CIs reveal substantial variability for rare findings (eg, interlobular septal thickening: 0.933, 95% CI 0.750-1.000), where even high F1-scores may be statistically unstable because of small sample sizes. Larger-scale validation studies will be necessary for a more definitive performance assessment.
Furthermore, the model maintained stable performance across both unmodified machine translations and radiologist-revised reports. Some conditions showed moderate performance variations between machine-translated and radiologist-revised reports, particularly in findings where machine translation errors were common (F1-score differences of −0.143 for peribronchial thickening and −0.092 for consolidation).