Advancing Italian Biomedical Information Extraction with Transformers-based Models: Methodological Insights and Multicenter Practical Application

Digital technologies increasingly pervade every aspect of our lives, and healthcare is no exception. In recent years there has been a rapid adoption of digital health tools [1]. This new technological paradigm has led to a dramatic increase in digitized medical text data produced in the everyday routine of healthcare institutions (e.g., discharge letters, examination results, medical notes) [2]. These documents, while highly informative, are unstructured and not harmonized, a barrier that leads to their under-use and under-exploitation. This lowers the efficiency of both clinical and research environments, since extracting such information into structured databases is time-consuming: physicians spend about 35% of their time documenting patient data [3].

Artificial Intelligence (AI), and in particular Natural Language Processing (NLP), could provide useful tools to overcome these limitations. NLP is a collection of techniques and tools for processing texts written in human language. Some examples of NLP tasks are: Named Entity Recognition (NER), which assigns words to predefined categories (e.g., person, location); Relation Extraction (RE), which connects named entities in a text through semantic relations; and Question Answering (QA), whose goal is to find answers to questions posed by humans. In the last decade, NLP has shifted to Deep Learning (DL) approaches, and a large number of models have been implemented. The advent of the Transformer architecture [4] unlocked the creation of highly performing models, and the well-known Bidirectional Encoder Representations from Transformers (BERT) [5], developed by Google in 2019, established itself as the de facto state of the art. Several BERT-based models followed shortly after. These models are usually created in a two-step process:

The first step is pre-training, an unsupervised procedure in which the model is fed a huge amount of unlabeled text (e.g., the BERT pre-training corpus comprises 3.3 billion words). Pre-training is based on the mechanism of Masked Language Modeling (MLM): a random portion of the words in a sentence is masked, and the model tries to predict them from the surrounding context. At the end of pre-training, the model has a general knowledge of the language. The texts used for pre-training are usually referred to as corpora (i.e., large collections of written texts).
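As a concrete illustration of the MLM objective, the minimal sketch below (assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint, used here purely as an example) asks a pre-trained model to fill in a masked word:

```python
# Minimal illustration of Masked Language Modeling (MLM) with a pre-trained BERT.
# Assumes the Hugging Face `transformers` library and the public
# `bert-base-uncased` checkpoint; any BERT-like model would behave similarly.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The model predicts the masked token from the surrounding context.
for prediction in fill_mask("The patient was discharged with a [MASK] of pneumonia."):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```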

The second step is fine-tuning, a supervised training phase in which the model is fed a relatively small set of labeled examples (e.g., a well-known QA dataset, the Stanford Question Answering Dataset (SQuAD) [6], contains about 100 thousand examples) and learns to perform a specific task. When speaking of fine-tuning, we usually refer to datasets, i.e., structured collections of data used for a specific purpose.
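To make the distinction between corpora and datasets concrete, the short sketch below (assuming the Hugging Face datasets library, which distributes SQuAD) loads and prints one labeled question-answering example:

```python
# Load one labeled example from SQuAD to show what a fine-tuning dataset looks
# like: each record pairs an input (question + context) with a label (the
# answer span). Assumes the Hugging Face `datasets` library.
from datasets import load_dataset

squad = load_dataset("squad", split="train")
example = squad[0]

print("Question:", example["question"])
print("Context :", example["context"][:200], "...")
print("Answer  :", example["answers"]["text"][0])
```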

One of the main limitations of this process is that a considerable amount of text is required in the pre-training phase to achieve good results. For this reason, the models available in the literature are often trained on generic corpora (e.g., the main BERT corpus is the English Wikipedia), and they struggle when it comes to specialized topics.

However, efforts have been made to overcome this limitation. Biomedical BERT (BioBERT) [7] is one of the best-known and most successful such models. It was developed with the same approach as the original BERT, with the key difference that the pre-training corpus consists of PubMed abstracts and full-text articles, totaling 18 billion words. BioBERT performs better than the original BERT on various NLP tasks involving biomedical documents, showing that a topic-specific pre-training corpus is a crucial factor for high performance in a specialized domain such as biomedicine. Another example is SciBERT [18], which exploits the original BERT architecture but trains the model from scratch on a different corpus (1.14 million scientific papers from Semantic Scholar). Thanks to this, SciBERT is able to incorporate a custom vocabulary that reflects the in-domain word distribution more accurately. Along this line, several biomedical NLP models have recently been proposed to address the aforementioned NLP tasks:

BioNER [8], [9], used to identify specific medical entities in a text (e.g., drugs, medical tests, dosages, scores). Dihn et al. [19] developed a tool able to identify antibody and antigen entities, achieving an F1-score of 81.44%. Li et al. [21] compared four biomedical BERT models (BioBERT, SciBERT, BlueBERT [22], PubMedBERT [23]) and two open-domain models (BERT and SpanBERT [24]) by fine-tuning them on three clinical datasets, showing that the domain-specific models outperformed the open-domain ones, with the best model achieving an F1-score of 83.6%. Yeung et al. [47] developed a BioBERT-based tool to identify metabolites in cancer-related metabolomics articles with an F1-score of 90.9%. Dang et al. [48] created a Long Short-Term Memory (LSTM [49]) network and tested it on the NCBI disease dataset [40], obtaining an F1-score of 84.41%, while Cho et al. [50] performed the same test with their bidirectional LSTM-based tool and achieved an F1-score of 85.68%. Finally, it is worth noting that, although the vast majority of these tools were developed using English corpora, some work has been done for other languages as well. Chen et al. [51] developed a BERT-based hybrid network, and Li et al. [52] developed a DL model incorporating dictionary features; both systems were tested on the 2017 version of the China Conference on Knowledge Graph and Semantic Computing dataset and achieved F1-scores of 94.22% and 91.60%, respectively.

BioRE [10], [11], used after NER, in order to connect medical entities (e.g., drugs and their dosages).

BioQA [12], [13], aimed at finding answers to specific questions in a medical text.

BioNER, BioRE, and BioQA are tools used in Information Extraction (IE), one of the main NLP subtasks. The goal of IE is to make the semantic structure of a text explicit so that it can be exploited [14]; an example is available in the Supplementary notes, in the section “Information Extraction example”.
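To make the IE workflow concrete, the following minimal sketch (assuming the Hugging Face transformers library; the general-purpose English checkpoint dslim/bert-base-NER is used purely for illustration) applies a fine-tuned NER model to a sentence and prints the extracted entities:

```python
# Sketch of NER-based information extraction with a fine-tuned BERT model.
# `dslim/bert-base-NER` is a general-purpose English checkpoint used here only
# for illustration; a domain-specific (e.g., biomedical or Italian) model
# would simply replace the model name.
from transformers import pipeline

ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = "John Smith was admitted to the Mayo Clinic in Rochester."
for entity in ner(text):
    print(f"{entity['entity_group']:<6} {entity['word']:<20} score={entity['score']:.2f}")
```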

Some gaps remain, however. So-called “less-resourced” languages, such as Italian, are underrepresented in this scenario: models for very specific medical topics in these languages are lacking, although some examples can be found in the literature [15], [16]. This is because, even in the biomedical domain, the vast majority of models are trained on English corpora, mainly since it is difficult to find a sufficiently large medical corpus in these languages [17].

In this paper we aim to overcome these limitations by using Italian biomedical BERT models and fine-tuning them for the NER task on a specific medical topic, namely neuropsychiatry. The models we have created could be used to implement IE tools, sparing highly specialized clinical staff lengthy and repetitive procedures.
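As an outline of this approach, the sketch below shows the standard token-classification fine-tuning recipe with the Hugging Face Trainer API. The multilingual checkpoint and the toy annotated sentence are illustrative placeholders only, not the actual Italian biomedical models and neuropsychiatric data described later in the paper:

```python
# Hedged sketch of the standard BERT fine-tuning recipe for NER (token
# classification) with the Hugging Face Trainer API. The model name and the
# tiny in-memory dataset are illustrative placeholders, not the actual
# Italian neuropsychiatric resources used in this work.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-DRUG", "I-DRUG"]           # toy BIO tag set
model_name = "bert-base-multilingual-cased"  # stand-in for an Italian biomedical BERT

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

# Toy training set: pre-tokenized words with word-level BIO labels.
train = Dataset.from_dict({
    "tokens": [["Paziente", "in", "terapia", "con", "valproato", "di", "sodio"]],
    "ner_tags": [[0, 0, 0, 0, 1, 2, 2]],
})

def tokenize_and_align(example):
    # Subword tokenization splits words; word-level labels must be re-aligned
    # to subwords, and special tokens are ignored with the -100 label.
    enc = tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [-100 if w is None else example["ner_tags"][w]
                     for w in enc.word_ids()]
    return enc

train = train.map(tokenize_and_align, remove_columns=train.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-demo", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```

In practice, the placeholder checkpoint and the toy example would be replaced by an Italian biomedical BERT model and the annotated neuropsychiatric corpus, respectively.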
