Obtaining Clinical Term Embeddings from SNOMED CT Ontology

Deep learning based methods have shown success on several natural language processing (NLP) tasks, including in the clinical domain [1]. A critical component of all deep learning based NLP methods is the representation of words as numerical vectors, also known as word embeddings. Given that neural networks can only take input in numerical form, word embeddings provide a suitable mechanism for giving words, which are otherwise symbolic, as input to neural networks. Additionally, from a machine learning perspective, they provide a way to generalize from words seen during training to words not seen during training, by leveraging the fact that words with similar meanings have similar embeddings. Word embeddings are commonly obtained using corpus-based methods [2], which rest on the premise that words found in similar contexts have similar meanings and hence should have similar embeddings. Although this is a reasonable premise, corpus-based methods require every word to occur frequently enough in the corpus for its embedding to be learned reliably. This is not always the case, especially in the clinical domain, where the names of many diseases or medications may occur only rarely in a corpus. For example, consider the disease names “pneumonia” and “pneumoconiosis”, which are both inflammatory disorders of the lungs and hence have similar meanings. For a corpus-based method to learn similar embeddings for them, these words would need to appear in the corpus in similar contexts multiple times, which may not happen in a clinical corpus. In addition, the embedding of each synonym of a disease has to be learned independently. Given that there are more than a million clinical terms, it is not surprising that corpus-based embeddings were found to perform poorly on the clinical term similarity prediction task [3].
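
To make this limitation concrete, the following is a minimal sketch of a corpus-based method using gensim's Word2Vec (4.x API); the toy corpus and all parameters are our own and purely illustrative. With so few co-occurrences, related terms need not receive similar embeddings, and words absent from the corpus receive no embedding at all.

```python
# Minimal illustration of corpus-based word embeddings (gensim >= 4.x).
# The toy corpus is fabricated for illustration; real methods train on
# large corpora, but rare clinical terms remain the core problem.
from gensim.models import Word2Vec

corpus = [
    ["patient", "diagnosed", "with", "pneumonia", "of", "the", "lung"],
    ["pneumoconiosis", "is", "an", "inflammatory", "disorder", "of", "the", "lung"],
    ["chest", "xray", "confirmed", "pneumonia"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1,
                 epochs=50, workers=1, seed=0)

# Similarity is only as good as the contexts observed in the corpus:
# with so few co-occurrences, related terms need not look related.
print(model.wv.similarity("pneumonia", "pneumoconiosis"))

# Words absent from the corpus simply have no embedding at all.
print("meningitis" in model.wv)  # False
```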

Ontologies encode the knowledge of a domain in the form of a graph, with concepts as nodes and relations between them as edges [4]. Medical ontologies, such as SNOMED CT [5], directly encode semantic properties of medical concepts in the graph. For example, the concepts of “pneumonia” and “pneumoconiosis” are both linked by the “is-a” relation to the concept of “disorder of respiratory structure”, by the “finding site” relation to the concept of “lung structure”, and by the “associated morphology” relation to the concept of “inflammation”. Given that they share multiple relations with other concepts, it can be explicitly and directly inferred from SNOMED CT that the concepts of “pneumonia” and “pneumoconiosis” are similar. In contrast, corpus-based methods can learn this only implicitly and indirectly from their contexts, and only if the terms occur frequently enough, as pointed out earlier. Hence, knowledge from ontologies can serve as an alternative resource for learning word embeddings.
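
The relation triples in the example above can be viewed as a small labeled directed graph. The following sketch uses networkx, with human-readable names standing in for the numeric SNOMED CT concept identifiers; it shows how the similarity of “pneumonia” and “pneumoconiosis” can be read directly off their shared relations.

```python
# A sketch of SNOMED CT-style relations as a labeled directed graph.
# Node names stand in for numeric SNOMED CT concept identifiers; the
# relation triples mirror the example in the text.
import networkx as nx

g = nx.MultiDiGraph()
triples = [
    ("pneumonia", "is-a", "disorder of respiratory structure"),
    ("pneumoconiosis", "is-a", "disorder of respiratory structure"),
    ("pneumonia", "finding site", "lung structure"),
    ("pneumoconiosis", "finding site", "lung structure"),
    ("pneumonia", "associated morphology", "inflammation"),
    ("pneumoconiosis", "associated morphology", "inflammation"),
]
for head, relation, tail in triples:
    g.add_edge(head, tail, relation=relation)

def shared_relations(graph, a, b):
    """Relation-target pairs that two concepts have in common."""
    def edges(n):
        return {(d["relation"], t) for _, t, d in graph.out_edges(n, data=True)}
    return edges(a) & edges(b)

# Pneumonia and pneumoconiosis share all three relations, so their
# similarity can be inferred from the ontology directly.
print(shared_relations(g, "pneumonia", "pneumoconiosis"))
```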

In the general domain, the WordNet ontology has been used to learn word embeddings using graph-based methods [6]. However, unlike WordNet, in which the words themselves are the nodes of the graph, in medical ontologies the nodes are medical concepts. A medical concept (typically denoted by an identifier in the ontology) may be associated with multiple terms, each consisting of multiple words. For example, SNOMED CT contains the concept of viral meningitis (id=58170007), with the associated clinical terms (known as descriptions in SNOMED CT) “viral meningitis”, “abacterial meningitis” and “aseptic meningitis, viral”. While a graph-based method will obtain an embedding for the concept 58170007, it will not obtain embeddings for words such as “meningitis” or “viral”. Nor will it give embeddings for previously unseen clinical terms, even when they are composed of previously seen words, such as “bacterial meningitis”.
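
This gap can be stated in a few lines of Python: a graph-based method yields a lookup table keyed by concept identifiers, so neither individual words nor unseen composite terms have entries. The embedding values below are dummy placeholders.

```python
# Why concept-level graph embeddings leave a gap: a graph method yields
# one vector per concept identifier, while each concept carries several
# multi-word descriptions. The embedding value is a dummy placeholder.
import numpy as np

concept_embeddings = {
    "58170007": np.random.default_rng(0).normal(size=50),  # viral meningitis
}

descriptions = {
    "58170007": ["viral meningitis",
                 "abacterial meningitis",
                 "aseptic meningitis, viral"],
}

# There is no entry for the words "meningitis" or "viral", nor for an
# unseen composite term such as "bacterial meningitis".
print("meningitis" in concept_embeddings)             # False
print("bacterial meningitis" in concept_embeddings)   # False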

In this paper, we present a novel method to obtain clinical term and word embeddings from the SNOMED CT ontology. After obtaining embeddings for concepts using a graph-based method, a deep learning network is trained to map clinical terms to these embeddings. In this process, the network learns the embeddings of clinical terms and words, including their synonyms, and learns to produce embeddings for previously unseen clinical terms. To the best of our knowledge, this is the first method that obtains clinical term embeddings from clinical concept embeddings. Using standard benchmark datasets, the method was evaluated on the clinical term similarity prediction task and on the clinical term normalization task. Both corpus-based and ontology-based embeddings suffer from the limitation that they tend to produce similar embeddings for terms with opposite or analogous meanings, for example, “left kidney” and “right kidney”. Although these terms look similar, they clinically mean very different things and hence should not be treated as similar. To counter this limitation, we also introduce a method to automatically learn patterns from the UMLS [7] that indicate whether two clinical terms have the same meaning. Not only did these patterns improve normalization performance for both corpus-based and ontology-based embeddings, but they could also be used as a resource to further improve clinical term embeddings in the future.
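
As a rough illustration of the core idea (a simplified sketch, not the exact network used in this work), the following PyTorch snippet trains an averaging encoder to map a term's words onto a pre-computed concept embedding; the vocabulary, dimensions, and dummy target vector are invented for the example. Once trained, even an unseen term composed of seen words, such as “bacterial meningitis”, receives an embedding by construction.

```python
# Simplified sketch: regress averaged word embeddings of a clinical term
# onto its concept's pre-computed graph embedding. The vocabulary and the
# dummy target vector are invented for illustration.
import torch
import torch.nn as nn

vocab = {"viral": 0, "meningitis": 1, "abacterial": 2, "aseptic": 3, "bacterial": 4}
dim = 50

torch.manual_seed(0)
concept_vec = torch.randn(1, dim)  # stand-in for the embedding of concept 58170007

encoder = nn.EmbeddingBag(len(vocab), dim, mode="mean")
optimizer = torch.optim.Adam(encoder.parameters(), lr=0.01)
loss_fn = nn.CosineEmbeddingLoss()

# Each synonym (description) of the concept is trained toward the same target.
terms = [["viral", "meningitis"],
         ["abacterial", "meningitis"],
         ["aseptic", "meningitis", "viral"]]

for _ in range(200):
    for words in terms:
        ids = torch.tensor([vocab[w] for w in words])
        pred = encoder(ids, offsets=torch.tensor([0]))   # shape (1, dim)
        loss = loss_fn(pred, concept_vec, torch.tensor([1.0]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# An unseen term composed of seen words still receives an embedding,
# built from the word embeddings learned during training.
unseen = torch.tensor([vocab["bacterial"], vocab["meningitis"]])
vec = encoder(unseen, offsets=torch.tensor([0]))
print(nn.functional.cosine_similarity(vec, concept_vec).item())
```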
