The augmented value of using clinical notes in semi-automated surveillance of deep surgical site infections after colorectal surgery

Study design, setting and study population

This is a retrospective, observational cohort study including patients who underwent colorectal surgery (i.e., primary or secondary colorectal resections, incisions or anastomoses) performed at the Karolinska University Hospital (KUH), Sweden, between January 1, 2015 and August 31, 2020. KUH is a tertiary care academic centre with 1,100 beds divided between two hospitals (Huddinge and Solna), serving the population of Region Stockholm (2.3 million inhabitants). The original semi-automated algorithm [18] – based on structured data – was validated at KUH [27]. The same 225 randomly selected surgeries were also used as the validation cohort for this study. The semi-automated algorithm was subsequently applied to the remaining colorectal surgeries, and a random sample of 250 high-probability records was selected for the development cohort (Fig. 1). Model results were compared with the reference standard, i.e., traditional manually annotated surveillance. The study was approved by the Regional Ethical Review Board in Stockholm, Sweden (2018/1030-31).

Fig. 1

Flow chart of the study

ECDC: European Centre for Disease Prevention and Control; NLP: natural language processing; SSI: surgical site infection.

Semi-automated algorithm as published in Verberk et al. [18].

Outcome

The outcome was deep SSI or organ/space SSI, hereinafter jointly referred to as deep SSI, versus no deep SSI within 30 days after the colorectal procedure. The outcome was recorded during manual annotation by two experienced infection control practitioners (ICPs), according to the European Centre for Disease Prevention and Control (ECDC) SSI definition and guidelines in effect at the time of this study [6].

Data sources

The 2SPARE (2020 started Stockholm/Sweden Proactive Adverse Events REsearch) database is an SQL-based relational database duplicating prospectively entered data from the EHR system of KUH; it contains data on patient characteristics, hospital admission and discharge records, outpatient records, physiological parameters, medication, microbiology, clinical chemistry, radiology, and clinical notes. The clinical notes comprise unstructured, free-text documents such as progress notes, discharge summaries, history and physical examination notes, and telephone encounter notes, all written in Swedish. We limited the notes to those written by physicians, residents, surgery assistants, and nurses, and to those written within 1–30 days post-surgery, as these are most likely to contain SSI-relevant information.

Development and validation of NLP-augmented algorithms

The NLP algorithms were developed as an ‘add-on’ component, designed as an additional step following the existing semi-automated algorithm, aiming to reduce false-positive results whilst maintaining high sensitivity. This sequential design arguably lowers the implementation threshold, as hospitals can start by implementing the semi-automated algorithm with structured data and later add the (more advanced and challenging) NLP component (Fig. 2). Several NLP components were developed using the development cohort, which consists of high-probability records. The final NLP-augmented algorithms were validated using the validation cohort described above (Fig. 1).

Fig. 2

Flow diagram of natural language processing-augmented surveillance algorithm for deep surgical site infections

NLP: natural language processing; SSI: surgical site infection.

Schematic overview of the original semi-automated algorithm comprising structured data (grey frame), augmented with unstructured data from clinical notes (blue frame).

Admissions: Length of stay of index admission ≥ 14 days or 1 readmission to original department or in-hospital mortality within follow-up (FU) time (= 45 days after surgery).

Re-surgery: ≥1 reoperation by original surgery specialty after the index surgery but within FU time.

Radiology: ≥1 order for a CT scan within FU time.

Antibiotics: ≥3 consecutive days of antibiotics (ATC J01) within FU time, starting from day 2 (index surgery = day 0).

Fulfilling 2–4 of the above components classifies surgery as high probability for deep SSI.
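The component logic above can be sketched as follows (the function and parameter names are illustrative, not from the published algorithm):

```python
def high_probability(admission_flag, resurgery_flag, radiology_flag, antibiotics_flag):
    """Classify a surgery as high probability for deep SSI when 2-4 of the
    four structured-data components (admissions, re-surgery, radiology,
    antibiotics) are fulfilled."""
    return sum([admission_flag, resurgery_flag, radiology_flag, antibiotics_flag]) >= 2
```

For example, a reoperation combined with ≥3 consecutive days of antibiotics would be enough to flag a record as high probability.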

Pre-processing of linguistic variables

A list of keywords was compiled by reviewing clinical literature and local case reports, and through expert consultation in the Netherlands and Sweden (i.e., colorectal surgeons, medical microbiologists, ICPs, infectious disease consultants). Next, we created a list of lemmatized versions of the keywords and applied part-of-speech tagging to capture grammatical and spelling variants of the words, resulting in the overall lexicon list. The keywords provided by Dutch experts were translated into Swedish so that they could be applied to the Swedish notes, and all keywords were translated into English for reporting purposes.
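As an illustration, the resulting lexicon can be represented as a mapping from each (English-translated) keyword to its Swedish surface forms; the keywords and inflected forms below are placeholder examples, not the study's actual lexicon:

```python
# Illustrative lexicon: each keyword maps to lemmatized/inflected surface
# forms; the real list was built via lemmatization and part-of-speech
# tagging of expert-provided keywords.
LEXICON = {
    "abscess":   {"abscess", "abscessen", "abscesser"},
    "infection": {"infektion", "infektionen", "infektioner"},
}

def keyword_for(token):
    """Return the lexicon keyword that a note token matches, if any."""
    t = token.lower()
    for keyword, forms in LEXICON.items():
        if t in forms:
            return keyword
    return None
```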

Feature selection and algorithm development

The original keywords and their lemmatized versions can be considered ‘features’. All text from the clinical notes was matched against the lexicon list and each feature match was counted. Negation detection using the NegEx algorithm was applied to filter out negated mentions [28]. For example, if ‘no signs of infection’ is written, the keyword ‘infection’ is negated and not counted as a keyword match. Subsequently, three input types were considered: a count per keyword, a discretized count with four bins, and a binary indicator of the presence or absence of a keyword. Each input type has its pros and cons: a binary representation benefits from its simplicity but cannot capture the case where several mentions of the same keyword correspond to a stronger deep SSI signal. The count per keyword, on the other hand, captures the number of times each keyword is mentioned, but is more sensitive to writing styles and will have fewer examples of each distinct keyword count in the training data. The discretized count can be viewed as a compromise between the binary and count models, since three of the bins represented a count below, within, or above the expected interquartile range, and the fourth bin represented no occurrences.
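A simplified sketch of the matching and featurization steps described above (the negation cues, English keyword, and quartile bounds are placeholders; the study applied the full NegEx algorithm to Swedish notes):

```python
import re

NEGATION_CUES = {"no", "not", "without"}  # NegEx-style triggers, simplified

def count_matches(note, keyword):
    """Count non-negated occurrences of `keyword`: a match is skipped when a
    negation cue appears among the few tokens preceding it, a simplified
    stand-in for the NegEx algorithm."""
    tokens = re.findall(r"\w+", note.lower())
    count = 0
    for i, tok in enumerate(tokens):
        if tok == keyword and not (set(tokens[max(0, i - 3):i]) & NEGATION_CUES):
            count += 1
    return count

def featurize(raw, q1=1, q3=3):
    """The three input types: raw count, discretized count (none / below /
    within / above an expected interquartile range), binary presence."""
    if raw == 0:
        binned = "none"
    elif raw < q1:
        binned = "below"
    elif raw <= q3:
        binned = "within"
    else:
        binned = "above"
    return {"count": raw, "binned": binned, "binary": int(raw > 0)}
```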

During development, we split the development cohort, consisting of high-probability records as classified by the original algorithm, into training (80%) and testing (20%) data sets to evaluate the parameters of the learning algorithms. Two tree-based classification algorithms, a single decision tree (DT) and a random forest (RF) with 500 trees [29], were evaluated for their ability to discriminate between the two classes, deep SSI and no deep SSI [30, 31]. A DT has the benefit of being interpretable, since the tree can be understood as one set of rules for classifying future patients as belonging to either class. An RF, on the other hand, is a more complex model with multiple sets of rules and therefore lacks interpretability, but often outperforms a DT. Each classifier, DT and RF, was applied to each feature representation (raw counts, discretized counts, or binary counts), resulting in six tree-based models.

For application in semi-automated surveillance, near-perfect sensitivity is required, as false-positive cases are corrected during subsequent chart review whereas false-negative cases remain unnoticed. To increase the sensitivity when using the DT classifier, ten small decision trees with slightly different characteristics were inferred from the development data. Subsequently, in the validation cohort, a patient was classified as deep SSI if any of these trees classified the patient as such. This ensemble of DT classifiers can be considered a miniature forest with a decision threshold of 0.1. Within an RF, each tree classifies each patient in the data set. Generally, for an RF with two classes, a majority decision determines class membership, i.e., the class assigned by a majority of the trees is assigned to the patient. This corresponds to a decision threshold of 0.5, meaning that 50% of the trees are required to consider a patient as belonging to the class deep SSI for the RF to classify it as such. To increase the sensitivity of the RF, the conventional decision threshold of 0.5 was lowered, meaning that fewer trees are required to classify a patient as deep SSI, which will, however, reduce the positive predictive value (PPV). Multiple decision thresholds were explored using the development cohort, and thresholds of 0.3 (for the models using raw or discretized counts) and 0.35 (for the model using binary counts) were selected to ensure high sensitivity (> 0.95 in the development cohort).
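The thresholded ensemble vote described above reduces to a single comparison (a minimal sketch; `tree_votes` stands in for the per-tree class assignments of either ensemble):

```python
def ensemble_deep_ssi(tree_votes, threshold):
    """Classify as deep SSI when the fraction of trees voting positive (1)
    reaches `threshold`: 0.5 is a plain majority vote, 0.1 reproduces the
    ten-tree DT ensemble (any single tree suffices), and 0.3 / 0.35 are the
    lowered RF thresholds chosen in the study."""
    return sum(tree_votes) / len(tree_votes) >= threshold
```

For a 500-tree RF at threshold 0.3, for example, 150 positive votes are enough to classify a patient as deep SSI, trading PPV for sensitivity.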

Rule-based NLP component

Furthermore, a rule-based NLP component was developed based on keywords reflecting the deep SSI definition (Fig. 3). This NLP component is more straightforward as no DT or RF techniques are used, i.e., if a keyword match was present for a patient according to the OR/AND-rules as specified in the supplementary file, the patient was classified as probable deep SSI by the algorithm.
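The rule structure can be illustrated as follows; the actual OR/AND rules are specified in the supplementary file, so the keywords and combinations below are hypothetical:

```python
def rule_based_deep_ssi(matched_keywords):
    """Illustrative OR/AND rule structure: classify as probable deep SSI if
    'abscess' is matched, OR if both a wound-opening keyword AND a
    purulence keyword are matched (placeholder rules, not the study's)."""
    return ("abscess" in matched_keywords
            or {"wound_opened", "pus"} <= set(matched_keywords))
```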

Fig. 3

Analysis

Baseline characteristics were compared between the high-probability groups – as defined by the original algorithm – of the development and validation cohorts. Heat maps were created from the development cohort to visualize the presence of keywords in the deep SSI group versus the group without deep SSI. In total, seven NLP-augmented surveillance models, as described above, were compared with the original semi-automated model composed of structured data only (model 1): model 1 augmented with the NLP component developed with DT using either raw counts (model 2), discretized counts (model 3) or binary counts (model 4); model 1 augmented with the NLP component developed with RF using either raw counts (model 5), discretized counts (model 6) or binary counts (model 7); and model 1 augmented with the rule-based component (model 8). The performance measures sensitivity (recall), specificity, PPV (precision), negative predictive value (NPV) and workload reduction (WR), with corresponding 95% confidence intervals (95% CI), were calculated against the reference standard. WR was defined as the proportion of surgeries under surveillance that no longer require manual review after algorithm application. 2SPARE data acquisition, management and analysis were performed using R statistical software (version 3.6.1) and Python (version 3.7), in accordance with current regulations concerning privacy and ethics.
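The performance measures follow directly from the confusion counts against the reference standard (a minimal sketch; the WR formula reflects the reading of the definition given above, with flagged records being those still sent to chart review):

```python
def performance(tp, fp, fn, tn):
    """Performance versus the manual reference standard. Workload reduction
    (WR) is computed as the proportion of surgeries under surveillance that
    the algorithm excludes from manual chart review."""
    total = tp + fp + fn + tn
    flagged = tp + fp  # records still requiring manual review
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "workload_reduction": (total - flagged) / total,
    }
```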
