This work proposes a flexible DL-based framework for the interpretation of lung POCUS videos (Fig. 3), which rests on four core blocks: (1) a pre-processing block; (2) a supervised learning block; (3) a semi-supervised learning (SSL) block; and (4) a model ensemble block.
In brief, the pre-processing block (Sect. “Pre-processing Block”) standardises the input, masks out any information outside the field-of-view (FOV), and splits the video into multiple overlapping clips (each assumed to share the full-video labels). The supervised learning block (Sect. “Supervised Block”) utilises the available labeled data to train a 3D convolutional neural network (CNN) while employing data augmentation (Sect. “Data Augmentation”) and label smoothing regularisation. The trained classifier provides per-clip outputs that are then aggregated into a video-level prediction (Sect. “Video-Level Inference Routine”). In the SSL block (Sect. “Semi-supervised Block”), one employs the previously trained classifier to predict pseudo-labels for the unlabeled dataset and selects those with high confidence and low uncertainty using the Uncertainty-aware Pseudo-label Selection (UPS) method [22]. Leveraging both labeled and pseudo-labeled samples, the proposed network is trained once more from scratch. The classifier’s performance is further boosted through ensemble modeling (Sect. “Ensemble Modeling”) using a novel strategy that leverages the hierarchy inherent to LUS interpretation.
Pre-processing Block

Ultrasound videos contain additional information and markings surrounding the sector scan, such as the depth indication or details about the machine’s settings. This superfluous information may negatively influence training, as the network may focus on it during its learning process. For this reason, all videos were pre-processed to mask out any information outside the FOV using a custom masking algorithm (described in the supplementary material). Details concerning the scan sector are also extracted, namely the probe’s virtual origin, the radii that define the sector’s superior and inferior arcs, and its opening angle, which are used later in the data augmentation phase.
All videos were scaled to \(128\times 128\) pixels by nearest neighbour interpolation (to minimise smoothing effects), with padding being used when needed. Additionally, pixel values were converted to grey-scale and scaled to the [0, 1] range by dividing by 255.
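As a rough illustration, the snippet below sketches this per-frame standardisation assuming OpenCV and NumPy; `preprocess_frame`, the `fov_mask` argument, and the top-left placement of the square zero-padding are illustrative assumptions, and the FOV-masking algorithm itself (supplementary material) is not reproduced.

```python
import cv2
import numpy as np

TARGET_SIZE = 128  # output resolution used in this work

def preprocess_frame(frame_bgr: np.ndarray, fov_mask: np.ndarray) -> np.ndarray:
    """Grey-scale, FOV-mask, pad, resize with nearest neighbour, and scale to [0, 1]."""
    grey = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    grey = np.where(fov_mask > 0, grey, 0)          # zero out everything outside the FOV

    # Pad to a square before resizing so the aspect ratio is preserved.
    h, w = grey.shape
    side = max(h, w)
    padded = np.zeros((side, side), dtype=grey.dtype)
    padded[:h, :w] = grey

    resized = cv2.resize(padded, (TARGET_SIZE, TARGET_SIZE),
                         interpolation=cv2.INTER_NEAREST)  # nearest neighbour to limit smoothing
    return resized.astype(np.float32) / 255.0
```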
Supervised Block

Given the necessity for each clip to encompass a full respiratory cycle (typically around 4 s) to ensure adequate LUS assessment, the network’s input was set to 32 frames, with clips sampled at a rate of 8 Hz. In instances where videos contain an insufficient number of frames, empty frames are appended to the clip’s end.
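A minimal sketch of the clip extraction, assuming the video has already been pre-processed as above; `extract_clip` and its arguments are hypothetical names.

```python
import numpy as np

CLIP_LEN = 32      # frames per clip, covering roughly one respiratory cycle at 8 Hz
CLIP_FPS = 8.0     # clip sampling rate

def extract_clip(video: np.ndarray, video_fps: float, start_s: float = 0.0) -> np.ndarray:
    """Sample CLIP_LEN frames at CLIP_FPS starting at `start_s` seconds,
    appending empty (all-zero) frames if the video is too short."""
    step = video_fps / CLIP_FPS                               # stride in original frames
    idx = (start_s * video_fps + np.arange(CLIP_LEN) * step).round().astype(int)
    clip = video[idx[idx < len(video)]]
    if len(clip) < CLIP_LEN:                                  # pad at the clip's end
        pad = np.zeros((CLIP_LEN - len(clip), *video.shape[1:]), dtype=video.dtype)
        clip = np.concatenate([clip, pad], axis=0)
    return clip
```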
We employ the R2+1D network with 18 layers, proposed in [23]. This network is a ResNet-based architecture and gets its name from the factorisation of the 3D convolutional filters into spatial (2D) and temporal (1D) operations. By adding a non-linearity between these operations, the number of non-linear rectifications doubles compared to the non-factorised version, R3D, which translates into a model capable of representing more complex functions. Additionally, unlike 3D filters, where spatial and temporal dynamics are intertwined, R2+1D yields a model that is easier to optimise [23]. Note, however, that differently from [23], the proposed implementation does not increase the number of filters per convolutional layer, which reduces the number of parameters compared to R3D and, consequently, its memory usage and training time.
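The following is a minimal Keras sketch of one factorised (2+1)D convolution, keeping the same number of filters in both factors as stated above; the layer hyper-parameters are assumptions, not the exact configuration of [23].

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv2plus1d(x, filters, k_spatial=3, k_temporal=3, strides=(1, 1, 1)):
    """Factorised 3D convolution: a (1, k, k) spatial convolution followed by a
    (k, 1, 1) temporal one, with an extra non-linearity in between."""
    x = layers.Conv3D(filters, (1, k_spatial, k_spatial),
                      strides=(1, strides[1], strides[2]), padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)       # the additional rectification absent in plain R3D
    x = layers.Conv3D(filters, (k_temporal, 1, 1),
                      strides=(strides[0], 1, 1), padding="same", use_bias=False)(x)
    return x
```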
Data Augmentation

To increase the generalisation capabilities of the model, particularly considering it is a 3D network, data augmentation was utilised.
All ultrasound videos share the same beam direction (originating from the probe’s virtual origin), and in LUS exams this produces characteristic findings (e.g. A-lines and B-lines) whose features can only occur along the beam direction. Hence, extra caution is needed when choosing the augmentations to apply, since some could create unrealistic videos and reduce the clinical significance of the resulting classifier. To tackle this, one applies a few vanilla augmentations in the polar space instead of the Cartesian one. To perform these transformations, one needs the probe’s virtual origin (acting as the pole), as well as the sector scan’s opening angle (limits of the angular coordinate) and radii (limits of the radial coordinate). These values, as previously mentioned, were saved when the sector scan’s mask was created.
Specifically, the video clip’s scan sector is first converted to polar space (with columns and rows corresponding to the angular and radial axes, respectively). In this space, a 1D scaling transform is applied over the radial axis, using the pole (i.e. the first row of the polar image) as origin. In Cartesian space, this corresponds to varying the image’s axial resolution by artificially modifying its depth. A rotation transformation is then applied through a translation along the angular axis. Again, the probe’s virtual origin is fixed, which in Cartesian space represents the rocking of the probe. Still in polar coordinates, we apply a linear contrast augmentation. Only afterwards is the clip converted back into Cartesian coordinates using the original pole, opening angle, and radii, which guarantees that the content remains restricted to the scan’s FOV (and that the background intensity is kept unchanged). Finally, a horizontal flip transform is considered. These transformations were implemented using the Solt [24] package.
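For illustration only, the snippet below sketches the depth-scaling and rocking transforms in polar space using OpenCV and NumPy; it is not the Solt-based implementation used in this work, and `polar_augment`, its arguments, and the omitted FOV re-masking are assumptions. Note also that `cv2.warpPolar` places the radius along the columns and the angle along the rows, the transpose of the convention described above.

```python
import cv2
import numpy as np

def polar_augment(frame, origin, max_radius, scale=1.0, rot_deg=0.0):
    """Depth scaling and probe-rocking rotation applied in polar space.
    `frame` is a single grey-scale frame, `origin` the probe's virtual origin (x, y),
    and `max_radius` the radius of the sector's inferior arc."""
    h, w = frame.shape[:2]
    polar = cv2.warpPolar(frame, (w, h), origin, max_radius,
                          cv2.INTER_LINEAR + cv2.WARP_POLAR_LINEAR)

    # 1D scaling along the radial axis (columns), keeping the pole (column 0) fixed.
    scaled_w = max(1, int(round(w * scale)))
    polar = cv2.resize(polar, (scaled_w, h), interpolation=cv2.INTER_LINEAR)
    polar = polar[:, :w] if scaled_w >= w else np.pad(polar, ((0, 0), (0, w - scaled_w)))

    # Rotation about the virtual origin == circular shift along the angular axis (rows).
    shift = int(round(rot_deg / 360.0 * h))
    polar = np.roll(polar, shift, axis=0)

    # Back to Cartesian coordinates; in the full pipeline the original FOV mask
    # would be re-applied afterwards to keep the content inside the scan sector.
    return cv2.warpPolar(polar, (w, h), origin, max_radius,
                         cv2.INTER_LINEAR + cv2.WARP_POLAR_LINEAR + cv2.WARP_INVERSE_MAP)
```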
Video-Level Inference Routine

Videos differ from images in their temporal variability: a finding may be present in some segments of a video and absent in others. While the dataset was annotated at the video level, it is possible that the labeled finding is not present throughout the entire video. This can reduce the algorithm’s sensitivity if only one random clip is assessed. To tackle this, the inference process was altered.
Instead of extracting a single clip from the video and obtaining a single prediction, one divides the LUS video into multiple overlapping clips of equal length, starting the first clip at the first frame. To obtain a video-level prediction, the scores of the multiple clips are averaged. Besides increasing the method’s sensitivity when compared to a one-clip prediction, this also increases its robustness when compared to a whole-video prediction (i.e. inputting the full video at the inference stage). Indeed, although the latter would allow the assessment of the full video and therefore the detection of any visible finding, by relying on multiple predictions, one decreases the method’s uncertainty and ultimately increases its accuracy.
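A possible sketch of this routine, reusing the `extract_clip` helper sketched earlier and the 0.5 s step reported in the implementation details; `predict_video` and the Keras-style `model.predict` call are assumptions.

```python
import numpy as np

def predict_video(model, video, video_fps, clip_len=32, clip_fps=8.0, step_s=0.5):
    """Divide the video into overlapping clips (one every `step_s` seconds),
    classify each clip, and average the per-clip scores into a video-level prediction."""
    duration_s = len(video) / video_fps
    last_start = max(duration_s - clip_len / clip_fps, 0.0)
    starts = np.arange(0.0, last_start + 1e-6, step_s)
    clips = np.stack([extract_clip(video, video_fps, s) for s in starts])
    scores = model.predict(clips[..., np.newaxis])   # shape: (n_clips, n_classes)
    return scores.mean(axis=0)                       # video-level prediction
```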
Fig. 4 Illustration of the proposed multi-output to high-level ensemble method. The aggregation function is represented by f (sum/maximum function for categorical/multi-label) and the coloured blocks represent an average operation. \(X_i\) represents the high-level feature i, while \(X_{i,j}\) denotes the low-level feature j associated with the high-level feature i
Semi-supervised Block

For semi-supervision, one proposes to use the UPS method [22], utilising the supervised model as backbone. This is a pseudo-labeling method that, through a more precise label selection, aims to diminish the amount of noise present in the pseudo-labels. Additionally, UPS offers an approach for the use of pseudo-labeling in multi-label scenarios and performs well with video data. Note, however, that no results were originally shown for a multi-label video dataset.
The UPS method consists of three steps: pseudo-label generation, pseudo-label selection, and model training. For the generation of pseudo-labels, the video-level predictions of the supervised model are used as hard labels. However, to create more accurate labels and reduce the noise injected into the semi-supervised training, only high-confidence predictions are selected. This is applied in both a positive and a negative manner, meaning that the network can be confident that the output is of a certain class, attributing a positive label, or confident that the output does not belong to that class, attributing a negative label. For this, two thresholds are defined: \(\tau_p\), the positive threshold, and \(\tau_n\), the negative threshold. Besides the confidence thresholds, the prediction’s uncertainty is also integrated into the pseudo-label selection. Rizve et al. [22] showed that when labels are selected with more certainty, the calibration error is reduced. As such, two new limits are set, the uncertainty thresholds for positive and negative labels, \(k_p\) and \(k_n\), respectively. To summarise, in the multi-label setting, one has three types of pseudo-labels: positive, negative, and indeterminate. If the prediction value is higher than \(\tau_p\) and the uncertainty is lower than \(k_p\), the label is positive. If prediction and uncertainty are lower than \(\tau_n\) and \(k_n\), respectively, the label is negative. If neither condition holds, the label is considered indeterminate and does not contribute to the loss function. To indicate which labels are used when calculating the loss, a new vector, g, is introduced alongside the one-hot vector. For each class c of each sample i, \(g_c^{(i)}\) is either 1 (reliable) or 0 (indeterminate). In this work, label smoothing regularisation (LSR) [25] was also integrated into the UPS loss function, considering either its categorical or binary cross-entropy formulation according to the task at hand. The multi-label loss function is presented in Eq. 1, where C represents the number of classes, y the true pseudo-label vector after label smoothing, and \(\hat{y}\) the prediction output. Note that, for labeled samples, all classes in g are set to 1.
$$\begin{aligned} L_{\textrm{BCE}}^{(i)} = -\frac{1}{\sum _{c=1}^{C} g_{c}^{(i)}} \sum _{c=1}^{C} g_{c}^{(i)} \left[ y_{c}^{(i)} \log \left( \hat{y}_{c}^{(i)} \right) + \left( 1 - y_{c}^{(i)} \right) \log \left( 1 - \hat{y}_{c}^{(i)} \right) \right] \end{aligned}$$
(1)
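A minimal TensorFlow sketch of Eq. 1, assuming sigmoid outputs and the symmetric binary label-smoothing convention \(y \leftarrow y(1-\epsilon ) + 0.5\,\epsilon \); the exact smoothing formulation used here is an assumption, and `ups_multilabel_loss` is a hypothetical name.

```python
import tensorflow as tf

def ups_multilabel_loss(y_true, y_pred, g, smoothing=0.1, eps=1e-7):
    """Binary cross-entropy of Eq. 1, averaged only over reliable classes (g = 1)."""
    y_s = y_true * (1.0 - smoothing) + 0.5 * smoothing          # binary label smoothing
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)           # numerical stability
    bce = y_s * tf.math.log(y_pred) + (1.0 - y_s) * tf.math.log(1.0 - y_pred)
    per_sample = -tf.reduce_sum(g * bce, axis=-1) / tf.maximum(tf.reduce_sum(g, axis=-1), 1.0)
    return tf.reduce_mean(per_sample)
```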
From an implementation point of view, the supervised model is first trained on the labeled dataset. Once trained, predictions are computed for the unlabeled dataset. Here, predictions are made for the original unlabeled clip and for nine additional augmented versions of said clip. While the prediction for the original clip is used for confidence thresholding, all ten versions are used for uncertainty estimation (computed as the standard deviation of the ten predictions). After selecting the unlabeled samples and respective pseudo-labels, these are combined with the labeled dataset and used to re-train the proposed network from scratch. After each iteration, new pseudo-labels are generated for all unlabeled samples, using the new classifier and the proposed video-level inference technique. The aim is for the attributed pseudo-labels to be more curated, injecting fewer wrongly classified samples into the subsequent network training, while benefiting from the inclusion of more data.
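The selection step can be summarised with the following NumPy sketch, using the threshold values reported in the implementation details; `select_pseudo_labels` is a hypothetical helper, and `p_all` stacks the scores of the original clip (first row) and its nine augmented versions.

```python
import numpy as np

def select_pseudo_labels(p_all, tau_p=0.5, tau_n=0.05, k_p=0.05, k_n=0.005):
    """UPS-style selection for one unlabeled sample.
    p_all: (n_versions, n_classes) scores; row 0 belongs to the original clip."""
    confidence = p_all[0]              # original clip used for confidence thresholding
    uncertainty = p_all.std(axis=0)    # standard deviation over all versions

    positive = (confidence > tau_p) & (uncertainty < k_p)
    negative = (confidence < tau_n) & (uncertainty < k_n)

    pseudo_label = positive.astype(np.float32)      # 1 for positive classes, 0 otherwise
    g = (positive | negative).astype(np.float32)    # 1 = reliable, 0 = indeterminate
    return pseudo_label, g
```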
Table 1 Multi-label label sets considered in the experiments

Table 2 Comparison between supervised and semi-supervised learning

Ensemble Modeling

The integration of domain knowledge into DL approaches has received considerable attention [8, 12, 17, 26]. By leveraging the known characteristics of the data, one may adapt a framework and tailor it accordingly, often leading to performance gains. With that in mind, a novel ensemble modeling technique is here proposed, taking advantage of the hierarchy of LUS findings (Fig. 2).
The proposal consists of an ensemble of models, each trained to predict a different output label set according to the dataset hierarchy in Fig. 2. The multiple models’ outputs comprise both low- and high-level labels. All models must cover the same label categories. In other words, a model may either have the high-level category as a label or all of the low-level labels of said category. For example, a model can either have the “other pathologies” label, or both the “consolidation” and “pleural effusion” labels. It cannot, however, have labels from a category not included in all models.
The intuition is that the classification of high-level nodes can be improved by considering information from their low-level counterparts. The reason for prioritising high-level nodes is that they are often of high clinical relevance, namely for patient screening and management, but have higher intra-class variability, which can increase the classifier’s uncertainty and undermine its performance. By contrast, low-level labels typically exhibit lower intra-class variability, enabling a classifier trained to predict them to focus on more subtle features and achieve superior performance. As such, models trained on low-level labels can be leveraged to enhance models with high-level labels, making use of knowledge that would otherwise be difficult to capture using a high-level model alone.
The proposed ensemble is obtained by combining the predictions of low- and high-level labels from the multiple models (Fig. 4). For each label, an average is computed. If the label is a low-level one (and the corresponding high-level label is not included in any model), the average is straightforward, considering the values inferred by all models. If the label is high-level, additional processing steps are required. The average prediction is derived from two sources: predictions for the high-level label from models that include it, and an aggregated score from its associated low-level labels. This aggregation varies depending on whether the problem is categorical or multi-label. In a categorical scenario, where model outputs are processed through a softmax function, the sum of all labels’ values equals 1. Consequently, the probability of a high-level label is calculated by summing the probabilities of its associated low-level labels. For the multi-label scenario, where labels are treated independently, the probability of a high-level label is determined by taking the highest predicted score among its associated low-level labels.
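As an illustration of the aggregation for a single high-level label, consider the sketch below; `ensemble_high_level` and its inputs are hypothetical names. For instance, in a multi-label setting, the final “other pathologies” score would be the mean of the scores from models predicting it directly and, for each low-level model, the maximum of its “consolidation” and “pleural effusion” scores.

```python
import numpy as np

def ensemble_high_level(high_scores, low_scores_per_model, multilabel=True):
    """Hierarchy-aware ensemble score for one high-level label.
    high_scores: scores from models that predict the high-level label directly.
    low_scores_per_model: one array per model with the scores of its associated
    low-level labels; each is aggregated (max for multi-label, sum for
    categorical/softmax outputs) before everything is averaged."""
    aggregated = [scores.max() if multilabel else scores.sum()
                  for scores in low_scores_per_model]
    return float(np.mean(list(high_scores) + aggregated))
```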
Note that, when the purpose is to train a classifier able to detect low-level labels only, the proposed technique reduces to traditional ensemble modeling. In short, the same model (the one with the intended label set) is trained multiple times and the results are averaged. Regardless of the scenario, each model is trained independently.
Table 3 Comparison between model ensemble strategies

Table 4 F1-scores of the proposal and two baselines on the test set for mlS1

Implementation Details

Categorical and binary cross-entropy were employed, respectively, as loss functions in the multi-class and multi-label scenarios, with LSR applied with a factor of 0.1. Additionally, in the inference routine, the overlapping clips were extracted with a step of 0.5 s between each other.
In the semi-supervised block, \(\tau_p\), \(\tau_n\), \(k_p\), and \(k_n\) were set to 0.5, 0.05, 0.05, and 0.005, respectively.
For data augmentation, we employed scaling within ±30%, rotations up to ±10°, and a contrast gain up to 0.25. Spatial- and intensity-based transformations had a respective 50% and 15% probability of application. In the pseudo-label generation process, we utilised spatial transformations only but with reduced ranges. Specifically, scaling and rotations of up to ±7.5% and ±2.5°, respectively, were always applied, while the flip transform had a 50% probability of application.
In both supervised and semi-supervised settings, the Adam optimiser [27] was employed, running 100 epochs with a batch size of 4. The learning rate was initialised at \(1\times 10^{-3}\) and updated using a cosine decay learning rate schedule [28]. The pipeline was developed in TensorFlow, using a workstation with an Nvidia RTX A6000, 64 GB of RAM, and an Intel(R) Core(TM) i9-12900F CPU.
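For reference, this training configuration could be set up in TensorFlow roughly as follows; `steps_per_epoch` is a placeholder that depends on the dataset size.

```python
import tensorflow as tf

EPOCHS, BATCH_SIZE = 100, 4
steps_per_epoch = 500   # placeholder; depends on the number of training clips

# Cosine decay of the learning rate over the full training run, starting at 1e-3.
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=EPOCHS * steps_per_epoch)
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# Mixed-precision setting, as reported for training.
tf.keras.mixed_precision.set_global_policy("mixed_float16")
```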
The labeled dataset was split at the patient level into 80% for training and 20% for testing, with the training portion used in a 5-fold cross-validation during algorithm development. Training was performed in a mixed precision setting.
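The patient-level split and cross-validation could be reproduced, for example, with scikit-learn’s group-aware splitters; the splitting tool actually used is not specified here, and the `patient_ids` array below is a synthetic toy example.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

rng = np.random.default_rng(0)
patient_ids = rng.integers(0, 50, size=200)     # toy example: 200 videos from 50 patients
videos = np.arange(200)

# 80/20 train/test split at the patient level (no patient appears in both sets).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(videos, groups=patient_ids))

# 5-fold cross-validation on the training portion, again grouped by patient.
gkf = GroupKFold(n_splits=5)
for fold, (tr, val) in enumerate(gkf.split(train_idx, groups=patient_ids[train_idx])):
    pass  # train and validate one fold here
```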