In a previous study, we collected data on surgeons performing RARP on the RobotiX Mentor Simulator™ (Surgical Science, Gothenburg, Sweden) [12]. Ten novice surgeons with knowledge of the procedure (had assisted in one or more RARPs but never performed robotic surgery) and seven experienced RARP surgeons (had performed ≥ 50 RARPs) performed a maximum of three repetitions of each of the three part-procedures of the RARP: Bladder-neck dissection (BND), Neurovascular-bundle dissection (NVBD), and Urethrovesical anastomosis (UVA). Full-length videos of each part procedure were automatically recorded on the simulator. All videos were manually annotated by the principal investigator (RGO) using the event-logging software Behavioral Observation Research Interactive Software (BORIS, version 8.19.4, Torino, Italy, http://www.boris.unito.it) [13]. The videos were annotated with five different surgical gestures: Regular dissection, Hemostatic control, Clip application, Needle handling, and Suturing (Table 1). Further details on the annotation process are available in previous publications by Olsen et al. [10, 14].
Table 1 The neural network's classification performance for each class of surgical gestures showed high accuracy and specificity, with a lower performance in sensitivity

System design

A system was created to automatically annotate video sequences with the five different surgical gestures. The model consists of two neural networks in series: a transformer-based encoder network feeding into a recurrent neural network.
Processing, feature extraction, and sequence prediction

This model aimed to extract features from a sequence of frames in the video and, via training, predict one of the five classes of surgical gestures.
The model consisted of two individual networks: a pre-trained feature extractor (a Vision Transformer pre-trained on ImageNet) and a classification head on top. The classification head was the only network updated during model training.
The feature extractor is a frozen, pre-trained vision transformer used to reduce the number of dimensions in the frames by extracting features from each frame. This makes the images easier for the network to process without losing important features. Different feature extractors (ViT, Inception, ResNet) were compared using Monte Carlo cross-validation, and a Vision Transformer pre-trained on ImageNet performed the best. ImageNet is a database of everyday objects, e.g., animals, furniture, and persons, on which networks have been trained to assign classes to images. ImageNet does not contain surgical images but has been used as a basis for feature extractors for pathological images and surgical video analysis in previous studies [7, 15].
The classification head is a trained recurrent neural network with a Long Short-Term Memory (LSTM) architecture with 128 units, followed by a fully connected layer that reduces the LSTM output to a prediction over the five gestures used for analysis.
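As an illustration, the following is a minimal Keras sketch of such a classification head. The per-frame feature dimension (768, typical of a ViT-Base encoder), the optimizer, and the loss function are assumptions made for the sketch and not taken from the study.

```python
from tensorflow.keras import layers, models

SEQ_LEN = 25       # frames per input sequence (1 s at 25 fps)
FEAT_DIM = 768     # assumed dimensionality of the per-frame ViT features (ViT-Base)
NUM_GESTURES = 5   # Regular dissection, Hemostatic control, Clip application,
                   # Needle handling, Suturing

def build_classification_head():
    # Input: a sequence of pre-extracted ViT feature vectors, one per frame
    inputs = layers.Input(shape=(SEQ_LEN, FEAT_DIM))
    # LSTM with 128 units models the temporal order of the frames
    x = layers.LSTM(128, return_sequences=True)(inputs)
    # Fully connected layer maps the LSTM output to per-frame gesture probabilities
    outputs = layers.Dense(NUM_GESTURES, activation="softmax")(x)
    return models.Model(inputs, outputs)

model = build_classification_head()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Returning per-frame predictions (return_sequences=True) is consistent with the inference procedure described below, where only the predictions for the last 10 frames of each sequence are retained.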
Data preprocessing

Before feeding the videos to the model, they were pre-processed into input sequences of 25 frames over 1 s (Fig. 1). All frames were color-normalized to match the requirements of the vision transformer and rescaled to 299 × 299 pixels with three color channels. These sequences were then used as input to the model (Fig. 2).
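As a minimal sketch of this pre-processing with OpenCV, assuming simple scaling of pixel values to [0, 1] (the exact color normalization required by the vision transformer is not reproduced here, and the helper name is illustrative):

```python
import cv2
import numpy as np

FRAME_SIZE = (299, 299)   # target resolution stated above
SEQ_LEN = 25              # 25 frames per 1-second input sequence

def load_sequences(video_path):
    """Read a video and yield pre-processed sequences of 25 consecutive frames."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      # three color channels (RGB)
        frame = cv2.resize(frame, FRAME_SIZE)               # rescale to 299 x 299 pixels
        frames.append(frame.astype(np.float32) / 255.0)     # assumed [0, 1] normalization
        if len(frames) == SEQ_LEN:
            yield np.stack(frames)                          # shape (25, 299, 299, 3)
            frames = []
    cap.release()
```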
Fig. 1 Pre-processing of the annotated videos from full videos into sequences of 25 frames per 1 s. The data set consisted of part procedures performed by novice or experienced RARP surgeons. Each surgeon performed each part procedure up to three times. The part procedures were annotated and sampled at 25 frames per second
Fig. 2 Overview of the model for automated surgical gesture annotation. Input: image sequences were pre-processed to 25 frames per second; Processing: the feature extractor reduces the number of dimensions in the frames while the classification head predicts the five gestures; Prediction: based on the last 10 frames in a sequence, each sequence is labeled with one of the five gestures. The feature extractor and classification head predict one gesture per input sequence. LSTM: Long Short-Term Memory
Model training

To train the network, the data set was split into a training + validation set and a test set using a sevenfold split, ensuring that the video sequences from one experienced surgeon were excluded from each training and validation set and that the test set preserved the same ratio of experienced to novice videos. This means that the dataset was divided into seven equal parts; in each fold, six parts were used for training and validation, and the remaining part was used for testing. This process was repeated seven times, ensuring that each subset served as the test set once.
The network's weights were updated on the training sets, guided by its performance on the validation set. The test set was held out of the training process and used to evaluate model performance. This ensures an unbiased prediction of the surgical gestures, based on a dataset the network had never been presented with during training.
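As a sketch of this grouping logic, assuming a hypothetical list of video records carrying surgeon IDs and experience flags (the actual bookkeeping in the study may differ):

```python
import numpy as np

def sevenfold_splits(records):
    """records: list of dicts with keys 'surgeon_id', 'experienced' (bool), and 'path'.
    Each fold holds out all videos of one experienced surgeon plus a proportional share
    of novice videos as the test set; the remaining six parts form training/validation."""
    experienced_ids = sorted({r["surgeon_id"] for r in records if r["experienced"]})
    novice_idx = [i for i, r in enumerate(records) if not r["experienced"]]
    novice_chunks = np.array_split(novice_idx, len(experienced_ids))  # seven chunks
    for fold, (exp_id, chunk) in enumerate(zip(experienced_ids, novice_chunks)):
        test_idx = {i for i, r in enumerate(records) if r["surgeon_id"] == exp_id}
        test_idx.update(int(i) for i in chunk)
        train_val = [r for i, r in enumerate(records) if i not in test_idx]
        test = [records[i] for i in sorted(test_idx)]
        yield fold, train_val, test
```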
Model inference

At inference, i.e., when the trained model is used for prediction, every input sequence of all videos was labeled with one of the five surgical gestures (Fig. 2). During inference, we used a sliding-window approach with a 15-frame overlap between sequences. For each input sequence (e.g., 25 frames), the model outputs predictions for all frames, but only the predictions for the last 10 frames in the sequence are retained. These are then appended to the overall prediction array. This approach ensures that each frame receives only one prediction, without the need for any additional post-processing to resolve overlaps.
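A sketch of this sliding-window inference, assuming per-frame ViT features as input and a Keras-style model.predict; how the first frames of a video (which fall outside the retained window tails) are handled is not specified in the text and is left out of the sketch:

```python
import numpy as np

WINDOW = 25                 # frames per input sequence
OVERLAP = 15                # overlap between consecutive sequences
KEEP = 10                   # predictions retained from the end of each window
STRIDE = WINDOW - OVERLAP   # -> a step of 10 frames

def sliding_window_predict(model, features):
    """features: array of shape (n_frames, feat_dim), one ViT feature vector per frame.
    Returns the concatenated per-frame class probabilities of the retained window tails."""
    retained = []
    for start in range(0, features.shape[0] - WINDOW + 1, STRIDE):
        window = features[start:start + WINDOW][np.newaxis]   # shape (1, 25, feat_dim)
        probs = model.predict(window, verbose=0)[0]           # shape (25, n_classes)
        retained.append(probs[-KEEP:])                        # keep only the last 10 frames
    return np.concatenate(retained, axis=0)
```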
Using softmax activation, the model predicted the probability of each of the five gestures being present in a sequence (Fig. 3). When all predictions had been made, the probability of a gesture being present was updated according to Bayes' theorem. The gesture with the highest probability was then assigned, so that only one gesture was labeled for each input sequence.
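As a simplified sketch of the per-sequence labeling (the Bayesian update of the probabilities is omitted, and averaging the softmax outputs within a sequence is an assumption made for the sketch):

```python
import numpy as np

GESTURES = ["Regular dissection", "Hemostatic control", "Clip application",
            "Needle handling", "Suturing"]

def label_sequence(frame_probs):
    """frame_probs: array of shape (n_frames, 5) with softmax outputs for one sequence.
    Returns the single gesture with the highest (here: mean) probability."""
    return GESTURES[int(np.argmax(frame_probs.mean(axis=0)))]
```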
Fig. 3 An example of the prediction of the gesture class Regular dissection in NVBD. The top panel shows the model's probability output, P(x), over time. The solid black line represents the smoothed probability signal, while the dashed black line indicates the decision threshold used to classify gestures. Areas shaded green correspond to segments where the probability exceeds the threshold (i.e., predicted presence of the gesture), and areas shaded red indicate segments below the threshold (i.e., absence of the gesture). The second panel displays the Predicted gesture labels over time, where orange segments represent predicted gesture occurrences, and black segments indicate no gesture predicted. The third panel shows the Annotated ground truth labels, with yellow segments representing annotated gesture occurrences and dark blue segments indicating no gesture. The bottom panel visualizes the Overlap, with black bars marking frames where the prediction matches the ground truth and white areas indicating mismatches. The x-axis reflects normalized video progress in per mille units (‰ video).
Post-processing: from sequences to segmented video

For post-processing, the predictions for the whole video were normalized to a fixed length of 1000 steps so that videos of varying lengths could be handled consistently. The analysis employs an adaptive thresholding approach based on entropy to enhance prediction precision and dynamic event detection (Fig. 3). The entropy-based thresholding operates by calculating a sliding-window entropy over the predictions from the model, adjusting the threshold adaptively to account for local variations in prediction uncertainty. Higher entropy results in a more relaxed threshold, while lower entropy enforces stricter criteria. Hysteresis thresholding was applied using dynamic high and low thresholds derived from the entropy-adjusted threshold: the high threshold is set slightly above the entropy-based threshold (+ 0.1), while the low threshold is set slightly below (− 0.1). This two-level thresholding enables robust event detection by accounting for transient, high-frequency fluctuations in predictions.
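A sketch of one possible implementation of this entropy-adjusted hysteresis thresholding; the window length, base threshold, entropy scaling, and the binary-entropy formulation are assumptions, and only the ± 0.1 offsets are taken from the text:

```python
import numpy as np
from scipy.stats import entropy

def entropy_adjusted_thresholds(probs, window=50, base=0.5, scale=0.2):
    """probs: (n_steps,) probability of one gesture over the length-normalized video.
    Returns per-step (low, high) hysteresis thresholds."""
    n = len(probs)
    low, high = np.empty(n), np.empty(n)
    for i in range(n):
        seg = probs[max(0, i - window // 2): i + window // 2 + 1]
        m = float(seg.mean())
        h = entropy(np.array([m, 1.0 - m]) + 1e-9, base=2)  # local prediction uncertainty
        thr = base - scale * h        # higher entropy -> lower (more relaxed) threshold
        high[i] = thr + 0.1           # hysteresis high threshold
        low[i] = thr - 0.1            # hysteresis low threshold
    return low, high

def hysteresis_segments(probs, low, high):
    """Mark a gesture as present once probs exceeds the high threshold and keep it active
    until probs drops below the low threshold, suppressing brief fluctuations."""
    active = False
    present = np.zeros(len(probs), dtype=bool)
    for i, p in enumerate(probs):
        if not active and p >= high[i]:
            active = True
        elif active and p < low[i]:
            active = False
        present[i] = active
    return present
```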
Visual comparison of correct and incorrect predictions

To qualitatively assess model performance, we generated visual comparisons between misclassified and correctly classified sequences. For each misclassified example, a correctly classified sequence with the same class label and from the same video was identified. Matching was performed by comparing filename-encoded metadata: the class label and video ID were required to be identical, and the sequence number was allowed to vary within a range of ± 10 sequences (250 frames, ~ 10 s) to find a close temporal match. Only unused correct examples were considered to avoid repetition.
Each video sequence was represented as a fixed-size image palette composed of 25 consecutive RGB frames arranged in a grid. These palettes were saved both as standalone images and embedded in a multi-page PDF to support side-by-side visual inspection (Supplementary File).
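For illustration, a sketch of building such a palette, assuming a 5 × 5 grid layout (the exact arrangement is not stated):

```python
import numpy as np

def frames_to_palette(frames, cols=5):
    """Arrange a sequence of 25 RGB frames, shape (25, H, W, 3), into a grid image."""
    rows = int(np.ceil(len(frames) / cols))
    h, w, c = frames[0].shape
    palette = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for idx, frame in enumerate(frames):
        r, col = divmod(idx, cols)
        palette[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return palette
```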
Statistical analysis

The overall performance of the neural network was assessed using multiple commonly used metrics for multi-label classification: the Area Under the Curve (AUC), which assesses the true positive rate against the false positive rate, and the F1-score, the harmonic mean of precision (positive predictive value) and sensitivity (true positive rate).
Performance for each of the five surgical gesture classes was assessed using metrics for single-label classification: sensitivity (true positive rate), specificity (true negative rate), and accuracy (the overall proportion of correct predictions, both true positives and true negatives).
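A minimal NumPy sketch of these one-vs-rest metrics for a single gesture class (array names are illustrative):

```python
import numpy as np

def class_metrics(y_true, y_pred, cls):
    """One-vs-rest sensitivity, specificity, and accuracy for gesture class `cls`.
    y_true and y_pred are arrays with one class label per frame or sequence."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == cls) & (y_pred == cls))
    tn = np.sum((y_true != cls) & (y_pred != cls))
    fp = np.sum((y_true != cls) & (y_pred == cls))
    fn = np.sum((y_true == cls) & (y_pred != cls))
    sensitivity = tp / (tp + fn)              # true positive rate
    specificity = tn / (tn + fp)              # true negative rate
    accuracy = (tp + tn) / y_true.size        # proportion of all correct predictions
    return sensitivity, specificity, accuracy
```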
Performance on the video level: the Intersection over Union (IoU) is a common metric used to evaluate the similarity between two arrays, typically in the context of temporal segmentation tasks. In the standard definition, the IoU is calculated as the ratio of the intersection (overlapping positive/true elements) to the union (total unique positive elements) of the two arrays. However, this conventional approach ignores cases where both the prediction and the annotation contain zero values, i.e., correct agreement on the absence of a gesture. To account for this, we defined Total Agreement, an extended version of the IoU that incorporates both matching positive values and matching zeros into the intersection and union calculations.
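A sketch of the Total Agreement metric under one reading of this definition, in which frames where both arrays carry the same positive label, or are both zero, count toward the intersection, and all frames form the union:

```python
import numpy as np

def total_agreement(annotation, prediction):
    """Extended IoU: agreement on positive labels plus agreement on zeros (no gesture),
    divided by the total number of frames. Label 0 denotes 'no gesture'."""
    annotation, prediction = np.asarray(annotation), np.asarray(prediction)
    positive_match = np.sum((annotation == prediction) & (annotation != 0))
    zero_match = np.sum((annotation == 0) & (prediction == 0))
    return (positive_match + zero_match) / annotation.size
```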
The models and statistical analyses were implemented using the Python programming language (version 3.10.10, Python Software Foundation, Amsterdam, Netherlands, https://www.python.org/) and the libraries OpenCV, NumPy, SciPy, TensorFlow, Keras-vit, and Matplotlib.