Thyroid nodules are among the most common disorders of the thyroid gland and pose a significant public health concern. Medical imaging techniques, including ultrasound (US), magnetic resonance imaging (MRI), and computed tomography (CT), can help radiologists diagnose and treat thyroid nodules. Ultrasound in particular has become the primary modality for clinical examination of thyroid nodules, owing to its portability, low cost, real-time feedback, and lack of ionizing radiation or associated toxicity. However, the ultrasonic manifestations of thyroid nodules are diverse and easily affected by image artifacts and noise, so professional knowledge and clinical experience are vital for accurate diagnosis. The Thyroid Imaging Reporting and Data System (TI-RADS) (Tessler et al., 2017) was recently proposed to standardize image acquisition, interpretation, reporting, and data collection for thyroid ultrasound examinations. As a widely recognized standard for risk stratification, this protocol summarizes a range of highly suspicious characteristics, including hypoechoic reflectance, halo loss, microcalcifications, increased hardness, nodule inflow, and shape aspect ratio. Additional details are provided in Table 1. The sub-classes in the second column indicate distinct presentations of each characteristic, corresponding to the individual scores in the third column. The sum of scores across all characteristics yields a risk stratification for the nodule, and the relationship between specific risk stratifications and TI-RADS scores is presented in Table 2.
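To make this scoring procedure concrete, the short Python sketch below sums per-characteristic points and maps the total to a risk level. The attribute names, point values, and thresholds are illustrative placeholders (loosely modeled on ACR TI-RADS) and are not the exact contents of Tables 1 and 2.

```python
# Illustrative TI-RADS-style scoring: sum per-characteristic points, then map
# the total to a risk level. Points and thresholds here are placeholders,
# not the exact values listed in Tables 1 and 2.
def tirads_risk(points_per_characteristic):
    total = sum(points_per_characteristic.values())
    if total == 0:
        return total, "TR1 (benign)"
    elif total <= 2:
        return total, "TR2 (not suspicious)"
    elif total == 3:
        return total, "TR3 (mildly suspicious)"
    elif total <= 6:
        return total, "TR4 (moderately suspicious)"
    return total, "TR5 (highly suspicious)"

# Example: one point for hypoechogenicity, three for microcalcifications.
print(tirads_risk({"echogenicity": 1, "echogenic_foci": 3}))  # -> (4, 'TR4 ...')
```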
Fine needle aspiration biopsy (FNAB), the current gold standard for thyroid nodule diagnosis, is an invasive technique. Its invasiveness can cause patient trauma and additional costs, making it unsuitable for large-scale screening. As such, radiologists typically use the characteristics of thyroid nodules evident in ultrasound (US) videos to determine whether FNAB is required. This not only provides a preliminary assessment but also effectively reduces unnecessary invasive punctures or surgeries. However, noise, artifacts, inconsistent scanning techniques, and varied personal experience may lead individual radiologists to interpret the same sample differently. Thus, computer-aided diagnosis (CAD) technologies are of broad interest, as they minimize radiologist-dependent variability. CAD techniques generally include two critical tasks: nodule localization and classification. Early methods often relied on hand-crafted features and various classifiers for diagnosis, such as random forests (Zhang et al., 2019) and support vector machines (Balasubramanian and Moorthi, 2013). However, designing features manually is time-consuming and labor-intensive. The success of convolutional neural networks (CNNs) in various computer vision tasks, including detection (Zhao et al., 2022b), segmentation (He et al., 2017, Wu et al., 2020, Wu et al., 2022, Pu et al., 2022, Lu et al., 2022), classification (Pu et al., 2021a, Pu et al., 2021b), and regression (Gao et al., 2023), has led to their use for automatically learning features from thyroid nodule images.
For example, Chi et al. (2017) utilized a pre-trained GoogLeNet and a cost-sensitive random forest algorithm for classification. Zhu et al. (2017) fine-tuned a ResNet using a novel data augmentation strategy to classify thyroid nodules. Liu et al. (2017) combined hand-crafted features with deep semantic features extracted by VGGNet, achieving classification through feature subset selection. Qin et al. (2020) proposed a novel network to extract features from conventional US and US elasticity images, fusing them to differentiate benign and malignant nodules. Wang et al. (2020) iteratively selected multiple images from one examination, extracting and aggregating their features to improve diagnostic performance. Thomas and Haertling (2020) calculated the image embedding similarity between a sample and corresponding images in a database. Chen et al. (2021) proposed a multi-view ensemble network that combined GoogLeNet, U-Net, and statistical texture features to achieve CAD through a voting mechanism. Liu et al. (2019) proposed a multi-scale region-based detection framework to identify nodules in US images and a multi-branch classification network to integrate multi-view features, achieving automated detection and classification. Song et al. (2019) designed a coarse-to-fine two-stage framework to locate and classify thyroid nodules. Shi et al. (2020) extracted domain knowledge from radiologists to guide the synthesis of medical images and improve the robustness of classification. Subsequently, Yang et al. (2021) further utilized domain knowledge and multi-modal US images to improve diagnostic accuracy. Zhao et al. (2022a) proposed a dual-pathway image analysis framework to capture local and global information. Chen et al. (2018) developed a two-level attention-based bidirectional long short-term memory (BiLSTM) network to learn information from US reports. Similarly, Chang et al. (2020) extracted the diagnostic behavior of individual radiologists from accumulated examination reports, thereby improving thyroid nodule classification.
Although the methods discussed above have achieved excellent performance, they mostly focus on static images (with representative features selected manually from scanning videos) or function as a black box for radiologists, directly predicting a final result without any explanation or supporting evidence. The manual selection process is time-consuming and labor-intensive, and black-box systems cannot provide the interpretation necessary to establish model credibility. Furthermore, during clinical examinations, radiologists often base diagnoses on consecutive video frames rather than single images. Because video footage is composed of continuous frames, it contains not only abundant spatial features but also temporal information across frames. Hence, a growing number of video classification methods have been proposed that exploit both spatial and temporal information. For example, Simonyan and Zisserman (2014) first introduced a two-stream paradigm that models temporal information (optical flow) and spatial information in videos for action recognition. Inspired by this, the temporal segment network (TSN) (Wang et al., 2016) utilizes a sparse sampling strategy to capture short snippets along the temporal dimension and reduce computational costs. Subsequently, the temporal relation network (TRN) (Zhou et al., 2018) investigated video frames at different time scales to recognize specific activities. The temporal shift module (TSM) (Lin et al., 2019) shifts channels along the time dimension to exchange information among neighboring frames and efficiently perform video analysis. The SlowFast network (Feichtenhofer et al., 2019) achieved video recognition by using a slow pathway with fewer frames and more channels to learn spatial semantic information, and a fast pathway with more frames and fewer channels to learn motion information. Additionally, the temporal difference network (TDN) (Wang et al., 2021) utilized a sparse sampling strategy to model temporal motion in video footage.
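As a concrete illustration of one of these ideas, the following is a minimal sketch of the channel-shift operation at the core of TSM (Lin et al., 2019): a fraction of the channels is shifted forward and backward along the time axis so that neighboring frames exchange information at essentially zero extra computation. The tensor layout and shift proportion are common choices, not necessarily those of the original implementation.

```python
import torch

def temporal_shift(x, n_segments, shift_div=8):
    """Shift a fraction of channels forward/backward along the time axis.

    x: frame-level feature maps of shape (N*T, C, H, W), with T = n_segments.
    """
    nt, c, h, w = x.size()
    n = nt // n_segments
    x = x.view(n, n_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                   # shift first group backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]   # shift second group forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # leave remaining channels untouched
    return out.view(nt, c, h, w)

# Example: 2 videos x 8 frames, 64-channel 56x56 feature maps.
feats = torch.randn(2 * 8, 64, 56, 56)
shifted = temporal_shift(feats, n_segments=8)
```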
The C2D approach (Karpathy et al., 2014) captures spatio-temporal features from large-scale video datasets by extending a CNN along the temporal dimension and introducing a multi-resolution architecture to speed up training. C3D (Tran et al., 2015) subsequently propagated temporal information implicitly across all layers of a 3D ConvNet for video analysis. The Res3D (Tran et al., 2017) model incorporated 3D convolutions into a residual network, thereby reducing parameters and inference time. Similarly, Hussein et al. (2017) proposed a 3D multi-task architecture combining graph sparse representations to achieve risk stratification for lung nodules in CT scans. Shen et al. (2019) also utilized 3D convolutions to extract features from CT images, proposing a two-level classification system to improve the interpretability of results. Subsequently, R2plus1D (Tran et al., 2018) decomposed 3D convolutions into spatial 2D convolutions followed by temporal 1D convolutions, achieving more accurate results. Similarly, P3D (Qiu et al., 2017) integrated both 2D spatial and 1D temporal convolutions into a residual network. I3D (Carreira and Zisserman, 2017) inflated the 2D convolutions of InceptionV1 (Szegedy et al., 2014) into 3D and incorporated the resulting network into a two-stream paradigm for action recognition. Inspired by I3D, Xie et al. (2017) proposed S3D-G, which includes spatio-temporal separable 3D convolutions with feature gating, to improve classification. Varol et al. (2018) then introduced long-term temporal convolutions (LTCs) into AlexNet to increase the temporal extent of representations and improve accuracy.
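The sketch below illustrates the (2+1)D factorization shared by R2plus1D and P3D: a spatial 2D convolution followed by a temporal 1D convolution, both expressed as 3D convolutions with degenerate kernel dimensions. The intermediate width and kernel size are placeholder choices for illustration.

```python
import torch
import torch.nn as nn

class Conv2Plus1D(nn.Module):
    """A (2+1)D block: spatial 2D convolution followed by temporal 1D convolution."""
    def __init__(self, in_ch, out_ch, mid_ch, k=3):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, mid_ch, kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2), bias=False)
        self.bn = nn.BatchNorm3d(mid_ch)
        self.relu = nn.ReLU(inplace=True)
        self.temporal = nn.Conv3d(mid_ch, out_ch, kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0), bias=False)

    def forward(self, x):  # x: (N, C, T, H, W)
        return self.temporal(self.relu(self.bn(self.spatial(x))))

# Example: a clip of 8 frames with 3 channels at 112x112 resolution.
clip = torch.randn(1, 3, 8, 112, 112)
block = Conv2Plus1D(in_ch=3, out_ch=64, mid_ch=45)
print(block(clip).shape)  # torch.Size([1, 64, 8, 112, 112])
```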
LSTM networks (Ng et al., 2015) and their variants (Cho et al., 2014, Zhou et al., 2016) have been shown to capture long-range dependencies in temporally ordered sequences. Ng et al. (2015) explored feature pooling and LSTMs as a means of temporally integrating frame-level CNN features for video classification. Subsequently, ConvLSTM (Shi et al., 2015) added cell memory information to the input of each LSTM gate and replaced the original time-series vectors with matrices, reducing redundant information and enhancing spatial correlation. LRCNs (Donahue et al., 2017) use a cascaded CNN to extract features that are fed to an LSTM, which outputs the action recognition result. ARTNet (Wang et al., 2018) stacks multiple generic building blocks to simultaneously model appearance and relation information from RGB input, thereby achieving video classification. Wang et al. (2017) introduced a non-local block to capture long-range dependencies, which can be embedded into any existing network to significantly improve performance.
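For reference, the sketch below shows the embedded-Gaussian form of the non-local block of Wang et al. (2017), applied here to a 1D feature sequence: every position attends to every other position, and a residual connection lets the block be inserted into an existing network. The channel sizes and the 1D setting are illustrative assumptions, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class NonLocal1D(nn.Module):
    """Embedded-Gaussian non-local block over a feature sequence (N, C, T)."""
    def __init__(self, channels):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv1d(channels, inter, kernel_size=1)
        self.phi = nn.Conv1d(channels, inter, kernel_size=1)
        self.g = nn.Conv1d(channels, inter, kernel_size=1)
        self.out = nn.Conv1d(inter, channels, kernel_size=1)

    def forward(self, x):                          # x: (N, C, T)
        theta = self.theta(x).transpose(1, 2)      # (N, T, C')
        phi = self.phi(x)                          # (N, C', T)
        attn = torch.softmax(theta @ phi, dim=-1)  # (N, T, T) pairwise affinities
        g = self.g(x).transpose(1, 2)              # (N, T, C')
        y = (attn @ g).transpose(1, 2)             # (N, C', T) aggregated response
        return x + self.out(y)                     # residual connection

# Example: a batch of 4 sequences of 30 frame features with 128 channels.
seq = torch.randn(4, 128, 30)
print(NonLocal1D(128)(seq).shape)  # torch.Size([4, 128, 30])
```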
Although the methods discussed above have achieved great success, they are not suitable for the dataset used in this study, primarily for three reasons. (1) Several interfering frames are present in each video, and the characteristics of individual frames can differ significantly for a given nodule. This is evident in Fig. 1, where the FNAB result for the first row is benign, yet some frames exhibit highly suspicious TI-RADS scores. Conversely, the FNAB result for the second row is malignant, but the majority of frames suggest a benign diagnosis, which degrades classification performance. (2) The number of frames per video varies widely, ranging from 2 to 180. It is therefore difficult to fix a frame length for learning the distribution of nodule features, which may affect model performance. (3) These methods remain a black box for radiologists, as they do not consider the number of thyroid nodules or provide interpretable diagnostic evidence.
To address these challenges, we propose a novel interpretable diagnostic framework for thyroid nodules in US videos, which simulates the manual diagnostic process and provides radiologists with interpretable suggestions. Specifically, we first develop a multi-task learning model that interprets each frame in terms of specific TI-RADS attributes, improving both efficiency and interpretability. We then convert these attributes into a self-defined one-dimensional embedding vector to reduce computational costs and better represent the similarities and differences among frames. Finally, classification performance is improved by processing sequences of varying length and dynamically enhancing useful information: a BiLSTM captures bi-directional contextual information from the sequences of one-dimensional embedding vectors, and an attention mechanism is introduced into the BiLSTM to selectively enhance critical information, improving the classification of thyroid nodules in US videos.
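A minimal sketch of this video-level classifier is given below: a BiLSTM reads a padded batch of per-frame attribute embeddings of varying length, and a soft attention layer pools the hidden states into a single nodule-level prediction. The layer sizes and the exact attention form are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    """BiLSTM over variable-length frame embeddings with attention pooling (sketch)."""
    def __init__(self, embed_dim, hidden_dim, n_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)
        self.fc = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x, lengths):
        # x: (N, T_max, embed_dim) padded embedding sequences; lengths: (N,)
        packed = nn.utils.rnn.pack_padded_sequence(
            x, lengths.cpu(), batch_first=True, enforce_sorted=False)
        h, _ = self.bilstm(packed)
        h, _ = nn.utils.rnn.pad_packed_sequence(h, batch_first=True)  # (N, T, 2H)
        scores = self.attn(h).squeeze(-1)                             # (N, T)
        pad = torch.arange(h.size(1))[None, :] >= lengths[:, None]    # mask padded frames
        scores = scores.masked_fill(pad, float('-inf'))
        weights = torch.softmax(scores, dim=1)                        # per-frame attention
        pooled = (weights.unsqueeze(-1) * h).sum(dim=1)               # (N, 2H)
        return self.fc(pooled), weights

# Example: 3 videos with 12, 5, and 9 frames, each frame a 6-dim attribute embedding.
x = torch.zeros(3, 12, 6)
lengths = torch.tensor([12, 5, 9])
logits, frame_weights = AttentiveBiLSTM(embed_dim=6, hidden_dim=32)(x, lengths)
```

The returned attention weights indicate which frames contributed most to the decision, which is one way to surface frame-level evidence to radiologists.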
The primary motivation for this study is the development of a computer-aided diagnosis (CAD) system that closely resembles the clinical workflow of radiologists. This approach not only provides accurate automated diagnostic results (improving efficiency) but also aims to increase the interpretability of CAD outputs, allowing clinicians to make more informed and intuitive classification decisions. This is achieved primarily through an attention mechanism that identifies distinguishing features of thyroid nodules, together with a BiLSTM that captures spatio-temporal information across ultrasound video frames. The approach is inspired by the manual diagnostic process, in which contextual information and the relationships between adjacent ultrasound images often provide critical distinguishing evidence during nodule assessment. The primary contributions of this study can be summarized as follows:
• To the best of our knowledge, we present the first interpretable CAD framework for thyroid nodules in ultrasound videos, achieving both frame-wise interpretation and video-level classification of each nodule. This process not only enhances interpretability but also has the potential to alleviate radiologist workload through computer-aided diagnosis.
• We develop a self-defined one-dimensional embedding vector to represent the discriminative attributes of thyroid nodules extracted by the multi-task model. This not only provides a risk stratification for nodules but also allows the algorithm to capture contextual information from variable-length frame sequences. The proposed embedding can effectively reduce model complexity, decrease computational costs, and differentiate benign and malignant nodules more accurately.
• We conduct a comprehensive evaluation of model performance. Experimental results demonstrate that the proposed framework not only outperforms other video classification techniques but also reduces computational costs and improves the interpretability of results.