Medical image segmentation plays a crucial role in delivering effective patient care in diagnostic and treatment practices. Among numerous segmentation approaches, deep convolutional neural networks (DCNNs) have recently become the de facto standard, providing state-of-the-art (SOTA) performance on many segmentation tasks (Litjens et al., 2017, Hesamian et al., 2019, Wu et al., 2022). As data-driven techniques, DCNNs require a large amount of accurately annotated images for training, which is time-consuming and resource-intensive to obtain for medical image segmentation tasks. Beyond its tremendous cost, manual annotation of medical images can hardly be accurate, since it is highly subjective and relies on the observers’ perception and expertise (Vincent et al., 2021, Liao et al., 2022, Fu et al., 2014, Taghanaki et al., 2018, Mirikharaji et al., 2021). For example, when three trained observers (two radiologists and one radiotherapist) delineated a liver lesion in an abdominal CT image twice, about one week apart, the delineated area varied by up to 10% within each observer and by more than 20% between observers (Suetens, 2017). Such annotator-related bias in ground truths (GTs) is an ‘inconvenient truth’ in the field of medical image segmentation, yet its impact has rarely been discussed.
To reduce the impact of such annotator-related biases, each training sample can be annotated independently by multiple medical professionals (Liao et al., 2022, Fu et al., 2014, Taghanaki et al., 2018, Mirikharaji et al., 2021) (see Fig. 1), and a proxy ground truth is then generated via majority voting (Guan et al., 2018), label fusion (Chen et al., 2019, Li et al., 2020, Liu et al., 2020, Zhang et al., 2020c, Zhao et al., 2020, Warfield et al., 2004), or label sampling (Jensen et al., 2019). It is worth noting that, in many cases, the diverse annotations provided by multiple annotators are all reasonable but reflect different preferences. For instance, a medical professional who advocates active treatment usually delineates a slightly larger lesion area than that marked by others. To illustrate annotator preference, we show two fundus images from the RIGA dataset and the annotations of the optic disc and optic cup given by six annotators in the right part of Fig. 2. The IoU between each annotator’s delineation and the union of the six annotations is calculated, and the IoU values averaged over all training samples (a quantification of annotator preference) are shown in the left part of Fig. 2. The figure reveals that annotator A3 prefers to mark larger optic discs, whereas annotators A2 and A4 prefer to mark larger optic cups.
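The preference measure used in Fig. 2 can be made concrete as follows. The snippet below is a minimal NumPy sketch (not the code used to produce the figure); it assumes the binary masks are stacked into an array of shape (samples, annotators, H, W).

```python
import numpy as np

def annotator_preference(annotations: np.ndarray) -> np.ndarray:
    """Average IoU of each annotator's mask against the union of all masks.

    annotations: binary masks of shape (num_samples, num_annotators, H, W).
    Returns one value per annotator; a larger value indicates a preference for
    delineating larger regions (e.g., A3 for the optic disc in Fig. 2).
    """
    union = annotations.any(axis=1, keepdims=True)              # (N, 1, H, W)
    inter = np.logical_and(annotations, union)                  # each mask is a subset of the union
    iou = inter.sum(axis=(2, 3)) / np.maximum(union.sum(axis=(2, 3)), 1)
    return iou.mean(axis=0)                                     # average over training samples
```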
We assume that, underlying the delineation process, there is a latent true segmentation that represents the consensus of all annotators. Due to annotator preference, however, the annotations from multiple annotators are not fully consistent; moreover, each delineation can also be affected by stochastic errors. We refer to this latent true segmentation as the meta segmentation in this paper. Using proxy ground truths can somewhat diminish the impact of stochastic annotation errors (Monteiro et al., 2020, Mirikharaji et al., 2019), but cannot account for annotator preference (Ji et al., 2021, Guan et al., 2018, Mirikharaji et al., 2019, Ribeiro et al., 2019, Lampert et al., 2016). In particular, converting the multiple annotations of each training image into a proxy ground truth not only discards the rich information embedded in those annotations but, more importantly, may yield segmentation results that are neither fish nor fowl. Therefore, instead of fitting proxy ground truths, we advocate simulating the delineation process by estimating annotators’ preferences and stochastic annotation errors. In this way, a CNN can not only produce the meta segmentation but also mimic each annotator, segmenting medical images with his or her preference (see the green pipeline in Fig. 1).
To this end, we propose a Preference-involved Annotation Distribution Learning (PADL) framework to address the issue of annotator-related bias in medical image segmentation. This framework consists of an encoder–decoder backbone, a stochastic error modeling (SEM) module, a series of human preference modeling (HPM) modules, and a series of Gaussian Sampling modules. The encoder–decoder backbone extracts image features. The SEM module uses these features to estimate the meta segmentation map μ and the average stochastic error map σ, employing an entropy guided attention (EGA) block to guide the estimation of σ. In the r-th HPM module, a preference estimation block estimates the annotator-specific segmentation map μr that reflects the preference of the r-th annotator, and an EGA block estimates the corresponding stochastic error map σr. The SEM module and each HPM module are equipped with a Gaussian Sampling module, which samples a probabilistic segmentation map from the Gaussian distribution parameterized by the estimated μ (or μr) and σ (or σr). The loss function comprises a meta segmentation loss and annotator-specific segmentation losses, each defined as the cross-entropy between the sampled segmentation maps and the annotations. A simplified sketch of this pipeline is given after the contribution list below. We have evaluated the proposed PADL framework on two medical image segmentation benchmarks, which cover multiple imaging modalities and five segmentation tasks and were annotated by multiple medical professionals. To summarize, the contributions of this work are three-fold.
•We highlight the issue of annotator-related biases existing in medical image segmentation tasks, and propose the PADL framework to address it by modeling each annotator’s preference and stochastic errors, so as to produce not only a meta segmentation but also the segmentation that each annotator would possibly make.
•We treat annotation bias as the combination of an annotator’s preference and stochastic errors, and accordingly design the SEM module and annotator-specific HPM modules to characterize each annotator’s preference while diminishing the impact of stochastic errors.
•Our PADL framework achieves superior performance over competing methods that address this issue on two medical image segmentation benchmarks (five tasks).
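As referenced above, the following is a minimal PyTorch sketch of the PADL heads, not the authors’ implementation: the backbone, the internals of the EGA and preference estimation blocks, the 1×1 convolutional heads, and the majority-vote meta target are placeholders or assumptions; only the μ/σ estimation, Gaussian sampling, and cross-entropy loss composition follow the description given before the contribution list.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Estimates a segmentation map (mu) and an error map (sigma), then samples
    a probabilistic segmentation map from N(mu, sigma^2)."""
    def __init__(self, in_ch: int, num_classes: int):
        super().__init__()
        self.mu_head = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        # Placeholder for the EGA block that guides the estimation of sigma.
        self.sigma_head = nn.Conv2d(in_ch, num_classes, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        mu = self.mu_head(feats)
        sigma = F.softplus(self.sigma_head(feats)) + 1e-6    # keep sigma positive
        sample = mu + sigma * torch.randn_like(mu)            # reparameterized Gaussian sampling
        return mu, sigma, sample

class PADLSketch(nn.Module):
    def __init__(self, backbone: nn.Module, feat_ch: int, num_classes: int, num_annotators: int):
        super().__init__()
        self.backbone = backbone                        # encoder-decoder feature extractor
        self.sem = GaussianHead(feat_ch, num_classes)   # stochastic error modeling (meta)
        self.hpm = nn.ModuleList(                       # one HPM head per annotator
            GaussianHead(feat_ch, num_classes) for _ in range(num_annotators))

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)
        return self.sem(feats), [head(feats) for head in self.hpm]

def padl_loss(meta, annotator_outputs, annotations: torch.Tensor) -> torch.Tensor:
    """annotations: integer masks of shape (B, R, H, W), one per annotator (binary assumed)."""
    # Meta segmentation loss: the target for the meta branch is not spelled out here,
    # so a majority-vote fusion of the R annotations is used as a stand-in.
    fused = (annotations.float().mean(dim=1) > 0.5).long()
    loss = F.cross_entropy(meta[2], fused)
    # Annotator-specific losses: each sampled map vs. the corresponding annotation.
    for r, (_, _, sample_r) in enumerate(annotator_outputs):
        loss = loss + F.cross_entropy(sample_r, annotations[:, r])
    return loss
```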