A Recycling Training Strategy for Medical Image Segmentation with Diffusion Denoising Models

Yunguan Fu1,2Orcid, Yiwen Li3Orcid, Shaheer U. Saeed1Orcid, Matthew J. Clarkson1Orcid, Yipeng Hu1Orcid
1: University College London, UK, 2: InstaDeep, UK, 3: University of Oxford, UK
Publication date: 2023/12/07
https://doi.org/10.59275/j.melba.2023-fbe4

Abstract

Denoising diffusion models have found applications in image segmentation by generating segmented masks conditioned on images. Existing studies predominantly focus on adjusting model architecture or improving inference, such as test-time sampling strategies. In this work, we focus on improving the training strategy and propose a novel recycling method. During each training step, a segmentation mask is first predicted given an image and random noise. This predicted mask, which replaces the conventional ground truth mask, is used for the denoising task during training. This approach can be interpreted as aligning the training strategy with inference by eliminating the dependence on ground truth masks for generating noisy samples. Our proposed method significantly outperforms standard diffusion training, self-conditioning, and existing recycling strategies across multiple medical imaging data sets: muscle ultrasound, abdominal CT, prostate MR, and brain MR. This holds for two widely adopted sampling strategies: denoising diffusion probabilistic model and denoising diffusion implicit model. Importantly, existing diffusion models often display a declining or unstable performance during inference, whereas our novel recycling consistently enhances or maintains performance. We show for the first time that, under a fair comparison with the same network architectures and computing budget, the proposed recycling-based diffusion models achieved on-par performance with non-diffusion-based supervised training. Furthermore, by ensembling the proposed diffusion model and the non-diffusion counterpart, significant improvements to the non-diffusion models have been observed across all applications, demonstrating the value of this novel training method. This paper summarizes these quantitative results and discusses their value, with a fully reproducible JAX-based implementation released at https://github.com/mathpluscode/ImgX-DiffSeg.

Keywords

Image Segmentation · Diffusion Model · Recycling · Muscle Ultrasound · Abdominal CT · Prostate MR · Brain MR

Bibtex
@article{melba:2023:016:fu,
  title = "A Recycling Training Strategy for Medical Image Segmentation with Diffusion Denoising Models",
  author = "Fu, Yunguan and Li, Yiwen and Saeed, Shaheer U. and Clarkson, Matthew J. and Hu, Yipeng",
  journal = "Machine Learning for Biomedical Imaging",
  volume = "2",
  issue = "Special Issue for Generative Models",
  year = "2023",
  pages = "507--546",
  issn = "2766-905X",
  doi = "https://doi.org/10.59275/j.melba.2023-fbe4",
  url = "https://melba-journal.org/2023:016"
}

RIS
TY - JOUR
AU - Fu, Yunguan
AU - Li, Yiwen
AU - Saeed, Shaheer U.
AU - Clarkson, Matthew J.
AU - Hu, Yipeng
PY - 2023
TI - A Recycling Training Strategy for Medical Image Segmentation with Diffusion Denoising Models
T2 - Machine Learning for Biomedical Imaging
VL - 2
IS - Special Issue for Generative Models
SP - 507
EP - 546
SN - 2766-905X
DO - https://doi.org/10.59275/j.melba.2023-fbe4
UR - https://melba-journal.org/2023:016
ER -


Disclaimer: the following html version has been automatically generated and the PDF remains the reference version. Feedback can be sent directly to publishing-editor@melba-journal.org

1 Introduction

Diffusion denoising models, first proposed by Sohl-Dickstein et al. (2015); Ho et al. (2020); Ho and Salimans (2022), are generative models that produce data samples through iterative denoising processes. These models achieved superior performance compared to generative adversarial networks (Goodfellow et al., 2020) and became the foundation for many image generation applications such as DALL·E 2 (Ramesh et al., 2022), Stable Diffusion, and Midjourney (Rombach et al., 2022). Given the success in computer vision, diffusion models have been adapted to medical imaging in various fields, including image synthesis (Dorjsembe et al., 2022; Khader et al., 2022), image denoising (Hu et al., 2022), anomaly detection (Wolleb et al., 2022a), classification (Yang et al., 2023), segmentation (Wu et al., 2022; Rahman et al., 2023), and registration (Kim et al., 2022). Among these, segmentation is one of the most foundational tasks in medical imaging and a variety of applications have been explored, including liver CT (Xing et al., 2023), lung CT (Zbinden et al., 2023; Rahman et al., 2023), abdominal CT (Wu et al., 2023; Fu et al., 2023), brain MR (Pinaya et al., 2022a; Wolleb et al., 2022b; Wu et al., 2023; Xing et al., 2023; Bieder et al., 2023), and prostate MR (Fu et al., 2023).

For segmentation tasks, although various model architectures and training strategies  (Wang et al., 2022) have been proposed, U-net equipped with attention mechanisms and trained by supervised learning consistently remains the state-of-the-art model and an important baseline. In comparison, divergent observations have emerged: some studies reported superior performance of diffusion-based segmentation models (Amit et al., 2021; Wu et al., 2022, 2023; Xing et al., 2023), while others observed the opposite trend (Pinaya et al., 2022a; Wolleb et al., 2022b; Kolbeinsson and Mikolajczyk, 2022; Fu et al., 2023). This inconsistency may result from different training schemes, network architectures, and application-specific modifications in comparisons, suggesting that challenges persist in applying diffusion models for image segmentation.

Formally, conditioning on an image, diffusion-based segmentation models operate by progressive denoising, starting with random noise and ultimately yielding the corresponding segmentation masks. In comparison to their non-diffusion counterparts, the necessity of supplementary noisy masks as input leads to increased memory demands that can pose challenges, particularly for processing 3D volumetric medical images. To address this, volume slicing (Wu et al., 2023) or patching (Xing et al., 2023; Bieder et al., 2023) has been used to manage memory limitations. However, diffusion model training still requires considerable computation due to its inherent iterative nature, since the same model needs to learn to denoise masks with varying levels of noise. Consequently, enhancing the diffusion model performance while adhering to a fixed compute budget is of significant importance. Empirically, using the reparametrisation (Kingma et al., 2021), the denoising training task has shifted from noise prediction (Wolleb et al., 2022b; Wu et al., 2022) to mask prediction (Fu et al., 2023; Zbinden et al., 2023) due to the superior performance and faster learning. Furthermore, Fu et al. (2023) highlighted a limitation of diffusion models, noting the misalignment between training and inference procedures, since training samples were generated from ground truth masks. This raises concerns of data leakage as discussed in Chen et al. (2022a). However, there have been limited studies in medical image segmentation that rigorously compare diffusion models with their non-diffusion counterparts and examine diffusion training efficiency.

In this work, we present a substantial extension to the preliminary work (Fu et al., 2023) and focus on an improvement in the diffusion denoising model training strategy that applies to 2D and 3D medical image segmentation in different modalities. First, a novel recycling approach has been introduced. Different from Fu et al. (2023), in the first step during training, the input is completely corrupted by noise instead of a partially corrupted ground truth. This seemingly minor adjustment effectively eliminates the ground truth information from model inputs, which further aligns the training strategy toward inference. The proposed diffusion models can refine or maintain segmentation accuracy throughout the inference process. On the contrary, all other diffusion models demonstrate declining or unstable performance trends. Our research showcases the superior performance of our method compared to established diffusion training strategies (Ho et al., 2020; Chen et al., 2022b; Watson et al., 2023; Fu et al., 2023) for both denoising diffusion probabilistic model-based (Ho et al., 2020) and denoising diffusion implicit model-based (Song et al., 2020a) sampling procedures. We also achieved on-par performance with non-diffusion baselines, which had not been observed in the previous study (Fu et al., 2023). Second, we introduce an ensemble model that averages the predicted probabilities from the proposed diffusion-based model and non-diffusion counterpart, resulting in significant improvement to the non-diffusion baseline. Third, we extended the experiments to four large data sets: 2D muscle ultrasound with 3910 images, 3D abdominal CT with 300 images, 3D prostate MR with 589 images, and 3D brain MR with 1251 images, further demonstrating the robustness of the proposed method against different applications and data types. Lastly, we integrated a Transformer block into our network architecture. This brings our models in line with contemporary state-of-the-art approaches, rendering our findings more pertinent to real-world applications. To mitigate the increased memory consumption resulting from this addition, we employed patch-based training and inference strategies. The JAX-based framework has been released on https://github.com/mathpluscode/ImgX-DiffSeg.

2 Related Works

The diffusion process is a Markov process where data structures are gradually noise-corrupted and eventually destroyed (noising process). A reverse diffusion process (denoising process) can then be learned, where the objective is to gradually recover the data structure. Sohl-Dickstein et al. (2015) first proposed diffusion models which map the disrupted data to a noise distribution. Ho et al. (2020) have shown that such modeling is equivalent to score-matching models, a class of models that estimates the gradient of the log-density (Hyvärinen and Dayan, 2005; Vincent, 2011; Song and Ermon, 2019, 2020). This led to a simplified variational lower bound training objective and a denoising diffusion probabilistic model (DDPM) (Ho et al., 2020). DDPM achieved state-of-the-art performance for unconditional image generation on CIFAR10 at the time. In practice, DDPMs were found suboptimal on log-likelihood estimation and Nichol and Dhariwal (2021) addressed this with a learnable variance schedule, sinusoidal noise schedule, and an importance sampling for time steps. Furthermore, because diffusion models are trained with hundreds or thousands of steps, inference with the same number of steps is time-consuming. Therefore, different strategies have been proposed to enable faster sampling. While Nichol and Dhariwal (2021) suggested variance resampling without modifying the probabilistic distribution, Song et al. (2020a) derived a deterministic model, denoising diffusion implicit model (DDIM), which shares the same marginal distribution as DDPM. Liu et al. (2022) further generalized the reverse step of DDIM into an ordinary differential equation and used high-order numerical methods (e.g., Runge-Kutta method) with predicted noise to perform sampling with second-order convergence. Besides, Zheng et al. (2022); Lyu et al. (2022); Guo et al. (2022) accelerated diffusion model training by shortening the noising schedule and only considering a truncated diffusion chain with less noise. These unconditioned denoising diffusion models have been successfully applied in multiple medical imaging applications (Kazerouni et al., 2023), including brain MR image generation (Dorjsembe et al., 2022; Khader et al., 2022), optical coherence tomography denoising (Hu et al., 2022), and chest X-ray pleural effusion detection (Wolleb et al., 2022a).

Guided diffusion models have been developed to generate data in a controllable manner. Song et al. (2020b); Dhariwal and Nichol (2021) used gradients of pre-trained classifiers to bias the sampling process, without modifying the diffusion model training. Ho and Salimans (2022), on the other hand, modified the models to take additional information as input, enabling end-to-end conditional diffusion model training. For medical image synthesis, conditions can be patient biometric information (Pinaya et al., 2022b), genotypes data (Moghadam et al., 2023), or images from different modalities (Saeed et al., 2023). Conditional diffusion models have also been explored for medical image classification (Yang et al., 2023), segmentation (Wu et al., 2022; Rahman et al., 2023), and registration (Kim et al., 2022). Particularly for image segmentation, the diffusion models apply the noising and denoising on the segmentation masks, and the network takes a noisy mask and an image to perform denoising.

In contrast to the continuous spectrum of values found in natural images, image segmentation mask values are categorical and nominal. The Gaussian-based continuous diffusion processes behind DDPM and DDIM cannot be directly applied. Chen et al. (2022b) therefore encoded categories with binary bits and relaxed them to real values for continuous diffusion models. Han et al. (2022); Fu et al. (2023) encoded categories with one-hot embeddings and performed diffusion on scaled values. Li et al. (2022a); Strudel et al. (2022) encoded the discrete data and applied diffusion processes in embedding spaces directly. Alternatively, discrete diffusion models have been proposed to model the transition matrix between categories based on discrete probability distributions, including binomial distribution (Sohl-Dickstein et al., 2015), categorical distribution (Hoogeboom et al., 2021; Austin et al., 2021; Gu et al., 2022), and Bernoulli distribution (Chen et al., 2023). In this work, we follow Fu et al. (2023) to perform diffusion on scaled binary masks.

Originally, DDPM models were trained through noise prediction (Ho et al., 2020), where the loss was calculated between the predicted and sampled noises. Many diffusion-based segmentation models directly adopted this strategy (Wolleb et al., 2022b; Wu et al., 2022). Alternatively, Kingma et al. (2021) derived an equivalent formulation (often known as $\mathbf{x}_{0}$ reparameterization) of the variational lower bound and simplified the loss to a norm between predicted data and the corresponding ground truth. For segmentation, this is equivalent to predicting the segmentation mask for each time step. Compared to noise prediction, multiple studies found that this mask prediction strategy is more efficient (Fu et al., 2023; Wang et al., 2023; Lai et al., 2023). Furthermore, Chen et al. (2022b) suggested self-conditioning to use these predictions as additional input to improve diffusion models for image synthesis. Self-conditioning contains two steps: the first step predicts a noise-free sample given a noise-corrupted sample only; the second step uses the same timestep and inputs the same noise-corrupted sample, as well as the prediction from the first step. This technique was later adopted for protein design (Watson et al., 2023) with an additional reverse step, where the second step performs denoising in a smaller timestep where the noise level is lower. However, in both cases, the noisy samples are directly derived from the ground truth, which is not available during inference. This risks data leakage during training and empirically leads to overfitting and lack of generalization as discussed in Chen et al. (2022a); Kolbeinsson and Mikolajczyk (2022); Lai et al. (2023). Chen et al. (2022a); Young et al. (2022) addressed this issue by controlling the signal-to-noise ratio so that less information is preserved after noising: Chen et al. (2022a) scaled the mask value ranges to implicitly amplify the noise level, and Young et al. (2022) explicitly varied the scale and standard deviation of the Gaussian noise added to the masks. On the other hand, Kolbeinsson and Mikolajczyk (2022) proposed recursive denoising that iterates through each step during training, without using ground truth as input. However, such a strategy extends the training length by a factor of hundreds or more, making it practically infeasible for larger 3D medical image data sets. Following these studies, Fu et al. (2023) concluded that the lack of generalization in diffusion-based segmentation models is due to the misalignment between training and inference processes. Fu et al. (2023) thus presented a two-step recycling training strategy: the first step ingests a partially noised sample for mask prediction; the predicted mask is then noise-corrupted again for denoising training. Compared to recursive denoising, this method requires a limited training time increase. This method also resembles PD-DDPM (Guo et al., 2022), where a pre-segmentation is used for noising. However, PD-DDPM requires a separate pre-segmentation network and more device memory, and is thus not suitable for 3D image segmentation applications.

3 Background

3.1 Denoising Diffusion Probabilistic Model

$$\mathbf{x}_{T} \rightleftharpoons \cdots \rightleftharpoons \mathbf{x}_{t} \underset{q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})}{\overset{p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})}{\rightleftharpoons}} \mathbf{x}_{t-1} \rightleftharpoons \cdots \rightleftharpoons \mathbf{x}_{0}$$
Definition

The denoising diffusion probabilistic models (DDPM) (Ho et al., 2020) consider a forward process (illustrated from right to left in Section 3.1): given a sample $\mathbf{x}_{0}\sim q(\mathbf{x}_{0})$, a noise-corrupted sample $\mathbf{x}_{t}$ follows a multivariate normal distribution at timestep $t\in\{1,\cdots,T\}$, $q(\mathbf{x}_{t}\mid\mathbf{x}_{t-1})=\mathcal{N}(\mathbf{x}_{t};\sqrt{1-\beta_{t}}\mathbf{x}_{t-1},\beta_{t}\mathbf{I})$, where $\beta_{t}\in[0,1]$. As Gaussians are closed under convolution, given $\mathbf{x}_{0}$, $\mathbf{x}_{t}$ can be directly sampled from $\mathbf{x}_{0}$ as follows, $q(\mathbf{x}_{t}\mid\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I})$. Correspondingly, a reverse process (illustrated from left to right in Section 3.1) denoises $\mathbf{x}_{t}$ at each step, for $t\in\{T,\cdots,1\}$, $p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})=\mathcal{N}(\mathbf{x}_{t-1};\bm{\mu}_{\theta}(\mathbf{x}_{t},t),\sigma_{t}^{2}\mathbf{I})$, with a predicted mean $\bm{\mu}_{\theta}(\mathbf{x}_{t},t)$ and variance $\sigma_{t}^{2}\mathbf{I}$. $\sigma_{t}$ is a pre-defined schedule dependent on timestep $t$. In this work, $\sigma_{t}^{2}=\tilde{\beta}_{t}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$ with $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. The mean $\bm{\mu}_{\theta}(\mathbf{x}_{t},t)=\frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}\hat{\mathbf{x}}_{0}+\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\sqrt{\alpha_{t}}\mathbf{x}_{t}$, also known as $\mathbf{x}_{0}$ parameterization, where $\hat{\mathbf{x}}_{0}$ is an estimation of $\mathbf{x}_{0}$ from a learned neural network $\hat{\mathbf{x}}_{0}=f_{\theta}(\mathbf{x}_{t},t)$.
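
The quantities defined above can be precomputed once per schedule. The following is a minimal JAX sketch, not the released implementation: the function names are hypothetical, and the linear $\beta_{t}$ schedule with the endpoints of Ho et al. (2020) is an assumption made only for illustration. Array index `i` corresponds to timestep $t=i+1$, with $\bar{\alpha}_{0}=1$.

```python
import jax.numpy as jnp


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Assumed linear beta_t schedule (illustrative endpoints, t = 1..T)."""
    return jnp.linspace(beta_start, beta_end, num_steps)


def posterior_terms(betas):
    """Return alpha_bar_t and the coefficients of the x_0-parameterized mean.

    mu_theta = coef_x0 * x0_hat + coef_xt * x_t, variance = beta_tilde_t.
    """
    alphas = 1.0 - betas
    alphas_bar = jnp.cumprod(alphas)
    # alpha_bar_{t-1}, using alpha_bar_0 = 1 for the first timestep.
    alphas_bar_prev = jnp.concatenate([jnp.ones(1), alphas_bar[:-1]])
    coef_x0 = jnp.sqrt(alphas_bar_prev) * betas / (1.0 - alphas_bar)
    coef_xt = (1.0 - alphas_bar_prev) * jnp.sqrt(alphas) / (1.0 - alphas_bar)
    variance = (1.0 - alphas_bar_prev) / (1.0 - alphas_bar) * betas
    return alphas_bar, coef_x0, coef_xt, variance


betas = linear_beta_schedule(1000)
alphas_bar, coef_x0, coef_xt, variance = posterior_terms(betas)
```

At $t=1$ the mean reduces to $\hat{\mathbf{x}}_{0}$ and the variance to zero, so the last reverse step returns the network prediction itself.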

Training

For each step during training, a noise-corrupted sample $\mathbf{x}_{t}$ is sampled and input to the neural network $f_{\theta}$ to predict $\mathbf{x}_{0}$. The network is then trained with loss $L_{\text{denoising}}(\theta)=\mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0})$ with $t$ sampled from $1$ to $T$. $L(\cdot,\cdot)$ is a loss function in the space of $\mathbf{x}$. In this work, importance sampling (Nichol and Dhariwal, 2021) is used for time step $t$, where the weight for $t$ is proportional to $\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0})$.

\begin{align}
\mathbf{x}_{t} &\sim \mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}), && \text{(Sampling)} \tag{1a}\\
\hat{\mathbf{x}}_{0} &= f_{\theta}(t,\mathbf{x}_{t}), && \text{(Prediction)} \tag{1b}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0}), && \text{(Loss calculation)} \tag{1c}
\end{align}
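
As an illustration of Equations 1a to 1c, the sketch below draws $\mathbf{x}_{t}$ by reparameterization and evaluates a single-sample estimate of the loss. It is a simplified example under stated assumptions: `model`, `params`, and `loss_fn` are hypothetical placeholders, and the timestep is drawn uniformly rather than with the importance sampling used in this work.

```python
import jax
import jax.numpy as jnp


def q_sample(key, x0, alpha_bar_t):
    """Draw x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = jax.random.normal(key, x0.shape)
    return jnp.sqrt(alpha_bar_t) * x0 + jnp.sqrt(1.0 - alpha_bar_t) * noise


def denoising_loss(params, key, x0, alphas_bar, model, loss_fn):
    """Single-sample estimate of L_denoising for one data sample x0."""
    key_t, key_noise = jax.random.split(key)
    num_steps = alphas_bar.shape[0]
    # Assumption: uniform t in {1, ..., T}; alphas_bar[t - 1] holds alpha_bar_t.
    t = jax.random.randint(key_t, (), 1, num_steps + 1)
    x_t = q_sample(key_noise, x0, alphas_bar[t - 1])  # (1a) sampling
    x0_pred = model(params, t, x_t)                   # (1b) prediction
    return loss_fn(x0, x0_pred)                       # (1c) loss
```

Passing `denoising_loss` to `jax.value_and_grad` yields the loss and its gradient with respect to `params` for an optimizer update.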
Inference

At inference time, the denoising starts with a randomly sampled Gaussian noise $\mathbf{x}_{T}\sim\mathcal{N}(\bm{0},\bm{I})$ and the data is denoised step-by-step for $t=T,\cdots,1$:

\begin{align*}
p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t}) &= \mathcal{N}(\mathbf{x}_{t-1};\bm{\mu}_{\theta}(\mathbf{x}_{t},t),\sigma_{t}^{2}\mathbf{I})\\
\bm{\mu}_{\theta}(\mathbf{x}_{t},t) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_{t}}{1-\bar{\alpha}_{t}}\hat{\mathbf{x}}_{0}+\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\sqrt{\alpha_{t}}\mathbf{x}_{t}
\end{align*}

Optionally, the variance schedule $\beta_{t}$ can be down-sampled to reduce the number of inference steps (Nichol and Dhariwal, 2021). A detailed review of DDPM and the loss is summarised in Section A and we refer the readers to Sohl-Dickstein et al. (2015); Ho et al. (2020); Nichol and Dhariwal (2021); Kingma et al. (2021) and other literature for in-depth understanding and derivations.
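
The reverse process described in this subsection can be written as a short loop. This is a sketch only: `model` and `params` are hypothetical stand-ins for $f_{\theta}$, a plain Python loop is used for readability rather than speed, and no schedule down-sampling is applied.

```python
import jax
import jax.numpy as jnp


def ddpm_sample(key, model, params, shape, betas):
    """Ancestral DDPM sampling with the x_0-parameterized posterior mean."""
    alphas = 1.0 - betas
    alphas_bar = jnp.cumprod(alphas)
    alphas_bar_prev = jnp.concatenate([jnp.ones(1), alphas_bar[:-1]])
    num_steps = betas.shape[0]

    key, sub = jax.random.split(key)
    x_t = jax.random.normal(sub, shape)  # x_T ~ N(0, I)
    for t in range(num_steps, 0, -1):
        i = t - 1  # array index of timestep t
        x0_pred = model(params, t, x_t)
        coef_x0 = jnp.sqrt(alphas_bar_prev[i]) * betas[i] / (1.0 - alphas_bar[i])
        coef_xt = (1.0 - alphas_bar_prev[i]) * jnp.sqrt(alphas[i]) / (1.0 - alphas_bar[i])
        mean = coef_x0 * x0_pred + coef_xt * x_t
        var = (1.0 - alphas_bar_prev[i]) / (1.0 - alphas_bar[i]) * betas[i]
        key, sub = jax.random.split(key)
        # At t = 1 the variance is zero, so x_0 equals the predicted mean.
        x_t = mean + jnp.sqrt(var) * jax.random.normal(sub, shape)
    return x_t
```

Each iteration draws $\mathbf{x}_{t-1}\sim p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_{t})$, so the number of network evaluations equals the number of entries in `betas`.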

3.2 Diffusion for Segmentation

When applying diffusion models for segmentation, noising and denoising are performed on the segmentation masks. The ground-truth binary mask, where channels correspond to classes that include the background, is denoted by $\mathbf{x}_{0}$. For the $i$-th pixel/voxel, the value for the $j$-th channel is $1$ if it belongs to class $j$ and $-1$ otherwise. The training process (illustrated in the method comparison figure) is similar to Equation 1 except that the segmentation network $f_{\theta}(I,\mathbf{x}_{t},t)$ now takes the image $I$ as an additional input for prediction $\hat{\mathbf{x}}_{0}$. $L(\cdot,\cdot)$ is a weighted sum of cross entropy and foreground-only Dice loss.

\begin{align}
\mathbf{x}_{t} &\sim \mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I}), && \text{(Sampling)} \tag{2a}\\
\hat{\mathbf{x}}_{0} &= f_{\theta}(I,t,\mathbf{x}_{t}), && \text{(Prediction)} \tag{2b}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0}), && \text{(Loss calculation)} \tag{2c}
\end{align}
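
The mask encoding used as $\mathbf{x}_{0}$ can be sketched as follows. The helper names are hypothetical and a channels-last layout is assumed; decoding a denoised mask by a per-pixel argmax over channels is also an assumption made for illustration, not a description of the released code.

```python
import jax
import jax.numpy as jnp


def encode_mask(labels, num_classes):
    """Map integer class labels to a channels-last mask with values in {-1, 1}."""
    one_hot = jax.nn.one_hot(labels, num_classes)  # values in {0, 1}
    return 2.0 * one_hot - 1.0


def decode_mask(x):
    """Assumed decoding: pick the channel with the largest value per pixel."""
    return jnp.argmax(x, axis=-1)


labels = jnp.array([[0, 1], [2, 1]])      # toy 2x2 label map with 3 classes
x0 = encode_mask(labels, num_classes=3)   # shape (2, 2, 3)
```

Every pixel then carries exactly one channel equal to $1$ and the remaining channels equal to $-1$, including the background channel.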

4 Methods

At each training step, the recycling considers a sampled time step $t<T$ and a data sample $\mathbf{x}_{0}$. First, a noise-corrupted sample $\mathbf{x}_{T}$ at time step $T$ is sampled, with $\sqrt{\bar{\alpha}_{T}}\approx 0$. $\mathbf{x}_{T}$ is fed to the network $f_{\theta}$ to perform a prediction $\hat{\mathbf{x}}_{0}=f_{\theta}(I,T,\mathbf{x}_{T})$. This prediction is then noise-corrupted to generate $\mathbf{x}_{t}$. A second prediction $\hat{\mathbf{x}}_{0}=f_{\theta}(I,t,\mathbf{x}_{t})$ (overriding the $\hat{\mathbf{x}}_{0}$ for simplicity) is produced and used for loss calculation (see the method comparison figure). Formally, at each timestep $t$, the proposed recycling (denoted as “Diff. rec. $\mathbf{x}_{T}$”) has the following steps.

\begin{align}
\mathbf{x}_{T} &\sim \mathcal{N}(\mathbf{x}_{T};\sqrt{\bar{\alpha}_{T}}\mathbf{x}_{0},(1-\bar{\alpha}_{T})\mathbf{I}), && (\text{rec. } \mathbf{x}_{T}, \text{ step 1, sampling}) \tag{3a}\\
\hat{\mathbf{x}}_{0} &= \text{StopGradient}(f_{\theta}(I,T,\mathbf{x}_{T})), && (\text{rec. } \mathbf{x}_{T}, \text{ step 1, prediction}) \tag{3b}\\
\mathbf{x}_{t} &\sim \mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\hat{\mathbf{x}}_{0},(1-\bar{\alpha}_{t})\mathbf{I}), && (\text{rec. } \mathbf{x}_{T}, \text{ step 2, sampling}) \tag{3c}\\
\hat{\mathbf{x}}_{0} &= f_{\theta}(I,t,\mathbf{x}_{t}), && (\text{rec. } \mathbf{x}_{T}, \text{ step 2, prediction}) \tag{3d}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0}), && \text{(Loss calculation)} \tag{3e}
\end{align}

In particular, stop gradient is applied to $\hat{\mathbf{x}}_{0}$ in the first step to prevent gradient calculation across the two steps, which reduces training time. Optionally, a model with exponential moving averaged weights can be used for the first step, but it requires even more memory. Compared to Equation 2, the recycling modification only affects training and does not change the network architecture. It is independent of the sampling strategy during inference. Therefore, the DDIM sampler can also be used for inference.
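
The two-step procedure of Equation 3 can be sketched in JAX as below. This is an illustrative example rather than the released implementation: `model`, `params`, and `loss_fn` are hypothetical placeholders, the timestep $t<T$ is drawn uniformly instead of by importance sampling, and the same (non-averaged) weights are used in both steps.

```python
import jax
import jax.numpy as jnp


def q_sample(key, x0, alpha_bar_t):
    """Draw x_t ~ N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)."""
    noise = jax.random.normal(key, x0.shape)
    return jnp.sqrt(alpha_bar_t) * x0 + jnp.sqrt(1.0 - alpha_bar_t) * noise


def recycling_loss(params, key, image, x0, alphas_bar, model, loss_fn):
    """Single-sample loss for the "Diff. rec. x_T" training strategy."""
    num_steps = alphas_bar.shape[0]  # T
    key_t, key_first, key_second = jax.random.split(key, 3)
    # Assumption: uniform t in {1, ..., T - 1}, so that t < T.
    t = jax.random.randint(key_t, (), 1, num_steps)

    # Step 1 (3a, 3b): predict from an almost fully noised sample at time T.
    x_T = q_sample(key_first, x0, alphas_bar[num_steps - 1])
    x0_rec = jax.lax.stop_gradient(model(params, image, num_steps, x_T))

    # Step 2 (3c, 3d): noise the prediction, not the ground truth, then denoise.
    x_t = q_sample(key_second, x0_rec, alphas_bar[t - 1])
    x0_pred = model(params, image, t, x_t)

    # (3e): the ground truth enters only through the loss target.
    return loss_fn(x0, x0_pred)
```

Because of `jax.lax.stop_gradient`, differentiating `recycling_loss` with respect to `params` only back-propagates through the second forward pass.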

The recycling strategy we propose in this work differs from the one introduced in Fu et al. (2023) (denoted as “Diff. rec. $\mathbf{x}_{t+1}$”), illustrated in the method comparison figure and the equations below,

\begin{align}
\mathbf{x}_{t+1} &\sim \mathcal{N}(\mathbf{x}_{t+1};\sqrt{\bar{\alpha}_{t+1}}\mathbf{x}_{0},(1-\bar{\alpha}_{t+1})\mathbf{I}), && (\text{rec. } \mathbf{x}_{t+1}, \text{ step 1, sampling}) \tag{4a}\\
\hat{\mathbf{x}}_{0} &= \text{StopGradient}(f_{\theta}(I,t+1,\mathbf{x}_{t+1})), && (\text{rec. } \mathbf{x}_{t+1}, \text{ step 1, prediction}) \tag{4b}\\
\mathbf{x}_{t} &\sim \mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\hat{\mathbf{x}}_{0},(1-\bar{\alpha}_{t})\mathbf{I}), && (\text{rec. } \mathbf{x}_{t+1}, \text{ step 2, sampling}) \tag{4c}\\
\hat{\mathbf{x}}_{0} &= f_{\theta}(I,t,\mathbf{x}_{t}), && (\text{rec. } \mathbf{x}_{t+1}, \text{ step 2, prediction}) \tag{4d}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_{0},\mathbf{x}_{t}}L(\mathbf{x}_{0},\hat{\mathbf{x}}_{0}), && \text{(Loss calculation)} \tag{4e}
\end{align}

In the new approach (“Diff. rec. $\mathbf{x}_{T}$”), the first step is consistently executed at the time step $T$ instead of $t+1$ as shown in Equation 3. Compared to $\mathbf{x}_{t+1}$ in Equation 4, $\mathbf{x}_{T}$ is fully noised and contains even less ground truth information during the initial step. Specifically, for a given time step $t$, $\mathbf{x}_{t}\sim\mathcal{N}(\mathbf{x}_{t};\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0},(1-\bar{\alpha}_{t})\mathbf{I})$, which can be reparameterized as $\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\bm{\epsilon}_{t}$ with $\bm{\epsilon}_{t}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$. In this work, $\alpha_{t}$ is a monotonically decreasing noise schedule ranging from $0.999$ to $0.98$ for $t=1$ to $T$. Correspondingly, $\sqrt{\bar{\alpha}_{t}}$ monotonically decreases from $0.99995$ to $0.00632$. $\mathbf{x}_{T}=\sqrt{\bar{\alpha}_{T}}\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{T}}\bm{\epsilon}_{T}$ with $\sqrt{\bar{\alpha}_{T}}=0.00632$ can be considered to contain almost no ground truth information. The information can also be empirically measured by cross entropy and Dice score, and an example is presented in Figure 2 in Section D. 
This seemingly minor modification removes the ground truth information from model inputs, essentially reducing the risk of data leakage and training overfitting. This adaptation guides the model to learn the denoising task based on its initial prediction, rather than ground truth. Consequently, the model can effectively denoise and refine the provided noisy mask, ultimately predicting the ground truth.
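
The quoted coefficients can be checked numerically. The following is a minimal sketch, assuming that the linear $\beta$ schedule of Section 5.3 is sampled at 1001 evenly spaced points including both endpoints and that $\alpha_t = 1 - \beta_t$; the function name `sqrt_alpha_bar` is illustrative and is not taken from the released code.

```python
import math

def sqrt_alpha_bar(beta_start=1e-4, beta_end=0.02, num_steps=1001):
    """Return [sqrt(alpha_bar_1), ..., sqrt(alpha_bar_T)] for a linear beta schedule."""
    values = []
    log_alpha_bar = 0.0  # accumulate the product in log space for stability
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        log_alpha_bar += math.log(1.0 - beta)  # alpha_t = 1 - beta_t (assumed)
        values.append(math.exp(0.5 * log_alpha_bar))
    return values

schedule = sqrt_alpha_bar()
print(f"{schedule[0]:.5f}")   # weight of x_0 at t = 1
print(f"{schedule[-1]:.5f}")  # weight of x_0 at t = T
```

Under these assumptions, the two printed values agree with the $0.99995$ and $0.00632$ quoted above, which illustrates how little of $\mathbf{x}_0$ remains in $\mathbf{x}_T$.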

Recycling also differs from the self-conditioning methods proposed in Chen et al. (2022b) (“Diff. sc. $\mathbf{x}_t$”) and Watson et al. (2023) (“Diff. sc. $\mathbf{x}_{t+1}$”). Although self-conditioning also requires two forward passes during training, it differs from recycling in multiple aspects. First, noisy samples in self-conditioning are always generated based on the ground truth $\mathbf{x}_0$, while the second forward step of recycling does not rely on ground truth for noisy sample generation. Second, self-conditioning provides an additional input $\hat{\mathbf{x}}_0$, while recycling does not. Lastly, in self-conditioning, $\hat{\mathbf{x}}_0$ is replaced by zeros with a probability of 50%, while recycling is always applied. The training strategies are detailed in LABEL:fig:method_comparison and Section C. For further details, we refer the reader to the reference papers (Chen et al., 2022b; Watson et al., 2023).
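
To make the difference concrete, the proposed recycling step can be sketched as below. This is a schematic reading of the description in this section, not the released implementation: masks are flat lists of floats, `model` is a hypothetical callable `model(image, t, noisy_mask)`, the time step of the second pass is assumed to be sampled uniformly, and the stop-gradient on the first prediction is only indicated by a comment because plain Python has no automatic differentiation. Equation 3 remains the authoritative definition.

```python
import math
import random

def q_sample(x0, sqrt_ab, rng):
    """Sample x_t = sqrt(ab) * x0 + sqrt(1 - ab) * eps with eps ~ N(0, I)."""
    sigma = math.sqrt(1.0 - sqrt_ab ** 2)
    return [sqrt_ab * v + sigma * rng.gauss(0.0, 1.0) for v in x0]

def recycling_step(model, image, x0, sqrt_alpha_bars, rng):
    """One schematic 'Diff. rec. x_T' step; returns (t, prediction) for the loss."""
    T = len(sqrt_alpha_bars)
    # First forward pass: always start from the (almost) fully noised x_T.
    x_T = q_sample(x0, sqrt_alpha_bars[T - 1], rng)
    x0_hat = model(image, T, x_T)  # treated as a constant (stop-gradient)
    # Second forward pass: the noisy sample is built from the prediction,
    # not from the ground truth x0.
    t = rng.randint(1, T)  # assumed uniform over 1..T
    x_t = q_sample(x0_hat, sqrt_alpha_bars[t - 1], rng)
    return t, model(image, t, x_t)
```

The returned prediction would then be compared with the ground truth $\mathbf{x}_0$ in the loss. In self-conditioning, by contrast, the noisy sample would still be drawn from $\mathbf{x}_0$ and the first prediction would be passed to the model as an extra input.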

5 Experiments

5.1 Experiment Setting

A range of experiments was performed on four data sets (Section 5.2) to evaluate the proposed method and the trained models from different aspects.

5.1.1 Diffusion Training Strategy Comparison

First, the proposed recycling training strategy (“Diff. rec. $\mathbf{x}_T$”) was compared with standard diffusion models (“Diff.”) and other diffusion training strategies that require two forward steps, to evaluate the training efficiency with identical network architectures and compute budget. The compared diffusion training strategies include the previously proposed recycling method (Fu et al., 2023) (“Diff. rec. $\mathbf{x}_{t+1}$”) and two self-conditioning techniques from Chen et al. (2022b) (“Diff. sc. $\mathbf{x}_t$”) and Watson et al. (2023) (“Diff. sc. $\mathbf{x}_{t+1}$”). For each model trained with a different strategy, both DDPM and DDIM samplers were evaluated. Importantly, the predictions at each inference step were assessed to study the variation of performance along the inference process.

5.1.2 Comparison to Non-diffusion Models

The proposed methods were compared with non-diffusion-based models using identical architectures and the same compute budget. An ensemble model was also evaluated, where the predicted probabilities from the diffusion model and non-diffusion model were averaged. Models’ segmentation accuracy was assessed at different granularities: per foreground class or averaged across foreground classes. Bland-Altman plots were used to analyze the differences between models.
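
The ensemble described above can be illustrated with a minimal sketch. It assumes that each model outputs, for every pixel or voxel, a list of class probabilities; the function names and the toy values are illustrative only.

```python
def ensemble_probabilities(probs_a, probs_b):
    """Average two per-pixel class-probability maps of identical shape."""
    return [[(pa + pb) / 2.0 for pa, pb in zip(row_a, row_b)]
            for row_a, row_b in zip(probs_a, probs_b)]

def argmax_labels(probs):
    """Pick the most probable class index for every pixel."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

# Two pixels, two classes (background, foreground); values are made up.
no_diff = [[0.6, 0.4], [0.3, 0.7]]
diff = [[0.2, 0.8], [0.4, 0.6]]
labels = argmax_labels(ensemble_probabilities(no_diff, diff))
```

In this toy case the first pixel is labeled as foreground after averaging, although the non-diffusion model alone would have labeled it as background.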

5.1.3 Ablation Studies for Recycling

Ablation studies were performed, including assessing the performance with different lengths of inference and evaluating the stochasticity across different seeds during inference. Compared to the previous work (Fu et al., 2023), the effectiveness of the Transformer architecture and of the change of training noise schedule was also evaluated.

5.1.4 Evaluation Metrics

Different methods were evaluated using binary Dice score (DS) and 95% Hausdorff distance (HD), averaging over foreground classes on the test sets. Dice score is reported in percentage, between 0% and 100%. For Hausdorff distance, the values are in mm for 3D volumes and pixels for 2D images. Paired Student’s t-tests with a significance level of $\alpha = 0.05$ were performed on the Dice score to test statistical significance between model performance.
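
As an illustration of the Dice metric, the sketch below computes the binary Dice score per foreground class and averages it. It assumes label maps flattened to lists of class indices with class 0 as background, and it assigns 100% when a class is absent from both prediction and ground truth; this convention and the function names are assumptions rather than details taken from the released code.

```python
def binary_dice(pred, target):
    """Dice score in percent between two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 100.0  # both masks empty (assumed convention)
    return 100.0 * 2.0 * intersection / total

def mean_foreground_dice(pred_labels, target_labels, num_classes):
    """Average binary Dice over foreground classes 1..num_classes-1."""
    scores = []
    for c in range(1, num_classes):
        scores.append(binary_dice([int(p == c) for p in pred_labels],
                                  [int(t == c) for t in target_labels]))
    return sum(scores) / len(scores)

score = mean_foreground_dice([0, 1, 1, 2, 2, 0], [0, 1, 0, 2, 2, 2], num_classes=3)
```

For this toy example, class 1 has a Dice score of about 66.7% and class 2 of 80%, so the mean over foreground classes is about 73.3%.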

5.2 Data

5.2.1 Muscle Ultrasound

The data set (https://data.mendeley.com/datasets/3jykz7wz8d/1) (Marzola et al., 2021) provides 3910 labeled transverse musculoskeletal ultrasound images, which were split into 2531, 666, and 713 images for training, validation, and test sets, respectively. Images had the shape $480 \times 512$. The predicted masks were post-processed, following Marzola et al. (2021). After filling the holes, multiple morphological operations were performed, including an erosion with a disk of radius 3 pixels, a dilation with a disk of radius 5 pixels, and an opening with a disk of radius 10 pixels. Afterward, only the largest connected component was preserved if the second largest structure was smaller than 75% of the largest one; otherwise, the most superficial (i.e., towards the top of the image) one between the two largest components was preserved. Finally, holes were filled if there were any.
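
The component-selection rule at the end of this post-processing can be sketched as follows. The sketch assumes that connected components have already been extracted and summarized as hypothetical `(area, top_row)` tuples, and that "most superficial" is measured by the smallest row index of a component; both are illustrative choices rather than the procedure of Marzola et al. (2021) verbatim.

```python
def select_component(components):
    """Return the index of the connected component to keep.

    components: list of (area_in_pixels, top_row) tuples, one per component.
    """
    if not components:
        raise ValueError("at least one component is required")
    # Indices sorted by decreasing area.
    order = sorted(range(len(components)),
                   key=lambda i: components[i][0], reverse=True)
    if len(order) == 1:
        return order[0]
    first, second = order[0], order[1]
    if components[second][0] < 0.75 * components[first][0]:
        return first  # the largest component clearly dominates
    # Comparable sizes: keep the one closer to the top of the image.
    return min((first, second), key=lambda i: components[i][1])
```

For example, with components `[(1000, 50), (700, 10)]` the first one is kept because 700 is below 75% of 1000, whereas with `[(1000, 50), (800, 10)]` the second one is kept because it is more superficial.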

5.2.2 Abdominal CT (AMOS)

The data set (https://zenodo.org/record/7155725#.ZAkbe-zP2rO) (Ji et al., 2022) provides 200 and 100 CT image-mask pairs for 15 abdominal organs in training and validation sets. The validation set was randomly split into non-overlapping validation and test sets, with 10 and 90 images, respectively. The images were first resampled with a voxel dimension of $1.5 \times 1.5 \times 5.0$ (mm). HU values were clipped to $[-991, 362]$ and images were normalized so that the intensity had zero mean and unit variance. Lastly, images were center-cropped to shape $192 \times 128 \times 128$. During training, the patch size was $128 \times 128 \times 128$. During inference, the overlap between patches was $64 \times 0 \times 0$, and the predictions on the overlap were averaged.

5.2.3 Prostate MR

The data set (https://zenodo.org/record/7013610#.ZAkaXuzP2rM) (Li et al., 2022b) contains 589 T2-weighted image-mask pairs for 8 anatomical structures from 7 institutions. The images were randomly split into non-overlapping training, validation, and test sets, with 411, 14, and 164 images in each split, respectively. The validation split has two images from each institution. The images were resampled with a voxel dimension of $0.75 \times 0.75 \times 2.5$ (mm). Afterward, images were normalized so that the intensity had zero mean and unit variance. Lastly, the images were center-cropped to shape $256 \times 256 \times 48$. During training, the patch size was $256 \times 256 \times 32$. During inference, the overlap between patches was $0 \times 0 \times 16$, and the predictions on the overlap were averaged.

5.2.4 Brain MR (BraTS 2021)

The data set (https://www.kaggle.com/datasets/dschettler8845/brats-2021-task1) (Baid et al., 2021) provides 1251 segmented mpMRI scans for brain tumor. The data set was randomly split into non-overlapping training, validation, and test sets, with 938, 31, and 282 samples, respectively. The whole tumor mask was generated as the foreground class, including GD-enhancing tumor, the peritumoral edematous/invaded tissue, and the necrotic tumor core. Therefore, the task was a binary segmentation. Four modalities are available, including T1-weighted (T1), post-contrast T1-weighted (T1Gd), T2-weighted (T2), and T2 Fluid Attenuated Inversion Recovery (T2-FLAIR). The voxel dimension was $1.0 \times 1.0 \times 1.0$ (mm). Images were first normalized so that the intensity had zero mean and unit variance. Lastly, images were center-cropped to shape $179 \times 219 \times 155$ to remove the common background. During training, the patch size was $128 \times 128 \times 128$. During inference, the overlap between patches was $77 \times 37 \times 101$, and the predictions on the overlap were averaged.
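
The patch overlaps reported for the three 3D data sets are consistent with covering each axis by at most two patches aligned to opposite borders of the cropped image. This reading is inferred from the numbers and is not a description of the released code; the helper below is a hypothetical sketch of that arithmetic.

```python
def two_patch_overlap(image_shape, patch_shape):
    """Overlap per axis when at most two border-aligned patches cover each axis."""
    overlap = []
    for size, patch in zip(image_shape, patch_shape):
        if patch >= size:
            overlap.append(0)  # a single patch covers this axis
        elif 2 * patch < size:
            raise ValueError("two patches cannot cover this axis")
        else:
            overlap.append(2 * patch - size)
    return tuple(overlap)

abdominal_ct = two_patch_overlap((192, 128, 128), (128, 128, 128))
prostate_mr = two_patch_overlap((256, 256, 48), (256, 256, 32))
brain_mr = two_patch_overlap((179, 219, 155), (128, 128, 128))
```

With the shapes given in Section 5.2, this yields $64 \times 0 \times 0$, $0 \times 0 \times 16$, and $77 \times 37 \times 101$, matching the reported overlaps.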

5.3 Implementation Details

Figure 1: U-net architecture for diffusion and non-diffusion models. The inputs are concatenated when a noisy mask (from diffusion models) or predicted mask (from self-conditioning) is provided. The tensor is enriched with convolution (time-conditioned for diffusion models) and down-sampling layers, then passed into a Transformer with positional encoding. The output is then enriched with convolution and up-sampling layers, and finally, prediction is performed with an additional $1 \times 1$ convolutional layer. “Pred.” stands for predicted.

2D and 3D U-net variants with attention mechanisms were used for benchmarking the reference performance from cross-data-set non-diffusion models. The architecture is illustrated in Figure 1. U-nets have four layers with 32, 64, 128, and 256 channels, respectively. The numbers of learnable parameters are summarized in Table 7 in Section E. For diffusion-based models, the noise-corrupted masks were concatenated to the input images. Time was encoded using sinusoidal positional embedding (Rombach et al., 2022) and used in the convolution layers.

For denoising training, a linear $\beta$ schedule between 0.0001 and 0.02 was used for $T = 1001$ (illustrated in Figure 2 in Section D). The segmentation-specific loss function is a weighted sum of cross-entropy and foreground-only Dice loss, with weights 20 and 1, respectively (Kirillov et al., 2023). Random rotation, translation, and scaling were adopted for data augmentation during training. Training hyper-parameters are listed in Table 6 in Section E. Hyper-parameters were configured empirically without extensive tuning.
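
The weighting of the two loss terms can be sketched as below. The sketch assumes per-pixel class probabilities, integer class targets with class 0 as background, at least one foreground class, and a soft Dice formulation with a small smoothing constant; the exact Dice variant and reduction used in the released code may differ.

```python
import math

def segmentation_loss(probs, target, ce_weight=20.0, dice_weight=1.0, eps=1e-7):
    """Weighted sum of cross-entropy and foreground-only soft Dice loss.

    probs: list of per-pixel class-probability lists; target: list of class indices.
    """
    num_classes = len(probs[0])
    # Mean cross-entropy over pixels; probabilities are clipped to avoid log(0).
    ce = -sum(math.log(max(p[t], eps)) for p, t in zip(probs, target)) / len(target)
    dice_losses = []
    for c in range(1, num_classes):  # skip background class 0
        inter = sum(p[c] for p, t in zip(probs, target) if t == c)
        denom = sum(p[c] for p in probs) + sum(1 for t in target if t == c)
        dice_losses.append(1.0 - (2.0 * inter + eps) / (denom + eps))
    return ce_weight * ce + dice_weight * sum(dice_losses) / len(dice_losses)
```

With these weights, the cross-entropy term dominates the scale of the loss, while the Dice term directly rewards overlap on the foreground classes.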

Models were trained once and checkpoints were saved every 500 steps. The checkpoint that had the best mean binary Dice score (without background class) on the validation set was used for testing. For DDIM, the training was the same as for DDPM, while both validation and testing were performed using DDIM. The variance schedule was down-sampled to 5 steps (Nichol and Dhariwal, 2021). Experiments were carried out using bfloat16 mixed precision on TPU v3-8, which has $16 \times 8$ GB device memory in total. However, each device has only 16 GB memory, meaning that the model and data have to fit into 16 GB memory. The JAX-based framework has been released at https://github.com/mathpluscode/ImgX-DiffSeg.
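
Down-sampling a variance schedule can be sketched as follows, using the respacing rule of Nichol and Dhariwal (2021), in which the new $\beta$ values are derived from ratios of consecutive kept $\bar{\alpha}$ values. Which training steps are kept is an assumption here (evenly spaced, including the first and last), and the function name is illustrative.

```python
def downsample_betas(betas, num_inference_steps):
    """Respace a training beta schedule onto evenly spaced time steps.

    Requires num_inference_steps >= 2. Returns (kept_indices, new_betas).
    """
    T = len(betas)
    # Evenly spaced indices that include the first and last training step.
    kept = [round(i * (T - 1) / (num_inference_steps - 1))
            for i in range(num_inference_steps)]
    alpha_bar, alpha_bars = 1.0, []
    for beta in betas:
        alpha_bar *= 1.0 - beta
        alpha_bars.append(alpha_bar)
    new_betas, prev = [], 1.0
    for idx in kept:
        # New beta chosen so that the cumulative product matches alpha_bar[idx].
        new_betas.append(1.0 - alpha_bars[idx] / prev)
        prev = alpha_bars[idx]
    return kept, new_betas

train_betas = [1e-4 + (0.02 - 1e-4) * i / 1000 for i in range(1001)]
kept, inference_betas = downsample_betas(train_betas, 5)
```

By construction, the cumulative product of $1 - \beta$ over the five respaced steps equals $\bar{\alpha}_T$ of the full training schedule, so the short schedule ends at the same noise level.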

6 Results and Discussion

6.1 Diffusion Training Strategy Comparison

Table 1: Comparison of diffusion training strategies. “Diff.” represents standard diffusion. “Diff. sc. $\mathbf{x}_t$” and “Diff. sc. $\mathbf{x}_{t+1}$” represent self-conditioning from Chen et al. (2022b) and Watson et al. (2023), respectively. “Diff. rec. $\mathbf{x}_{t+1}$” and “Diff. rec. $\mathbf{x}_T$” represent recycling from Fu et al. (2023) and the proposed recycling in this work, respectively. The best results are in bold and underline indicates the difference to the second best is significant with p-value $< 0.05$.
| Method | DDPM DS ↑ | DDPM HD ↓ | DDIM DS ↑ | DDIM HD ↓ |
|---|---|---|---|---|
| Diff. | 86.60 ± 12.38 | 41.11 ± 35.48 | 86.18 ± 12.41 | 42.31 ± 35.82 |
| Diff. sc. $\mathbf{x}_t$ | 86.35 ± 14.14 | 40.42 ± 37.53 | 85.96 ± 13.78 | 42.00 ± 36.76 |
| Diff. sc. $\mathbf{x}_{t+1}$ | 87.14 ± 11.48 | 39.24 ± 32.83 | 86.30 ± 11.49 | 41.89 ± 32.72 |
| Diff. rec. $\mathbf{x}_{t+1}$ | 87.44 ± 12.39 | 39.68 ± 36.21 | 87.43 ± 12.25 | 39.82 ± 35.39 |
| Diff. rec. $\mathbf{x}_T$ | 88.23 ± 11.69 | 35.37 ± 31.79 | 88.21 ± 11.70 | 35.52 ± 31.91 |

(a) Muscle Ultrasound

| Method | DDPM DS ↑ | DDPM HD ↓ | DDIM DS ↑ | DDIM HD ↓ |
|---|---|---|---|---|
| Diff. | 85.25 ± 5.36 | 7.12 ± 3.83 | 85.59 ± 5.24 | 7.13 ± 3.98 |
| Diff. sc. $\mathbf{x}_t$ | 86.04 ± 5.12 | 7.06 ± 4.20 | 85.50 ± 5.14 | 7.21 ± 4.16 |
| Diff. sc. $\mathbf{x}_{t+1}$ | 85.86 ± 5.27 | 6.98 ± 3.54 | 85.25 ± 5.42 | 7.28 ± 3.72 |
| Diff. rec. $\mathbf{x}_{t+1}$ | 86.48 ± 5.24 | 6.69 ± 4.59 | 86.35 ± 5.31 | 6.75 ± 4.55 |
| Diff. rec. $\mathbf{x}_T$ | 87.45 ± 5.43 | 6.56 ± 5.44 | 87.45 ± 5.43 | 6.55 ± 5.43 |

(b) Abdominal CT

| Method | DDPM DS ↑ | DDPM HD ↓ | DDIM DS ↑ | DDIM HD ↓ |
|---|---|---|---|---|
| Diff. | 83.61 ± 4.87 | 5.10 ± 2.40 | 83.11 ± 4.81 | 5.00 ± 2.35 |
| Diff. sc. $\mathbf{x}_t$ | 83.47 ± 4.85 | 5.17 ± 2.65 | 82.49 ± 4.88 | 5.42 ± 2.70 |
| Diff. sc. $\mathbf{x}_{t+1}$ | 83.97 ± 4.85 | 4.93 ± 2.66 | 83.00 ± 4.89 | 5.10 ± 2.64 |
| Diff. rec. $\mathbf{x}_{t+1}$ | 84.29 ± 5.12 | 4.59 ± 2.21 | 84.21 ± 4.89 | 4.96 ± 2.92 |
| Diff. rec. $\mathbf{x}_T$ | 85.54 ± 5.20 | 4.40 ± 1.96 | 85.54 ± 5.20 | 4.41 ± 1.96 |

(c) Prostate MR

| Method | DDPM DS ↑ | DDPM HD ↓ | DDIM DS ↑ | DDIM HD ↓ |
|---|---|---|---|---|
| Diff. | 90.29 ± 12.98 | 8.46 ± 15.55 | 89.94 ± 13.00 | 8.55 ± 15.50 |
| Diff. sc. $\mathbf{x}_t$ | 90.12 ± 12.39 | 9.55 ± 17.18 | 89.73 ± 12.61 | 9.67 ± 16.86 |
| Diff. sc. $\mathbf{x}_{t+1}$ | 89.11 ± 14.70 | 9.63 ± 17.47 | 88.75 ± 14.77 | 9.62 ± 16.97 |
| Diff. rec. $\mathbf{x}_{t+1}$ | 86.97 ± 10.94 | 9.83 ± 12.62 | 84.76 ± 13.42 | 12.52 ± 15.55 |
| Diff. rec. $\mathbf{x}_T$ | 92.29 ± 8.55 | 7.03 ± 13.48 | 92.29 ± 8.55 | 7.03 ± 13.48 |

(d) Brain MR

Our proposed recycling method (Diff. rec. $\mathbf{x}_T$) achieved mean Dice scores of 88.23%, 87.45%, 85.54%, and 92.29% on muscle ultrasound, abdominal CT, prostate MR, and brain MR data sets, respectively. These scores marked absolute improvements of 1.63%, 2.20%, 1.93%, and 2.00% over standard diffusion models, respectively. The relative improvements are 1.88%, 2.58%, 2.31%, and 2.22%, respectively. Impressively, this novel strategy consistently outperformed the other three training approaches in terms of both Dice score and Hausdorff distance. The observed differences were significant for all data sets in terms of Dice score ($p = 0.003$ for muscle ultrasound and $p < 0.001$ for other data sets). These findings held for both the DDPM and the DDIM samplers, underscoring the wide applicability of the proposed training strategy.
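
The absolute and relative improvements quoted above follow directly from the DDPM Dice scores in Table 1. The short script below reproduces the arithmetic; the dictionary and function names are illustrative.

```python
# Mean DDPM Dice scores (in percent) copied from Table 1.
standard = {"muscle ultrasound": 86.60, "abdominal CT": 85.25,
            "prostate MR": 83.61, "brain MR": 90.29}
proposed = {"muscle ultrasound": 88.23, "abdominal CT": 87.45,
            "prostate MR": 85.54, "brain MR": 92.29}

def improvements(baseline, new):
    """Return (absolute gain in Dice points, relative gain in percent)."""
    absolute = new - baseline
    return absolute, 100.0 * absolute / baseline

for name in standard:
    absolute, relative = improvements(standard[name], proposed[name])
    print(f"{name}: +{absolute:.2f} absolute, +{relative:.2f}% relative")
```

Here the absolute improvement is a difference in percentage points of Dice score, while the relative improvement divides that difference by the Dice score of the standard diffusion model.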

As depicted in Figure 3 in Section F.1, standard diffusion models often produce segmentation masks in the last step that are less accurate than the initial prediction. Similar challenges were observed with self-conditioning strategies and the previously proposed recycling method. The newly introduced recycling method was the only approach that improved initial segmentation predictions for more than half of the test images. Moreover, the average performance per step is visualized in LABEL:fig:per_step_ddpm, where diffusion models frequently exhibit gradually declining or unstable performance during inference, in terms of both Dice score and Hausdorff distance. Interestingly, the optimal prediction often emerges not at the final step but at an intermediate stage. This was observed in all diffusion models except the one trained with the proposed recycling method, for which the quality of segmentation consistently improved or remained stable throughout the inference process. A qualitative comparison on an example muscle ultrasound image is illustrated in LABEL:fig:comparison_muscle_ultrasound, where the proposed diffusion model was able to refine the segmentation mask progressively. Similar observations were made with the DDIM sampler, as shown in Figure 4 and Figure 5. This finding aligns with the discussions from Kolbeinsson and Mikolajczyk (2022) and Lai et al. (2023) that the performance of diffusion-based segmentation models is strongly influenced by the prediction of the initial step. For self-conditioning or the previously proposed recycling, the denoising training relies on the ground truth to varying degrees; therefore, the diffusion models are trained with ground truth-like initial predictions. 
However, no ground truth is available during inference, and the distributions of initial predictions from the trained models are dissimilar from ground truths. This results in an out-of-sample inference and therefore a declining performance. In contrast, the proposed method ingests model predictions for both the training and inference phases without the bias toward ground truth. These observations reaffirm the importance and benefits of harmonizing the training and inference processes. This alignment is crucial to mitigate data leakage, prevent overfitting, and help generalization.

6.2 Comparison to Non-diffusion Models

Table 2: Segmentation performance comparison to non-diffusion models. “No diff.” represents non-diffusion model. “Diff. rec. $\mathbf{x}_T$” represents the diffusion model with proposed recycling. “Ensemble” represents the model averaging the probabilities from “No diff.” and “Diff. rec. $\mathbf{x}_T$”. The inference sampler is DDPM. The best results are in bold and underline indicates the difference to non-diffusion model is significant with p-value $< 0.05$.

| Data Set | Method | DS ↑ | HD ↓ |
|---|---|---|---|
| Muscle Ultrasound | No diff. | 88.15 ± 10.77 | 36.86 ± 30.04 |
| Muscle Ultrasound | Diff. rec. $\mathbf{x}_T$ | 88.23 ± 11.69 | 35.37 ± 31.79 |
| Muscle Ultrasound | Ensemble | 88.88 ± 10.59 | 34.01 ± 28.75 |
| Abdominal CT | No diff. | 87.59 ± 5.10 | 6.36 ± 3.86 |
| Abdominal CT | Diff. rec. $\mathbf{x}_T$ | 87.45 ± 5.43 | 6.56 ± 5.44 |
| Abdominal CT | Ensemble | 88.29 ± 5.21 | 5.60 ± 3.13 |
| Prostate MR | No diff. | 85.22 ± 5.18 | 4.62 ± 2.37 |
| Prostate MR | Diff. rec. $\mathbf{x}_T$ | 85.54 ± 5.20 | 4.40 ± 1.96 |
| Prostate MR | Ensemble | 85.95 ± 5.12 | 4.32 ± 2.01 |
| Brain MR | No diff. | 92.43 ± 9.10 | 5.20 ± 9.56 |
| Brain MR | Diff. rec. $\mathbf{x}_T$ | 92.29 ± 8.55 | 7.03 ± 13.48 |
| Brain MR | Ensemble | 92.67 ± 8.60 | 5.03 ± 8.41 |

The proposed diffusion models (“Diff. rec. $\mathbf{x}_T$”) were compared with their non-diffusion counterparts (“No diff.”), where models with identical architectures were trained under the same scheme with the same compute budget. This provides a fair comparison without application-specific adjustments. For diffusion models, the performance with DDPM was selected. As shown in Table 2, the diffusion models yielded performance similar to that of the non-diffusion models across all data sets. The difference in Dice score is not significant for muscle ultrasound, abdominal CT, and brain MR, but the diffusion model had a higher Dice score for prostate MR ($p = 0.001$). Furthermore, LABEL:fig:diff_vs_nodiff shows that the proposed diffusion model achieved a higher Dice score on more than 50% of the samples for muscle ultrasound, abdominal CT, and prostate MR data sets. To the best of our knowledge, this is the first time that diffusion models achieved comparable performance against standard non-diffusion-based models with the same architecture and compute budget.

By ensembling these two models via averaging the probabilities, we achieved mean Dice scores of 88.88%, 88.29%, 85.95%, and 92.67% on muscle ultrasound, abdominal CT, prostate MR, and brain MR data sets, respectively. The improvements in Dice score were significant across all four data sets ($p = 0.037$ for brain MR and $p < 0.001$ for other data sets). In particular, LABEL:fig:diff_vs_nodiff shows that the ensemble model reached a higher Dice score compared to non-diffusion models on 70.83%, 91.11%, 89.02%, and 64.54% of the samples in the test set for muscle ultrasound, abdominal CT, prostate MR, and brain MR, respectively. These proportions marked an absolute increase of 19.36%, 40.00%, 28.04%, and 19.44% compared to the diffusion model alone. Moreover, abdominal CT and prostate MR are two data sets with multiple classes and their per-class segmentation performances are summarised in Table 8 and Table 9 in Section F.1, respectively. Upon comparing diffusion models and non-diffusion models, neither consistently outperformed the other across all classes. However, the ensemble model reached the best performance across all classes and the improvement of Dice score is significant for 13 out of 15 classes in abdominal CT data and all classes in prostate MR data (all p-values $\leq 0.01$, excluding Spleen $p = 0.06$ and Gall bladder $p = 0.876$). Multiple examples of the segmentation error have also been visualized in LABEL:fig:3d and LABEL:fig:brain_mr_3d.

We highlight that the value of competitive performance from alternative methods, in particular from a different class of generative model-based approaches, goes beyond replacing current segmentation algorithms in specific applications. Our results demonstrate a consistent improvement from combining diffusion and non-diffusion models across applications, even when they yielded a similar performance individually. This is one potential use of the proposed improved diffusion models in addition to the well-established non-diffusion baseline. Future research could explore application-specific tuning for further performance improvements.

6.3 Ablation Studies

6.3.1 Number of sampling steps

Table 3: Diffusion with different numbers of sampling steps. Sampler is DDPM. Diffusion models were trained using the proposed recycling method (Diff. rec. $\mathbf{x}_T$). OOM indicates that out of memory errors were encountered. Best results are in bold.

| Data Set | # Sampling Steps | Dice Score | Hausdorff Distance |
|---|---|---|---|
| Muscle Ultrasound | 2 | 88.01 ± 12.07 | 36.55 ± 32.66 |
| Muscle Ultrasound | 5 | 88.23 ± 11.69 | 35.37 ± 31.79 |
| Muscle Ultrasound | 11 | 88.30 ± 11.29 | 35.25 ± 30.64 |
| Abdominal CT | 2 | 87.44 ± 5.43 | 6.56 ± 5.42 |
| Abdominal CT | 5 | 87.45 ± 5.43 | 6.56 ± 5.44 |
| Abdominal CT | 11 | OOM | OOM |
| Prostate MR | 2 | 85.54 ± 5.19 | 4.40 ± 1.96 |
| Prostate MR | 5 | 85.54 ± 5.20 | 4.40 ± 1.96 |
| Prostate MR | 11 | 85.54 ± 5.20 | 4.40 ± 1.96 |
| Brain MR | 2 | 92.29 ± 8.54 | 7.03 ± 13.47 |
| Brain MR | 5 | 92.29 ± 8.55 | 7.03 ± 13.48 |
| Brain MR | 11 | 92.29 ± 8.57 | 7.02 ± 13.48 |

Diffusion models were trained using a thousand steps, yet employing the same number of steps for inference can be cost-prohibitive, particularly for processing 3D image volumes. As a result, practical inference commonly utilizes a condensed schedule with a limited number of steps. While this approach reduces computational expenses, the resulting sample quality might be compromised. An ablation study of the number of timesteps during inference was therefore performed across data sets with the proposed recycling-based diffusion model. The DDPM sampler was used. The results are summarised in Table 3. Notably, increasing the number of steps yielded a higher Dice score for the muscle ultrasound data set, but the difference is not significant ($p \geq 0.05$). For prostate MR and brain MR data sets, the models maintained almost the same performance regardless of the inference length ($p \geq 0.05$). Given that longer inference times and increased device memory usage are associated with more timesteps (e.g. out-of-memory errors were encountered with abdominal CT at 11 steps), the trade-off between computational resources and performance suggests that a five-step sampling schedule provides the optimal balance.

6.3.2 Inference Variance

Table 4: Diffusion model performance across different inference seeds. For each sample, the maximum difference ($\Delta$) across five random seeds is calculated. The average across all samples is reported.

Mean $\Delta$ Dice Score

| Data Set | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|---|
| Muscle Ultrasound | 0.0212 | 0.0165 | 0.0122 | 0.0081 | 0.0051 |
| Abdominal CT | 0.0009 | 0.0010 | 0.0009 | 0.0008 | 0.0004 |
| Prostate MR | 0.0004 | 0.0004 | 0.0004 | 0.0004 | 0.0002 |
| Brain MR | 0.0005 | 0.0005 | 0.0005 | 0.0003 | 0.0001 |

Mean $\Delta$ Hausdorff Distance

| Data Set | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|---|
| Muscle Ultrasound | 10.0582 | 7.0020 | 4.7440 | 3.1758 | 1.8164 |
| Abdominal CT | 0.1481 | 0.1339 | 0.1221 | 0.0751 | 0.0673 |
| Prostate MR | 0.0447 | 0.0426 | 0.0499 | 0.0431 | 0.0209 |
| Brain MR | 0.0758 | 0.0779 | 0.0678 | 0.0616 | 0.0197 |

Different from deterministic models, the inference process of the diffusion model inherently incorporates stochasticity and models a distribution of the segmentation masks. Using the DDPM sampler with the proposed recycling-based diffusion model, the inference on each data set was repeated with five different random seeds. Consequently, each sample has five distinct predicted masks. The maximum differences across the five predictions were computed for the Dice score and Hausdorff distance, denoted by $\Delta$ Dice score and $\Delta$ Hausdorff distance, respectively. The average of this performance difference across all samples in the test set is reported in Table 4 for all data sets. While the magnitude of the average difference (mean $\Delta$) varies across data sets, a common trend was observed where mean $\Delta$ diminished during the sampling process for both metrics. In other words, despite different initial predictions, the model’s predictions gradually converge as the difference across seeds decreases. Moreover, the relative magnitude of the mean $\Delta$ Hausdorff distance (e.g. 1.82 at the last step for muscle ultrasound represents around 5% fluctuation compared to 35.37, the mean Hausdorff distance to ground truth) was larger than the relative magnitude for Dice score (e.g. 0.0051 at the last step for muscle ultrasound was around 0.006% fluctuation compared to 88.23, the mean Dice score). We hypothesize that the variation among predictions may predominantly revolve around local refinements in mask boundaries, as opposed to significant alterations like expansion or contraction of foreground areas. 
This may open a direction for further improving diffusion training: instead of noising each pixel/voxel independently, which results in fragmented and disjointed masks, the noising can be morphology-informed such that the noise-corrupted masks expand or contract the foreground with continuous boundaries.
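
The statistic reported in Table 4 can be sketched as below. The maximum pairwise difference across seeds equals the range (maximum minus minimum) of the per-seed values, which is then averaged over samples; the function name and the example numbers are hypothetical.

```python
def mean_max_difference(scores_per_sample):
    """Mean over samples of the largest difference across random seeds.

    scores_per_sample[i] holds one metric value per seed for sample i.
    """
    deltas = [max(scores) - min(scores) for scores in scores_per_sample]
    return sum(deltas) / len(deltas)

# Two made-up samples, each evaluated with five seeds.
example = [[0.90, 0.91, 0.89, 0.90, 0.92],
           [0.80, 0.80, 0.81, 0.80, 0.80]]
mean_delta = mean_max_difference(example)
```

In this toy example the per-sample differences are 0.03 and 0.01, giving a mean $\Delta$ of 0.02. Applying the same function to the predictions at each sampling step would give one row of Table 4.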

6.3.3 Transformer

Table 5: Segmentation performance without Transformer. “No diff.” represents non-diffusion model. “Diff. rec. $\mathbf{x}_T$” represents the diffusion model with proposed recycling. The inference sampler is DDPM. The best results are in bold and underline indicates the difference to non-diffusion model is significant with p-value $< 0.05$.

| Data Set | Method | Transformer | DS ↑ | HD ↓ |
|---|---|---|---|---|
| Muscle US | No diff. | ✗ | 86.66 ± 13.16 | 45.01 ± 38.86 |
| Muscle US | No diff. | ✓ | 88.15 ± 10.77 | 36.86 ± 30.04 |
| Muscle US | Diff. rec. $\mathbf{x}_T$ | ✗ | 88.36 ± 12.60 | 35.67 ± 34.12 |
| Muscle US | Diff. rec. $\mathbf{x}_T$ | ✓ | 88.23 ± 11.69 | 35.37 ± 31.79 |
| Abdominal CT | No diff. | ✗ | 87.48 ± 5.02 | 6.63 ± 4.03 |
| Abdominal CT | No diff. | ✓ | 87.59 ± 5.10 | 6.36 ± 3.86 |
| Abdominal CT | Diff. rec. $\mathbf{x}_T$ | ✗ | 86.89 ± 5.49 | 6.91 ± 4.35 |
| Abdominal CT | Diff. rec. $\mathbf{x}_T$ | ✓ | 87.45 ± 5.43 | 6.56 ± 5.44 |
| Prostate MR | No diff. | ✗ | 84.82 ± 5.69 | 4.55 ± 2.17 |
| Prostate MR | No diff. | ✓ | 85.22 ± 5.18 | 4.62 ± 2.37 |
| Prostate MR | Diff. rec. $\mathbf{x}_T$ | ✗ | 85.63 ± 5.19 | 4.59 ± 2.71 |
| Prostate MR | Diff. rec. $\mathbf{x}_T$ | ✓ | 85.54 ± 5.20 | 4.40 ± 1.96 |
| Brain MR | No diff. | ✗ | 92.03 ± 9.67 | 5.29 ± 8.53 |
| Brain MR | No diff. | ✓ | 92.43 ± 9.10 | 5.20 ± 9.56 |
| Brain MR | Diff. rec. $\mathbf{x}_T$ | ✗ | 92.04 ± 9.47 | 7.25 ± 13.76 |
| Brain MR | Diff. rec. $\mathbf{x}_T$ | ✓ | 92.29 ± 8.55 | 7.03 ± 13.48 |

Compared to Fu et al. (2023), the model includes a Transformer layer at the bottom of the U-net encoder. This component has one layer, representing 16% and 6% of the trainable parameters for 2D and 3D networks, respectively (see Table 7 in Section E). An ablation study was performed for the proposed recycling approach and non-diffusion models. The results are summarised in Table 5. For non-diffusion models, the addition of the Transformer component improved the Dice score across all applications ($p < 0.001$ for muscle ultrasound; $p \geq 0.05$ for abdominal CT; $p = 0.001$ for prostate MR; and $p = 0.0178$ for brain MR), making this architecture the stronger reference model. For diffusion models, a significantly higher Dice score was observed for abdominal CT data ($p < 0.001$), and the differences were not significant for other applications ($p \geq 0.05$).

6.3.4 Length of training noise schedule

It is worth noting that Fu et al. (2023) recommended incorporating a shortened variance schedule during training, mirroring that used during inference, in addition to the recycling technique. This modification resulted in enhanced performance for every training strategy on the muscle ultrasound data set (as detailed in LABEL:tab:timesteps_ablation_muscle). However, this adaptation did not yield enhancements for the proposed training strategy (“Diff. rec. $\mathbf{x}_T$”) on the abdominal CT data set (as detailed in LABEL:tab:timesteps_ablation_abdominal). Moreover, not all differences observed were statistically significant. This suggests that the advantage of the modified training variance schedule may be application-dependent and sensitive to changes in model architecture and hyper-parameters. In this work, the variance schedule was maintained at 1001 steps.

7 Conclusion

In this work, we have proposed a novel training strategy for diffusion-based segmentation models that removes the dependency on ground truth masks during denoising training. In contrast to standard diffusion-based segmentation models and those employing self-conditioning or alternative recycling techniques, our approach consistently maintains or enhances segmentation performance throughout the progressive inference process. Through extensive experiments across four medical imaging data sets with different dimensionalities and modalities, we demonstrated statistically significant improvements over all diffusion baseline models for both DDPM and DDIM samplers. Our analysis identified, for the first time, a common limitation of existing diffusion model training for segmentation tasks: the use of ground truth data for denoising training leads to data leakage. By instead using the model’s prediction at the initial step, we align the training process with the inference procedure, effectively reducing over-fitting and promoting better generalization. While existing diffusion models underperformed non-diffusion-based segmentation baselines, the proposed recycling training strategy bridged this performance gap, allowing diffusion models to attain comparable performance. To the best of our knowledge, this is the first time diffusion models have achieved such parity under identical architecture and compute budget. By ensembling the diffusion and non-diffusion models, consistent and significant improvements were observed across all data sets, demonstrating one potential value of the approach. Nevertheless, challenges remain in advancing diffusion-based segmentation models further. Future work could explore discrete diffusion models tailored for categorical data, or implement diffusion in latent space to further reduce compute costs.
Although the presented experimental results primarily demonstrate methodological development, the fact that they were obtained on four large clinical data sets represents a promising step toward real-world applications. We argue that the reported development is potentially important, as it may lead to better clinical outcomes and improved patient care in the respective applications. For example, in both the abdominal CT and prostate MR tasks, avoiding surrounding healthy structures may be sensitive to how accurately they are localized in planning imaging. This sensitivity can be high and nonlinear; arguably, therefore, an improvement perceived as marginal might still benefit those with smaller targets, such as those in liver resection and focal therapy of prostate cancer, or those relying on highly variable ultrasound imaging guidance.


Acknowledgments

This work was supported by the EPSRC grant (EP/T029404/1), the Wellcome/EPSRC Centre for Interventional and Surgical Sciences (203145Z/16/Z), the International Alliance for Cancer Early Detection, an alliance between Cancer Research UK (C28070/A30912, C73666/A31378), Canary Center at Stanford University, the University of Cambridge, OHSU Knight Cancer Institute, University College London and the University of Manchester, and Cloud TPUs from Google’s TPU Research Cloud (TRC).


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

We declare we do not have conflicts of interest.

References

  • Amit et al. (2021) Tomer Amit, Eliya Nachmani, Tal Shaharbany, and Lior Wolf. Segdiff: Image segmentation with diffusion probabilistic models. arXiv preprint arXiv:2112.00390, 2021.
  • Austin et al. (2021) Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
  • Baid et al. (2021) Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314, 2021.
  • Bieder et al. (2023) Florentin Bieder, Julia Wolleb, Alicia Durrer, Robin Sandkühler, and Philippe C Cattin. Diffusion models for memory-efficient processing of 3d medical images. arXiv preprint arXiv:2303.15288, 2023.
  • Chen et al. (2023) Tao Chen, Chenhui Wang, and Hongming Shan. Berdiff: Conditional bernoulli diffusion model for medical image segmentation. arXiv preprint arXiv:2304.04429, 2023.
  • Chen et al. (2022a) Ting Chen, Lala Li, Saurabh Saxena, Geoffrey Hinton, and David J Fleet. A generalist framework for panoptic segmentation of images and videos. arXiv preprint arXiv:2210.06366, 2022a.
  • Chen et al. (2022b) Ting Chen, Ruixiang Zhang, and Geoffrey Hinton. Analog bits: Generating discrete data using diffusion models with self-conditioning. arXiv preprint arXiv:2208.04202, 2022b.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  • Dorjsembe et al. (2022) Zolnamar Dorjsembe, Sodtavilan Odonchimed, and Furen Xiao. Three-dimensional medical image synthesis with denoising diffusion probabilistic models. In Medical Imaging with Deep Learning, 2022.
  • Fu et al. (2023) Yunguan Fu, Yiwen Li, Shaheer U Saeed, Matthew J Clarkson, and Yipeng Hu. Importance of aligning training strategy with evaluation for diffusion models in 3d multiclass segmentation. arXiv preprint arXiv:2303.06040, 2023.
  • Goodfellow et al. (2020) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • Gu et al. (2022) Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10696–10706, 2022.
  • Guo et al. (2022) Xutao Guo, Yanwu Yang, Chenfei Ye, Shang Lu, Yang Xiang, and Ting Ma. Accelerating diffusion models via pre-segmentation diffusion sampling for medical image segmentation. arXiv preprint arXiv:2210.17408, 2022.
  • Han et al. (2022) Xiaochuang Han, Sachin Kumar, and Yulia Tsvetkov. Ssd-lm: Semi-autoregressive simplex-based diffusion language model for text generation and modular control. arXiv preprint arXiv:2210.17432, 2022.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  • Hoogeboom et al. (2021) Emiel Hoogeboom, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. Argmax flows and multinomial diffusion: Learning categorical distributions. Advances in Neural Information Processing Systems, 34:12454–12465, 2021.
  • Hu et al. (2022) Dewei Hu, Yuankai K Tao, and Ipek Oguz. Unsupervised denoising of retinal oct with diffusion probabilistic model. In Medical Imaging 2022: Image Processing, volume 12032, pages 25–34. SPIE, 2022.
  • Hyvärinen and Dayan (2005) Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(4), 2005.
  • Ji et al. (2022) Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. arXiv preprint arXiv:2206.08023, 2022.
  • Kazerouni et al. (2023) Amirhossein Kazerouni, Ehsan Khodapanah Aghdam, Moein Heidari, Reza Azad, Mohsen Fayyaz, Ilker Hacihaliloglu, and Dorit Merhof. Diffusion models in medical imaging: A comprehensive survey. Medical Image Analysis, page 102846, 2023.
  • Khader et al. (2022) Firas Khader, Gustav Mueller-Franzes, Soroosh Tayebi Arasteh, Tianyu Han, Christoph Haarburger, Maximilian Schulze-Hagen, Philipp Schad, Sandy Engelhardt, Bettina Baessler, Sebastian Foersch, et al. Medical diffusion–denoising diffusion probabilistic models for 3d medical image generation. arXiv preprint arXiv:2211.03364, 2022.
  • Kim et al. (2022) Boah Kim, Inhwa Han, and Jong Chul Ye. Diffusemorph: unsupervised deformable image registration using diffusion model. In European Conference on Computer Vision, pages 347–364. Springer, 2022.
  • Kingma et al. (2021) Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021.
  • Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. arXiv preprint arXiv:2304.02643, 2023.
  • Kolbeinsson and Mikolajczyk (2022) Benedikt Kolbeinsson and Krystian Mikolajczyk. Multi-class segmentation from aerial views using recursive noise diffusion. arXiv preprint arXiv:2212.00787, 2022.
  • Lai et al. (2023) Zeqiang Lai, Yuchen Duan, Jifeng Dai, Ziheng Li, Ying Fu, Hongsheng Li, Yu Qiao, and Wenhai Wang. Denoising diffusion semantic segmentation with mask prior modeling. arXiv preprint arXiv:2306.01721, 2023.
  • Li et al. (2022a) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022a.
  • Li et al. (2022b) Yiwen Li, Yunguan Fu, Iani Gayo, Qianye Yang, Zhe Min, Shaheer Saeed, Wen Yan, Yipei Wang, J Alison Noble, Mark Emberton, et al. Prototypical few-shot segmentation for cross-institution male pelvic structures with spatial registration. arXiv preprint arXiv:2209.05160, 2022b.
  • Liu et al. (2022) Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. arXiv preprint arXiv:2202.09778, 2022.
  • Lyu et al. (2022) Zhaoyang Lyu, Xudong Xu, Ceyuan Yang, Dahua Lin, and Bo Dai. Accelerating diffusion models via early stop of the diffusion process. arXiv preprint arXiv:2205.12524, 2022.
  • Marzola et al. (2021) Francesco Marzola, Nens van Alfen, Jonne Doorduin, and Kristen M Meiburger. Deep learning segmentation of transverse musculoskeletal ultrasound images for neuromuscular disease assessment. Computers in Biology and Medicine, 135:104623, 2021.
  • Moghadam et al. (2023) Puria Azadi Moghadam, Sanne Van Dalen, Karina C Martin, Jochen Lennerz, Stephen Yip, Hossein Farahani, and Ali Bashashati. A morphology focused diffusion probabilistic model for synthesis of histopathology images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2000–2009, 2023.
  • Nichol and Dhariwal (2021) Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021.
  • Pinaya et al. (2022a) Walter HL Pinaya, Mark S Graham, Robert Gray, Pedro F Da Costa, Petru-Daniel Tudosiu, Paul Wright, Yee H Mah, Andrew D MacKinnon, James T Teo, Rolf Jager, et al. Fast unsupervised brain anomaly detection and segmentation with diffusion models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 705–714. Springer, 2022a.
  • Pinaya et al. (2022b) Walter HL Pinaya, Petru-Daniel Tudosiu, Jessica Dafflon, Pedro F da Costa, Virginia Fernandez, Parashkev Nachev, Sebastien Ourselin, and M Jorge Cardoso. Brain imaging generation with latent diffusion models. arXiv preprint arXiv:2209.07162, 2022b.
  • Rahman et al. (2023) Aimon Rahman, Jeya Maria Jose Valanarasu, Ilker Hacihaliloglu, and Vishal M Patel. Ambiguous medical image segmentation using diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11536–11546, 2023.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  • Saeed et al. (2023) Shaheer U. Saeed, Tom Syer, Wen Yan, Qianye Yang, Mark Emberton, Shonit Punwani, Matthew J. Clarkson, Dean C. Barratt, and Yipeng Hu. Bi-parametric prostate mr image synthesis using pathology and sequence-conditioned stable diffusion. arXiv preprint arXiv:2303.02094, 2023.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Song et al. (2020a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020a.
  • Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in neural information processing systems, 32, 2019.
  • Song and Ermon (2020) Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in neural information processing systems, 33:12438–12448, 2020.
  • Song et al. (2020b) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020b.
  • Strudel et al. (2022) Robin Strudel, Corentin Tallec, Florent Altché, Yilun Du, Yaroslav Ganin, Arthur Mensch, Will Grathwohl, Nikolay Savinov, Sander Dieleman, Laurent Sifre, et al. Self-conditioned embedding diffusion for text generation. arXiv preprint arXiv:2211.04236, 2022.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011.
  • Wang et al. (2023) Hefeng Wang, Jiale Cao, Rao Muhammad Anwer, Jin Xie, Fahad Shahbaz Khan, and Yanwei Pang. Dformer: Diffusion-guided transformer for universal image segmentation. arXiv preprint arXiv:2306.03437, 2023.
  • Wang et al. (2022) Risheng Wang, Tao Lei, Ruixia Cui, Bingtao Zhang, Hongying Meng, and Asoke K Nandi. Medical image segmentation using deep learning: A survey. IET Image Processing, 16(5):1243–1267, 2022.
  • Watson et al. (2023) Joseph L Watson, David Juergens, Nathaniel R Bennett, Brian L Trippe, Jason Yim, Helen E Eisenach, Woody Ahern, Andrew J Borst, Robert J Ragotte, Lukas F Milles, et al. De novo design of protein structure and function with rfdiffusion. Nature, pages 1–3, 2023.
  • Wolleb et al. (2022a) Julia Wolleb, Florentin Bieder, Robin Sandkühler, and Philippe C Cattin. Diffusion models for medical anomaly detection. In International Conference on Medical image computing and computer-assisted intervention, pages 35–45. Springer, 2022a.
  • Wolleb et al. (2022b) Julia Wolleb, Robin Sandkühler, Florentin Bieder, Philippe Valmaggia, and Philippe C Cattin. Diffusion models for implicit image segmentation ensembles. In International Conference on Medical Imaging with Deep Learning, pages 1336–1348. PMLR, 2022b.
  • Wu et al. (2022) Junde Wu, Huihui Fang, Yu Zhang, Yehui Yang, and Yanwu Xu. Medsegdiff: Medical image segmentation with diffusion probabilistic model. arXiv preprint arXiv:2211.00611, 2022.
  • Wu et al. (2023) Junde Wu, Rao Fu, Huihui Fang, Yu Zhang, and Yanwu Xu. Medsegdiff-v2: Diffusion based medical image segmentation with transformer. arXiv preprint arXiv:2301.11798, 2023.
  • Xing et al. (2023) Zhaohu Xing, Liang Wan, Huazhu Fu, Guang Yang, and Lei Zhu. Diff-unet: A diffusion embedded network for volumetric segmentation. arXiv preprint arXiv:2303.10326, 2023.
  • Yang et al. (2023) Yijun Yang, Huazhu Fu, Angelica Aviles-Rivero, Carola-Bibiane Schönlieb, and Lei Zhu. Diffmic: Dual-guidance diffusion network for medical image classification. arXiv preprint arXiv:2303.10610, 2023.
  • Young et al. (2022) Sean I Young, Adrian V Dalca, Enzo Ferrante, Polina Golland, Bruce Fischl, and Juan Eugenio Iglesias. Sud: Supervision by denoising for medical image segmentation. arXiv preprint arXiv:2202.02952, 2022.
  • Zbinden et al. (2023) Lukas Zbinden, Lars Doorenbos, Theodoros Pissas, Raphael Sznitman, and Pablo Márquez-Neila. Stochastic segmentation with conditional categorical diffusion models. arXiv preprint arXiv:2303.08888, 2023.
  • Zheng et al. (2022) Huangjie Zheng, Pengcheng He, Weizhu Chen, and Mingyuan Zhou. Truncated diffusion probabilistic models. stat, 1050:7, 2022.

A Denoising Diffusion Probabilistic Model

We review the formulation of denoising diffusion probabilistic models (DDPM) from Sohl-Dickstein et al. (2015); Ho et al. (2020); Nichol and Dhariwal (2021).

A.1 Definition

$$
\mathbf{x}_T \rightleftharpoons \cdots \rightleftharpoons \mathbf{x}_t \underset{q(\mathbf{x}_t \mid \mathbf{x}_{t-1})}{\overset{p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)}{\rightleftharpoons}} \mathbf{x}_{t-1} \rightleftharpoons \cdots \rightleftharpoons \mathbf{x}_0
$$

Consider a continuous diffusion process (also named forward process or noising process): given a data point $\mathbf{x}_0 \sim q(\mathbf{x}_0)$ in $\mathbb{R}^D$, we add noise step by step to obtain $\mathbf{x}_t$ for $t=1,\cdots,T$ with the following multivariate normal distribution:

$$
q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t\mathbf{I})
$$

where $\beta_t \in [0,1]$ is a variance schedule. Given sufficiently large $T$ and a well-defined variance schedule, the distribution of $\mathbf{x}_T$ approximates an isotropic multivariate normal distribution.

$$
q(\mathbf{x}_t \mid \mathbf{x}_0) \rightarrow \mathcal{N}(\mathbf{x}_t; \mathbf{0}, \mathbf{I})
$$

Therefore, we can define a reverse process (also named denoising process): given a sample $\mathbf{x}_T \sim \mathcal{N}(\mathbf{x}_T; \mathbf{0}, \mathbf{I})$, we denoise the data using neural networks $\boldsymbol{\mu}_\theta: \mathbb{R}^D \rightarrow \mathbb{R}^D$ and $\mathbf{\Sigma}_\theta: \mathbb{R}^D \rightarrow \mathbb{R}^{D\times D}$ as follows:

$$
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \mathbf{\Sigma}_\theta(\mathbf{x}_t, t))
$$

In this work, an isotropic variance is assumed with $\mathbf{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma_t^2\mathbf{I}$, such that

$$
p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2\mathbf{I})
$$
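As a numerical illustration of the forward process defined above, the following minimal sketch computes how much of $\mathbf{x}_0$ survives after $T$ noising steps. It assumes a linear variance schedule from $10^{-4}$ to $0.02$ with $T=1001$; these endpoints and the helper names `linear_beta_schedule` and `cumulative_alpha` are illustrative assumptions, not values taken from this paper. It is written in plain Python for readability and is not the released JAX code.

```python
import math


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return beta_1..beta_T, linearly spaced (endpoints are illustrative assumptions)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """Return alpha_bar_t = prod_{s<=t} (1 - beta_s) for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


alpha_bars = cumulative_alpha(linear_beta_schedule(1001))
# Composing the Gaussian steps gives q(x_T | x_0) with mean sqrt(alpha_bar_T) * x_0
# and covariance (1 - alpha_bar_T) * I.
signal_scale = math.sqrt(alpha_bars[-1])
noise_variance = 1.0 - alpha_bars[-1]
print(f"mean scale: {signal_scale:.4f}, variance: {noise_variance:.6f}")
```

Under this assumed schedule the mean scale is close to zero and the variance close to one, which is the sense in which $\mathbf{x}_T$ approximates an isotropic normal sample.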

A.2 Variational Lower Bound

Considering $\mathbf{z} = \mathbf{x}_{1:T} \mid \mathbf{x}_0$ as latent variables for $\mathbf{x}_0$, we can derive the variational lower bound (VLB) as follows:

$$
\begin{aligned}
\log p_\theta(\mathbf{x}_0) &= D_{\text{KL}}(q(\mathbf{z}) \,\|\, p_\theta(\mathbf{z} \mid \mathbf{x}_0)) + \mathbb{E}_{q(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}_0, \mathbf{z})}{q(\mathbf{z})}\right] \\
&\geq \mathbb{E}_{q(\mathbf{z})}\left[\log \frac{p_\theta(\mathbf{x}_0, \mathbf{z})}{q(\mathbf{z})}\right] \\
&= -\Big(\mathbb{E}_{q(\mathbf{x}_1 \mid \mathbf{x}_0)} L_0 + \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{x}_t \mid \mathbf{x}_0)} L_{t-1} + L_T\Big)
\end{aligned}
$$

where

$$
\begin{aligned}
L_0 &= -\log p_\theta(\mathbf{x}_0 \mid \mathbf{x}_1) && \text{(reconstruction loss)} \\
L_{t-1} &= D_{\text{KL}}(q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)) && \text{(diffusion loss)} \\
L_T &= D_{\text{KL}}(q(\mathbf{x}_T \mid \mathbf{x}_0) \,\|\, p_\theta(\mathbf{x}_T)). && \text{(prior loss)}
\end{aligned}
$$

A.3 Diffusion Loss

In particular, we can derive the closed form of $L_{t-1}$ with

$$
\begin{aligned}
q(\mathbf{x}_t \mid \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I}) \\
q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t\mathbf{I})
\end{aligned}
$$

where

$$
\begin{aligned}
\alpha_t &= 1-\beta_t \\
\bar{\alpha}_t &= \prod_{s=1}^{t}\alpha_s \\
\tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0) &= \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_0 + \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\sqrt{\alpha_t}\,\mathbf{x}_t, && \text{(88)} \\
\tilde{\beta}_t &= \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t.
\end{aligned}
$$
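The posterior in Eq. 88 can be checked numerically. The sketch below compares the closed-form mean and variance with the result of multiplying the two Gaussian factors $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$ and $q(\mathbf{x}_{t-1} \mid \mathbf{x}_0)$ directly, for a single scalar element (which suffices because the covariances are isotropic). The function names and the numerical values are illustrative assumptions.

```python
import math


def posterior_from_formula(x_t, x_0, beta_t, alpha_bar_prev, alpha_bar_t):
    """Mean and variance of q(x_{t-1} | x_t, x_0) from Eq. 88 and beta_tilde_t."""
    alpha_t = 1.0 - beta_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * math.sqrt(alpha_t)
    variance = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return coef_x0 * x_0 + coef_xt * x_t, variance


def posterior_from_bayes(x_t, x_0, beta_t, alpha_bar_prev):
    """Same posterior, obtained by multiplying the two Gaussian factors in x_{t-1}."""
    alpha_t = 1.0 - beta_t
    # q(x_t | x_{t-1}) as a function of x_{t-1} has precision alpha_t / beta_t;
    # q(x_{t-1} | x_0) has precision 1 / (1 - alpha_bar_{t-1}).
    precision = alpha_t / beta_t + 1.0 / (1.0 - alpha_bar_prev)
    weighted_mean = (
        math.sqrt(alpha_t) * x_t / beta_t
        + math.sqrt(alpha_bar_prev) * x_0 / (1.0 - alpha_bar_prev)
    )
    return weighted_mean / precision, 1.0 / precision


beta_t, alpha_bar_prev = 0.02, 0.5  # illustrative values
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
mean_a, var_a = posterior_from_formula(0.3, 1.0, beta_t, alpha_bar_prev, alpha_bar_t)
mean_b, var_b = posterior_from_bayes(0.3, 1.0, beta_t, alpha_bar_prev)
print(mean_a, mean_b, var_a, var_b)
```

Both routes should agree up to floating-point error, which confirms the coefficients of $\mathbf{x}_0$ and $\mathbf{x}_t$ as well as $\tilde{\beta}_t$.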

A.3.1 Noise Prediction Loss ($\epsilon$-parameterization)

Consider the reparameterization in Ho et al. (2020),

$$
\begin{aligned}
\mathbf{x}_t(\mathbf{x}_0, \boldsymbol{\epsilon}) &= \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon} \\
\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) &= \frac{1}{\sqrt{1-\bar{\alpha}_t}}\mathbf{x}_t - \frac{\sqrt{\bar{\alpha}_t}}{\sqrt{1-\bar{\alpha}_t}}\mathbf{x}_0 \\
\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\Big)
\end{aligned}
$$
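The relations above can be exercised with a small round-trip check: noise a sample, recover the noise from $\mathbf{x}_t$ and $\mathbf{x}_0$, and confirm that the mean written in terms of $\boldsymbol{\epsilon}$ coincides with the posterior mean of Eq. 88 when the true noise is used. This is a scalar sketch with hypothetical function names and illustrative values, not the released implementation.

```python
import math


def noisy_sample(x_0, eps, alpha_bar_t):
    """x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    return math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps


def eps_from_x0(x_t, x_0, alpha_bar_t):
    """Invert the noising equation for eps, given x_t and x_0."""
    return (x_t - math.sqrt(alpha_bar_t) * x_0) / math.sqrt(1.0 - alpha_bar_t)


def mean_from_eps(x_t, eps, beta_t, alpha_bar_t):
    """Reverse-step mean expressed through the noise (epsilon-parameterization)."""
    alpha_t = 1.0 - beta_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)


def mean_from_x0(x_t, x_0, beta_t, alpha_bar_prev, alpha_bar_t):
    """Posterior mean of Eq. 88, expressed through x_0."""
    alpha_t = 1.0 - beta_t
    return (
        math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t) * x_0
        + (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * math.sqrt(alpha_t) * x_t
    )


beta_t, alpha_bar_prev = 0.02, 0.5  # illustrative values
alpha_bar_t = alpha_bar_prev * (1.0 - beta_t)
x_0, eps = 1.0, -0.7
x_t = noisy_sample(x_0, eps, alpha_bar_t)
eps_recovered = eps_from_x0(x_t, x_0, alpha_bar_t)
mean_eps = mean_from_eps(x_t, eps, beta_t, alpha_bar_t)
mean_x0 = mean_from_x0(x_t, x_0, beta_t, alpha_bar_prev, alpha_bar_t)
print(eps_recovered, mean_eps, mean_x0)
```

The agreement of `mean_eps` and `mean_x0` shows that the two parameterizations describe the same reverse-step mean and differ only in which quantity the network predicts.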

We can derive a closed form of $L_{t-1}$

$$
L_{t-1}(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{2\sigma_t^2}\frac{\beta_t^2}{\alpha_t(1-\bar{\alpha}_t)}\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\|_2^2 + C
$$
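The intermediate steps behind this closed form can be spelled out as follows. This is a sketch assuming both distributions are Gaussians with isotropic covariance as defined above, with $C$ collecting all terms that do not depend on $\theta$, and with $\mathbf{x}_0$ in Eq. 88 rewritten through $\boldsymbol{\epsilon}$ using the reparameterization.

```latex
\begin{aligned}
L_{t-1}
&= D_{\text{KL}}\big(\mathcal{N}(\tilde{\boldsymbol{\mu}}, \tilde{\beta}_t\mathbf{I}) \,\|\, \mathcal{N}(\boldsymbol{\mu}_\theta, \sigma_t^2\mathbf{I})\big)
 = \frac{1}{2\sigma_t^2}\big\|\tilde{\boldsymbol{\mu}}(\mathbf{x}_t,\mathbf{x}_0) - \boldsymbol{\mu}_\theta(\mathbf{x}_t,t)\big\|_2^2 + C, \\
\tilde{\boldsymbol{\mu}}(\mathbf{x}_t,\mathbf{x}_0)
&= \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\boldsymbol{\epsilon}\Big), \\
\tilde{\boldsymbol{\mu}} - \boldsymbol{\mu}_\theta
&= \frac{\beta_t}{\sqrt{\alpha_t}\sqrt{1-\bar{\alpha}_t}}\big(\boldsymbol{\epsilon}_\theta - \boldsymbol{\epsilon}\big)
\;\Rightarrow\;
L_{t-1} = \frac{1}{2\sigma_t^2}\frac{\beta_t^2}{\alpha_t(1-\bar{\alpha}_t)}\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\|_2^2 + C.
\end{aligned}
```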

If $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$, using the signal-to-noise ratio (SNR) defined in Kingma et al. (2021), $\text{SNR}(t) = \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$, the loss can be derived as

$$
L_{t-1}(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{2}\Big(\frac{\text{SNR}(t-1)}{\text{SNR}(t)} - 1\Big)\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta\|_2^2 + C
$$

A.3.2 Sample Prediction Loss ($\mathbf{x}_0$-parameterization)

Similar to Eq. 88, consider the parameterization (Kingma et al., 2021),

$$
\boldsymbol{\mu}_\theta(\mathbf{x}_t, t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1-\bar{\alpha}_t}\mathbf{x}_{0,\theta} + \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\sqrt{\alpha_t}\,\mathbf{x}_t
$$

We can derive a closed form of $L_{t-1}$

$$
L_{t-1}(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{2\sigma_t^2}\frac{\bar{\alpha}_{t-1}\beta_t^2}{(1-\bar{\alpha}_t)^2}\|\mathbf{x}_{0,\theta} - \mathbf{x}_0\|_2^2 + C.
$$

If $\sigma_t^2 = \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t$, using the signal-to-noise ratio (SNR) defined in Kingma et al. (2021), $\text{SNR}(t) = \frac{\bar{\alpha}_t}{1-\bar{\alpha}_t}$, the loss can be derived as

$$
L_{t-1}(\mathbf{x}_t, \mathbf{x}_0) = \frac{1}{2}(\text{SNR}(t-1) - \text{SNR}(t))\|\mathbf{x}_{0,\theta} - \mathbf{x}_0\|_2^2 + C.
$$
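The SNR forms of the loss weights can be verified numerically. The sketch below evaluates the general closed-form weights with $\sigma_t^2 = \tilde{\beta}_t$ and compares them with the SNR expressions, keeping the factor $\frac{1}{2}$ that comes from $\frac{1}{2\sigma_t^2}$ explicit in both parameterizations. Variable names and numerical values are illustrative assumptions.

```python
def snr(alpha_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)


beta_t, alpha_bar_prev = 0.02, 0.5  # illustrative values
alpha_t = 1.0 - beta_t
alpha_bar_t = alpha_bar_prev * alpha_t
sigma_sq = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t  # beta_tilde_t

# Weight of ||x_0_theta - x_0||^2: general closed form versus SNR difference
x0_weight_general = alpha_bar_prev * beta_t**2 / (2.0 * sigma_sq * (1.0 - alpha_bar_t) ** 2)
x0_weight_snr = 0.5 * (snr(alpha_bar_prev) - snr(alpha_bar_t))

# Weight of ||eps - eps_theta||^2: general closed form versus SNR ratio
eps_weight_general = beta_t**2 / (2.0 * sigma_sq * alpha_t * (1.0 - alpha_bar_t))
eps_weight_snr = 0.5 * (snr(alpha_bar_prev) / snr(alpha_bar_t) - 1.0)

print(x0_weight_general, x0_weight_snr)
print(eps_weight_general, eps_weight_snr)
```

Because the SNR decreases with $t$, both weights are positive; the $\mathbf{x}_0$ weight scales with the SNR difference between consecutive steps, while the $\boldsymbol{\epsilon}$ weight scales with their ratio.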

A.4 Training

Empirically, instead of using the variational lower bound, the neural network can be trained on one of the following simplified losses (Ho et al., 2020)

$$
\begin{aligned}
L_{\text{simple},\boldsymbol{\epsilon}_t}(\theta) &= \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}_t}\|\boldsymbol{\epsilon}_t - \boldsymbol{\epsilon}_{t,\theta}\|_2^2 = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}_t} L(\boldsymbol{\epsilon}_t, \boldsymbol{\epsilon}_{t,\theta}), && (\boldsymbol{\epsilon}\text{-parameterization}) \\
L_{\text{simple},\mathbf{x}_0}(\theta) &= \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}_t}\|\mathbf{x}_0 - \mathbf{x}_{0,\theta}\|_2^2 = \mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}_t} L(\mathbf{x}_0, \mathbf{x}_{0,\theta}). && (\mathbf{x}_0\text{-parameterization})
\end{aligned}
$$

with $t$ uniformly sampled from $1$ to $T$ and $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. $L(\cdot,\cdot)$ is a loss function in the space of $\mathbf{x}$. With the importance sampling proposed in Nichol and Dhariwal (2021), $t$ can be sampled with a probability proportional to $\mathbb{E}_{\mathbf{x}_0,\mathbf{x}_t} L(\mathbf{x}_0, \hat{\mathbf{x}}_0)$. In other words, a time step $t$ is sampled more often if the loss is larger.
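The loss-proportional sampling of $t$ described above can be sketched as follows. This is a simplified illustration that tracks a running mean of the loss per time step and falls back to uniform sampling until every step has been observed; the class name `LossAwareTimestepSampler` and this warm-up rule are assumptions made for the example, not details taken from this paper or from Nichol and Dhariwal (2021).

```python
import random


class LossAwareTimestepSampler:
    """Sample t in 1..T with probability proportional to its mean observed loss."""

    def __init__(self, num_steps, seed=0):
        self.num_steps = num_steps
        self.loss_sum = [0.0] * num_steps
        self.count = [0] * num_steps
        self.rng = random.Random(seed)

    def update(self, t, loss):
        """Record a loss value observed at time step t (1-indexed)."""
        self.loss_sum[t - 1] += loss
        self.count[t - 1] += 1

    def probabilities(self):
        """Return sampling probabilities; uniform until every step has a loss estimate."""
        uniform = [1.0 / self.num_steps] * self.num_steps
        if min(self.count) == 0:
            return uniform
        means = [s / c for s, c in zip(self.loss_sum, self.count)]
        total = sum(means)
        if total <= 0.0:
            return uniform
        return [m / total for m in means]

    def sample(self):
        """Draw one time step according to the current probabilities."""
        steps = range(1, self.num_steps + 1)
        return self.rng.choices(steps, weights=self.probabilities(), k=1)[0]


sampler = LossAwareTimestepSampler(num_steps=4, seed=0)
for step, loss in [(1, 1.0), (2, 1.0), (3, 1.0), (4, 5.0)]:
    sampler.update(step, loss)
probs = sampler.probabilities()
draws = [sampler.sample() for _ in range(1000)]
print(probs, draws.count(4))
```

In this toy example the step with the largest recorded loss receives the largest sampling probability, so it is drawn most often.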

As previous work (Fu et al., 2023) has extensively compared the $\boldsymbol{\epsilon}$-parameterization and the $\mathbf{x}_0$-parameterization, as well as the benefits of including a Dice loss, in this work we use the $\mathbf{x}_0$-parameterization with a weighted sum of cross-entropy and foreground-only Dice loss (Kirillov et al., 2023).
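A weighted sum of cross-entropy and foreground-only Dice loss can be sketched on flattened per-pixel class probabilities as below. The weights (`ce_weight`, `dice_weight`), the soft Dice formulation, the treatment of class 0 as background, and the list-based data layout are all assumptions made for illustration; the exact formulation used in this work may differ.

```python
import math


def cross_entropy(probs, labels, eps=1e-8):
    """Mean negative log-likelihood; probs[i][c] is the class-c probability at pixel i."""
    return -sum(math.log(max(p[y], eps)) for p, y in zip(probs, labels)) / len(labels)


def foreground_dice_loss(probs, labels, num_classes, eps=1e-8):
    """One minus the mean soft Dice over classes 1..C-1 (class 0 assumed background)."""
    losses = []
    for c in range(1, num_classes):
        intersection = sum(p[c] for p, y in zip(probs, labels) if y == c)
        denominator = sum(p[c] for p in probs) + sum(1 for y in labels if y == c)
        losses.append(1.0 - (2.0 * intersection + eps) / (denominator + eps))
    return sum(losses) / len(losses)


def segmentation_loss(probs, labels, num_classes, ce_weight=1.0, dice_weight=1.0):
    """Weighted sum of cross-entropy and foreground-only Dice loss (weights assumed)."""
    return (
        ce_weight * cross_entropy(probs, labels)
        + dice_weight * foreground_dice_loss(probs, labels, num_classes)
    )


labels = [0, 1, 1, 0]
perfect = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
uncertain = [[0.5, 0.5]] * 4
loss_perfect = segmentation_loss(perfect, labels, num_classes=2)
loss_uncertain = segmentation_loss(uncertain, labels, num_classes=2)
print(loss_perfect, loss_uncertain)
```

A perfect prediction gives a loss of zero, while a maximally uncertain prediction is penalized by both terms.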

A.5 Variance Resampling

Given a variance schedule $\{\beta_t\}_{t=1}^{T}$ (e.g. $T=1001$), a subsequence $\{\beta_k\}_{k=1}^{K}$ (e.g. $K=5$) can be sampled with $\{t_k\}_{k=1}^{K}$. Following Nichol and Dhariwal (2021), we define $\beta_k=1-\frac{\bar{\alpha}_{t_k}}{\bar{\alpha}_{t_{k-1}}}$; then $\alpha_k=1-\beta_k$ and $\bar{\alpha}_k=\prod_{s=1}^{k}\alpha_s$ can be recalculated correspondingly. In this work, $t_k$ is uniformly downsampled. For instance, if $T=1001$ and $K=5$, then $\{t_k\}_{k=1}^{K}=\{1,251,501,751,1001\}$.
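The resampling rule can be checked numerically with a short sketch. The function names are hypothetical, the linear training schedule in the usage lines is only an example, and the convention $\bar{\alpha}_{t_0}=1$ for the first resampled step is an assumption.

```python
def uniform_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, 1-indexed time steps from t_1 = 1 to t_K = T (needs K >= 2)."""
    stride = (num_train_steps - 1) / (num_sample_steps - 1)
    return [1 + round(k * stride) for k in range(num_sample_steps)]


def resample_betas(betas, timesteps):
    """Compute beta_k = 1 - alpha_bar[t_k] / alpha_bar[t_{k-1}] for a subsequence."""
    alpha_bar = []
    running = 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)  # alpha_bar[t - 1] holds the value for step t
    new_betas = []
    previous = 1.0  # assumption: alpha_bar at t_0 is taken to be 1
    for t in timesteps:
        current = alpha_bar[t - 1]
        new_betas.append(1.0 - current / previous)
        previous = current
    return new_betas


# Example only: a linear schedule with T = 1001 resampled to K = 5 steps.
T, K = 1001, 5
train_betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
sample_steps = uniform_timesteps(T, K)
sample_betas = resample_betas(train_betas, sample_steps)
```

By construction, the cumulative product of the resampled $\alpha_k$ equals the original $\bar{\alpha}_{t_k}$ at every retained step, so the total amount of noise at $t_K=T$ is unchanged.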

B Denoising Diffusion Implicit Model

Definition

Song et al. (2020a) parameterize $q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0)$ as follows, with $\bm{\epsilon}=\frac{\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1-\bar{\alpha}_t}}$,

\begin{align}
q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\bm{\epsilon},\sigma_t^2\mathbf{I}).
\end{align}

For any variance schedule $\sigma_t$, this formulation ensures $q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I})$. In particular, if $\sigma_t^2=\tilde{\beta}_t$, this represents DDPM. If $\sigma_t=0$ for $t>1$ and $\sigma_1=\sqrt{\tilde{\beta}_1}$, the model is deterministic and is called the denoising diffusion implicit model (DDIM).
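The reason the marginal is preserved can be seen in one step. The following is a short worked argument rather than a full proof. It writes the mean coefficient as $\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}$, as in Song et al. (2020a), and assumes $\sigma_t^2\le 1-\bar{\alpha}_{t-1}$. If $\mathbf{x}_t\sim q(\mathbf{x}_t\mid\mathbf{x}_0)$, then $\bm{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$, and with an independent $\mathbf{z}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$,

\begin{align*}
\mathbf{x}_{t-1} &= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\,\bm{\epsilon}+\sigma_t\mathbf{z},\\
\operatorname{Var}(\mathbf{x}_{t-1}\mid\mathbf{x}_0) &= (1-\bar{\alpha}_{t-1}-\sigma_t^2)\mathbf{I}+\sigma_t^2\mathbf{I}=(1-\bar{\alpha}_{t-1})\mathbf{I},
\end{align*}

so that $q(\mathbf{x}_{t-1}\mid\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0,(1-\bar{\alpha}_{t-1})\mathbf{I})$ regardless of the choice of $\sigma_t$.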

Inference

For DDIM, at inference time, the denoising starts with Gaussian noise $\mathbf{x}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ and the data is denoised step-by-step for $t=T,\cdots,1$:

\begin{align}
p_\theta(\mathbf{x}_{t-1}\mid\mathbf{x}_t)=\begin{cases}\mathcal{N}(\hat{\mathbf{x}}_0,\sigma_1^2\mathbf{I}) & t=1\\ q(\mathbf{x}_{t-1}\mid\mathbf{x}_t,\mathbf{x}_{0,\theta}(\mathbf{x}_t,t)) & t>1\end{cases}
\end{align}
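A single denoising step for $t>1$ can be sketched element-wise in plain Python. This is an illustrative sketch, not the released JAX code: `ddim_step` is a hypothetical name, `x0_pred` stands for the network output $\mathbf{x}_{0,\theta}(\mathbf{x}_t,t)$ and is assumed to be given, and the coefficient of the noise term uses $\sigma_t^2$ as in Song et al. (2020a).

```python
import math
import random


def ddim_step(x_t, x0_pred, alpha_bar_t, alpha_bar_prev, sigma_t, rng):
    """Sample x_{t-1} from q(x_{t-1} | x_t, x0_pred), element by element.

    Requires sigma_t ** 2 <= 1 - alpha_bar_prev.
    """
    x_prev = []
    for x, x0 in zip(x_t, x0_pred):
        # Noise implied by the current sample and the predicted clean signal.
        eps = (x - math.sqrt(alpha_bar_t) * x0) / math.sqrt(1.0 - alpha_bar_t)
        mean = (math.sqrt(alpha_bar_prev) * x0
                + math.sqrt(1.0 - alpha_bar_prev - sigma_t ** 2) * eps)
        # sigma_t = 0 gives the deterministic DDIM update.
        x_prev.append(mean + sigma_t * rng.gauss(0.0, 1.0))
    return x_prev
```

With `sigma_t=0.0` the step only rescales the predicted clean signal and the implied noise to the noise level of step $t-1$, which is why the trajectory is fully determined by $\mathbf{x}_T$ and the network predictions.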

C Self-conditioning

The self-conditioning methods proposed in Chen et al. (2022b) (“Diff. sc. $\mathbf{x}_t$” in Equation 89) and Watson et al. (2023) (“Diff. sc. $\mathbf{x}_{t+1}$” in Equation 90) are illustrated below.

\begin{align*}
\mathbf{x}_t &\sim \mathcal{N}(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I}), && (\text{sc. }\mathbf{x}_t,\ \text{step 1, sampling}) \tag{89a}\\
\hat{\mathbf{x}}_0 &= \text{StopGradient}(f_\theta(I,t,\mathbf{x}_t,\mathbf{0})), && (\text{sc. }\mathbf{x}_t,\ \text{step 1, prediction}) \tag{89b}\\
\hat{\mathbf{x}}_0 &= \text{Dropout}_{p=50\%}(\hat{\mathbf{x}}_0), && (\text{sc. }\mathbf{x}_t,\ \text{step 2, dropout}) \tag{89c}\\
\hat{\mathbf{x}}_0 &= f_\theta(I,t,\mathbf{x}_t,\hat{\mathbf{x}}_0), && (\text{sc. }\mathbf{x}_t,\ \text{step 2, prediction}) \tag{89d}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_0,\mathbf{x}_t}L(\mathbf{x}_0,\hat{\mathbf{x}}_0), && (\text{loss calculation}) \tag{89e}
\end{align*}

\begin{align*}
\mathbf{x}_{t+1} &\sim \mathcal{N}(\mathbf{x}_{t+1};\sqrt{\bar{\alpha}_{t+1}}\mathbf{x}_0,(1-\bar{\alpha}_{t+1})\mathbf{I}), && (\text{sc. }\mathbf{x}_{t+1},\ \text{step 1, sampling}) \tag{90a}\\
\hat{\mathbf{x}}_0 &= \text{StopGradient}(f_\theta(I,t+1,\mathbf{x}_{t+1},\mathbf{0})), && (\text{sc. }\mathbf{x}_{t+1},\ \text{step 1, prediction}) \tag{90b}\\
\hat{\mathbf{x}}_0 &= \text{Dropout}_{p=50\%}(\hat{\mathbf{x}}_0), && (\text{sc. }\mathbf{x}_{t+1},\ \text{step 2, dropout}) \tag{90c}\\
\mathbf{x}_t &\sim \mathcal{N}(\mathbf{x}_t;\tilde{\bm{\mu}},\tilde{\beta}_{t+1}\mathbf{I}), && (\text{sc. }\mathbf{x}_{t+1},\ \text{step 2, sampling}) \tag{90d}\\
\tilde{\bm{\mu}} &= \frac{\sqrt{\bar{\alpha}_t}\beta_{t+1}}{1-\bar{\alpha}_{t+1}}\mathbf{x}_0+\frac{1-\bar{\alpha}_t}{1-\bar{\alpha}_{t+1}}\sqrt{\alpha_{t+1}}\mathbf{x}_{t+1} \\
\hat{\mathbf{x}}_0 &= f_\theta(I,t,\mathbf{x}_t,\hat{\mathbf{x}}_0), && (\text{sc. }\mathbf{x}_{t+1},\ \text{step 2, prediction}) \tag{90e}\\
L_{\text{denoising}}(\theta) &= \mathbb{E}_{t,\mathbf{x}_0,\mathbf{x}_t}L(\mathbf{x}_0,\hat{\mathbf{x}}_0), && (\text{loss calculation}) \tag{90f}
\end{align*}
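The posterior used in the sampling step (90d) can be written out for scalars as a quick sanity check. This is a sketch with a hypothetical function name. It takes the variance to be the standard DDPM posterior variance $\tilde{\beta}_{t+1}=\frac{1-\bar{\alpha}_t}{1-\bar{\alpha}_{t+1}}\beta_{t+1}$ and recovers $\alpha_{t+1}$ from the ratio of the cumulative products.

```python
import math


def posterior_mean_variance(x0, x_next, alpha_bar_t, alpha_bar_next):
    """Mean and variance of q(x_t | x_{t+1}, x_0) for scalar inputs."""
    alpha_next = alpha_bar_next / alpha_bar_t  # alpha_{t+1}
    beta_next = 1.0 - alpha_next               # beta_{t+1}
    coef_x0 = math.sqrt(alpha_bar_t) * beta_next / (1.0 - alpha_bar_next)
    coef_x_next = (1.0 - alpha_bar_t) * math.sqrt(alpha_next) / (1.0 - alpha_bar_next)
    mean = coef_x0 * x0 + coef_x_next * x_next
    variance = (1.0 - alpha_bar_t) / (1.0 - alpha_bar_next) * beta_next  # beta tilde
    return mean, variance
```

For a noise-free input $\mathbf{x}_{t+1}=\sqrt{\bar{\alpha}_{t+1}}\mathbf{x}_0$ the mean reduces to $\sqrt{\bar{\alpha}_t}\mathbf{x}_0$, which is the mean of $q(\mathbf{x}_t\mid\mathbf{x}_0)$.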

D Diffusion Noise Schedule

The noise schedule $\beta_t$ and $\sqrt{\bar{\alpha}_t}$ are visualised in Figure 2. The cross entropy and Dice score between $\mathbf{x}_t$ and the ground truth $\mathbf{x}_0$ are also visualised to empirically measure the amount of information about the ground truth $\mathbf{x}_0$ contained in $\mathbf{x}_t$.

Figure 2: Information contained in $\mathbf{x}_t$. Cross entropy and Dice score between $\mathbf{x}_t$ and ground truth $\mathbf{x}_0$ are used to empirically measure the amount of information about the ground truth $\mathbf{x}_0$ contained in $\mathbf{x}_t$. The dashed line represents the information contained in the sampled noise (between noise and ground truth $\mathbf{x}_0$), which is considered to be the limit. The values are calculated using the sample “005095” in the prostate MR data set.
Panels: (a) $\beta_t$; (b) $\sqrt{\bar{\alpha}_t}$; (c) Cross entropy; (d) Dice score.

E Implementation Details

Table 6: Training Hyper-parameters

| Parameter | Value |
|---|---|
| Optimiser | AdamW (b1=0.9, b2=0.999, weight_decay=1E-8) |
| Learning Rate Warmup | 100 steps |
| Learning Rate Decay | 10,000 steps |
| Learning Rate Values | Initial = 1E-5, Peak = 8E-4, End = 5E-5 |
| Batch size | 256 for Muscle Ultrasound and 8 for other data sets |
| Number of samples | 320K for Muscle Ultrasound and 100K for other data sets |
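The learning-rate values in Table 6 can be turned into a schedule function. The table does not state the shape of the decay, so the sketch below assumes a linear warmup followed by a cosine decay, and it assumes that the 10,000 decay steps start after the warmup. Both are illustrative choices and may differ from the released configuration.

```python
import math


def learning_rate(step, init=1e-5, peak=8e-4, end=5e-5,
                  warmup_steps=100, decay_steps=10_000):
    """Linear warmup to `peak`, then cosine decay to `end` (decay shape assumed)."""
    if step < warmup_steps:
        return init + (peak - init) * step / warmup_steps
    # Fraction of the decay phase completed, clipped to 1 after the decay ends.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return end + (peak - end) * cosine
```

Under these assumptions the rate starts at 1E-5, reaches 8E-4 at step 100, and stays at 5E-5 from step 10,100 onwards.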
Table 7: Network Size

| Dimension | Method | Transformer | |
|---|---|---|---|
| 2D | No diff. | 12,586,594 | 10,550,370 |
| 2D | Diff. | 13,335,554 | 11,299,330 |
| 3D | No diff. | 33,385,154 | 31,283,394 |
| 3D | Diff. | 34,135,266 | 32,033,506 |

F Results

F.1 Diffusion Training Strategy Comparison

Table 8: Per class Dice score comparison. “No diff.” represents the non-diffusion model. “Diff. rec. $\mathbf{x}_T$” represents the diffusion model with the proposed recycling. “Ensemble” represents the model averaging the probabilities from “No diff.” and “Diff. rec. $\mathbf{x}_T$”. The inference sampler is DDPM. The best results are in bold and an underline indicates that the difference from the non-diffusion model is significant with $p$-value < 0.05.

| Method | Spleen | RT kidney | LT kidney | Gall bladder |
|---|---|---|---|---|
| No diff. | 96.62 ± 1.87 | 95.08 ± 10.74 | 96.29 ± 1.73 | 78.83 ± 27.82 |
| Diff. rec. $\mathbf{x}_T$ | 96.40 ± 2.42 | 96.24 ± 1.90 | 96.27 ± 1.53 | 76.68 ± 29.25 |
| Ensemble | 96.78 ± 1.75 | 96.47 ± 2.44 | 96.50 ± 1.51 | 79.65 ± 27.29 |

| Method | Esophagus | Liver | Stomach | Aorta |
|---|---|---|---|---|
| No diff. | 83.22 ± 11.08 | 97.36 ± 1.17 | 90.53 ± 14.78 | 94.65 ± 4.22 |
| Diff. rec. $\mathbf{x}_T$ | 83.60 ± 10.32 | 97.33 ± 1.13 | 90.77 ± 14.46 | 94.66 ± 4.66 |
| Ensemble | 84.10 ± 11.15 | 97.54 ± 1.05 | 91.07 ± 14.91 | 94.96 ± 4.39 |

| Method | Postcava | Pancreas | Right adrenal gland | Left adrenal gland |
|---|---|---|---|---|
| No diff. | 90.45 ± 4.68 | 84.88 ± 11.40 | 77.80 ± 9.46 | 77.98 ± 11.95 |
| Diff. rec. $\mathbf{x}_T$ | 90.55 ± 4.19 | 84.86 ± 11.15 | 76.63 ± 12.84 | 78.01 ± 11.60 |
| Ensemble | 91.18 ± 4.12 | 85.85 ± 11.12 | 78.51 ± 10.58 | 78.95 ± 11.45 |

| Method | Duodenum | Bladder | Prostate/uterus |
|---|---|---|---|
| No diff. | 79.57 ± 14.89 | 88.09 ± 16.25 | 82.35 ± 18.90 |
| Diff. rec. $\mathbf{x}_T$ | 79.80 ± 15.14 | 87.90 ± 16.65 | 81.90 ± 18.86 |
| Ensemble | 80.99 ± 15.07 | 88.61 ± 16.59 | 83.06 ± 18.68 |

(a) Abdominal CT: LT and RT stand for left and right, respectively.

| Method | Bladder | Bone | Obturator internus | Transition zone |
|---|---|---|---|---|
| No diff. | 93.28 ± 9.90 | 93.12 ± 5.68 | 88.95 ± 3.53 | 79.61 ± 8.37 |
| Diff. rec. $\mathbf{x}_T$ | 93.57 ± 9.61 | 93.84 ± 5.85 | 89.15 ± 3.62 | 79.79 ± 8.36 |
| Ensemble | 93.66 ± 9.84 | 93.77 ± 5.52 | 89.52 ± 3.51 | 80.57 ± 8.20 |

| Method | Central gland | Rectum | Seminal vesicle | NV bundle |
|---|---|---|---|---|
| No diff. | 88.75 ± 5.60 | 93.30 ± 3.48 | 77.55 ± 10.99 | 67.17 ± 14.34 |
| Diff. rec. $\mathbf{x}_T$ | 89.13 ± 5.78 | 93.42 ± 3.51 | 78.39 ± 9.71 | 67.07 ± 15.50 |
| Ensemble | 89.45 ± 5.56 | 93.70 ± 3.37 | 78.91 ± 10.28 | 68.01 ± 14.85 |

(b) Prostate MR: Dice score per class. NV stands for neurovascular.
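The “Ensemble” rows average the class probabilities of the two models. A minimal sketch of that averaging for a flattened list of pixels is given below. The function name is hypothetical, and equal weighting followed by an arg-max over classes is an assumption about how the averaged probabilities are turned into labels.

```python
def ensemble_predict(probs_a, probs_b):
    """Average two per-pixel class-probability maps and return arg-max labels."""
    labels = []
    for pixel_a, pixel_b in zip(probs_a, probs_b):
        mean = [(a + b) / 2.0 for a, b in zip(pixel_a, pixel_b)]
        # Index of the class with the highest averaged probability.
        labels.append(max(range(len(mean)), key=mean.__getitem__))
    return labels


# Illustrative probabilities for two pixels and two classes.
no_diff_probs = [[0.6, 0.4], [0.3, 0.7]]
diff_rec_probs = [[0.2, 0.8], [0.4, 0.6]]
ensemble_labels = ensemble_predict(no_diff_probs, diff_rec_probs)
```

In the first pixel the two models disagree, and the more confident prediction decides the ensemble label.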
Table 9: Per class Hausdorff distance comparison. “No diff.” represents the non-diffusion model. “Diff. rec. $\mathbf{x}_T$” represents the diffusion model with the proposed recycling. “Ensemble” represents the model averaging the probabilities from “No diff.” and “Diff. rec. $\mathbf{x}_T$”. The inference sampler is DDPM. The best results are in bold and an underline indicates that the difference from the non-diffusion model is significant with $p$-value < 0.05.

| Method | Spleen | Right kidney | Left kidney | Gall bladder |
|---|---|---|---|---|
| No diff. | 3.22 ± 4.91 | 1.97 ± 1.46 | 4.13 ± 10.83 | 9.23 ± 16.71 |
| Diff. rec. $\mathbf{x}_T$ | 2.86 ± 3.84 | 1.93 ± 0.83 | 3.13 ± 8.35 | 12.65 ± 21.74 |
| Ensemble | 2.89 ± 4.28 | 1.84 ± 1.11 | 2.70 ± 5.86 | 9.57 ± 18.86 |

| Method | Esophagus | Liver | Stomach | Aorta |
|---|---|---|---|---|
| No diff. | 5.50 ± 6.81 | 3.50 ± 2.50 | 8.96 ± 13.99 | 6.62 ± 14.52 |
| Diff. rec. $\mathbf{x}_T$ | 5.30 ± 6.41 | 3.79 ± 4.16 | 9.00 ± 14.03 | 5.41 ± 11.20 |
| Ensemble | 5.22 ± 6.63 | 3.06 ± 1.63 | 8.04 ± 12.87 | 5.47 ± 11.29 |

| Method | Postcava | Pancreas | Right adrenal gland | Left adrenal gland |
|---|---|---|---|---|
| No diff. | 4.80 ± 4.55 | 7.57 ± 8.62 | 4.39 ± 2.39 | 5.15 ± 5.40 |
| Diff. rec. $\mathbf{x}_T$ | 4.62 ± 3.09 | 7.50 ± 8.62 | 4.66 ± 3.14 | 4.87 ± 4.64 |
| Ensemble | 4.41 ± 3.25 | 6.96 ± 8.40 | 4.41 ± 2.79 | 4.82 ± 4.92 |

| Method | Duodenum | Bladder | Prostate/uterus |
|---|---|---|---|
| No diff. | 10.54 ± 8.44 | 9.10 ± 23.07 | 10.97 ± 19.01 |
| Diff. rec. $\mathbf{x}_T$ | 9.31 ± 7.13 | 10.70 ± 31.83 | 13.35 ± 32.75 |
| Ensemble | 9.29 ± 7.37 | 6.52 ± 10.34 | 9.14 ± 13.11 |

(a) Abdominal CT: Hausdorff distance per class.

| Method | Bladder | Bone | Obturator internus | Transition zone |
|---|---|---|---|---|
| No diff. | 3.30 ± 4.54 | 3.18 ± 9.77 | 4.60 ± 3.36 | 5.97 ± 4.97 |
| Diff. rec. $\mathbf{x}_T$ | 3.20 ± 4.12 | 2.21 ± 1.62 | 4.50 ± 3.46 | 6.25 ± 4.96 |
| Ensemble | 2.95 ± 3.48 | 2.32 ± 1.46 | 4.34 ± 3.29 | 6.18 ± 5.11 |

| Method | Central gland | Rectum | Seminal vesicle | NV bundle |
|---|---|---|---|---|
| No diff. | 3.94 ± 2.28 | 4.46 ± 5.69 | 4.82 ± 3.85 | 6.68 ± 6.33 |
| Diff. rec. $\mathbf{x}_T$ | 3.70 ± 1.93 | 4.25 ± 4.75 | 4.57 ± 2.66 | 6.55 ± 6.28 |
| Ensemble | 3.66 ± 1.93 | 4.16 ± 5.25 | 4.52 ± 2.83 | 6.45 ± 6.34 |

(b) Prostate MR: Hausdorff distance per class. NV stands for neurovascular.
Figure 3: Segmentation performance difference between the last step and first step using DDPM. “Diff.” represents standard diffusion. “Diff. sc. $\mathbf{x}_t$” and “Diff. sc. $\mathbf{x}_{t+1}$” represent self-conditioning from Chen et al. (2022b) and Watson et al. (2023), respectively. “Diff. rec. $\mathbf{x}_{t+1}$” and “Diff. rec. $\mathbf{x}_T$” represent recycling from Fu et al. (2023) and the proposed recycling in this work, respectively. The sampler is DDPM. DS and HD represent Dice score and Hausdorff distance, respectively. The difference is the value at the last step minus the value at the first step. A positive value for Dice score difference or a negative value for Hausdorff distance difference means improvement.
Panels: (a) DS difference for muscle ultrasound; (b) HD difference for muscle ultrasound; (c) DS difference for abdominal CT; (d) HD difference for abdominal CT; (e) DS difference for prostate MR; (f) HD difference for prostate MR; (g) DS difference for brain MR; (h) HD difference for brain MR.
Figure 4: Segmentation performance difference between the last step and first step using DDIM. “Diff.” represents standard diffusion. “Diff. sc. $\mathbf{x}_t$” and “Diff. sc. $\mathbf{x}_{t+1}$” represent self-conditioning from Chen et al. (2022b) and Watson et al. (2023), respectively. “Diff. rec. $\mathbf{x}_{t+1}$” and “Diff. rec. $\mathbf{x}_T$” represent recycling from Fu et al. (2023) and the proposed recycling in this work, respectively. The sampler is DDIM. DS and HD represent Dice score and Hausdorff distance, respectively. The difference is the value at the last step minus the value at the first step. A positive value for Dice score difference or a negative value for Hausdorff distance difference means improvement.
Panels: (a) DS difference for muscle ultrasound; (b) HD difference for muscle ultrasound; (c) DS difference for abdominal CT; (d) HD difference for abdominal CT; (e) DS difference for prostate MR; (f) HD difference for prostate MR; (g) DS difference for brain MR; (h) HD difference for brain MR.
Figure 5: Segmentation performance per step. “Diff.” represents standard diffusion. “Diff. sc. $\mathbf{x}_t$” and “Diff. sc. $\mathbf{x}_{t+1}$” represent self-conditioning from Chen et al. (2022b) and Watson et al. (2023), respectively. “Diff. rec. $\mathbf{x}_{t+1}$” and “Diff. rec. $\mathbf{x}_T$” represent recycling from Fu et al. (2023) and the proposed recycling in this work, respectively. The sampler is DDIM.
Panels: (a) Dice score for muscle ultrasound; (b) Hausdorff distance for muscle ultrasound; (c) Dice score for abdominal CT; (d) Hausdorff distance for abdominal CT; (e) Dice score for prostate MR; (f) Hausdorff distance for prostate MR; (g) Dice score for brain MR; (h) Hausdorff distance for brain MR.