Score-Based Generative Models for PET Image Reconstruction

Imraj RD Singh*,1, Alexander Denker*,2, Riccardo Barbano*,1, Željko Kereta1, Bangti Jin3, Kris Thielemans4, Peter Maass2, Simon Arridge1
*: Equal contribution, 1: Department of Computer Science, University College London, 2: Center for Industrial Mathematics, University of Bremen, 3: Department of Mathematics, Chinese University of Hong Kong, 4: Institute of Nuclear Medicine, University College London
Publication date: 2024/01/23
https://doi.org/10.59275/j.melba.2024-5d51

Abstract

Score-based generative models have demonstrated highly promising results for medical image reconstruction tasks in magnetic resonance imaging or computed tomography. However, their application to Positron Emission Tomography (PET) is still largely unexplored. PET image reconstruction involves a variety of challenges, including Poisson noise with high variance and a wide dynamic range. To address these challenges, we propose several PET-specific adaptations of score-based generative models. The proposed framework is developed for both 2D and 3D PET. In addition, we provide an extension to guided reconstruction using magnetic resonance images. We validate the approach through extensive 2D and 3D in-silico experiments with a model trained on patient-realistic data without lesions, and evaluate on data without lesions as well as out-of-distribution data with lesions. This demonstrates the proposed method’s robustness and significant potential for improved PET reconstruction.

Keywords

Positron Emission Tomography · Diffusion models · Score-based generative models · Image Reconstruction · 3D image reconstruction · Guided reconstruction

Bibtex
@article{melba:2024:001:singh,
  title   = "Score-Based Generative Models for PET Image Reconstruction",
  author  = "Singh, Imraj RD and Denker, Alexander and Barbano, Riccardo and Kereta, Željko and Jin, Bangti and Thielemans, Kris and Maass, Peter and Arridge, Simon",
  journal = "Machine Learning for Biomedical Imaging",
  volume  = "2",
  issue   = "Special Issue for Generative Models",
  year    = "2024",
  pages   = "547--585",
  issn    = "2766-905X",
  doi     = "https://doi.org/10.59275/j.melba.2024-5d51",
  url     = "https://melba-journal.org/2024:001"
}



1 Introduction

Positron Emission Tomography (PET) is a functional medical imaging technique for quantifying and visualising the distribution of a radio-tracer within the body, and is vital in clinical practice for accurate diagnosis, treatment planning, and monitoring of diseases. In a PET scan, radio-tracers are injected to probe a specific biological pathway of interest. Through the decay of the radio-tracer a positron is released, which upon annihilating with an electron produces a pair of coincident photons that travel in approximately anti-parallel directions. These emitted photons are detected and are then used to reconstruct the underlying radio-tracer distribution. The relationship between the measured emissions and the radio-tracer can be approximated with the Poisson noise model as

$$
\mathbf{y}\sim\text{Pois}(\bar{\mathbf{y}}),\quad \bar{\mathbf{y}}=\mathbf{A}\mathbf{x}+\bar{\mathbf{b}},
\tag{1}
$$

where $\bar{\mathbf{y}}\in\mathbb{R}^{m}$ is the expected value of the measurements ($m$ is the number of detector bins) and $\mathbf{x}\in\mathbb{R}^{n}$ is the discrete (voxel) basis representation of the tracer distribution ($n$ is the number of voxels). The system matrix $\mathbf{A}\in\mathbb{R}^{m\times n}$ includes approximate line integrals between detectors as well as physical phenomena such as photon attenuation, positron range, and detector sensitivity. It should be noted that 3D measurements detect pairs of photons between detector rings, i.e. they are not a stack of 2D measurements. The expected background $\bar{\mathbf{b}}\in\mathbb{R}^{m}$ is an estimate of scatter and random events (Qi and Leahy, 2006). The unique challenges that distinguish PET from other imaging modalities, e.g. Magnetic Resonance Imaging (MRI) and Computed Tomography (CT), include Poisson noise with a low mean number of counts and a widely varying dynamic range of images due to functional differences between patients.
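To make the measurement model concrete, the following is a minimal NumPy sketch of Eq. (1); the random toy matrix standing in for a real PET system matrix, the constant expected background, and all sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 64, 128                               # toy sizes: n image voxels, m detector bins
x = rng.uniform(0.0, 4.0, size=n)            # toy non-negative tracer distribution
A = rng.uniform(0.0, 1.0, size=(m, n)) / n   # stand-in for the PET system matrix
b_bar = np.full(m, 0.5)                      # expected background (scatter + randoms)

y_bar = A @ x + b_bar                        # expected measurements, Eq. (1)
y = rng.poisson(y_bar)                       # measured counts, y ~ Pois(y_bar)
```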

Most inverse problems in imaging are ill-posed, in the sense that the solution may not exist, not be unique, or not depend continuously on the measurement noise (Engl et al., 1996; Ito and Jin, 2015). To stabilise the reconstruction process, prior knowledge is often leveraged through a penalising functional that promotes solutions from a desirable image subset. The priors are typically hand-crafted to promote desired features in the reconstructed image, such as sparsity of edges (Rudin et al., 1992) or smoothness. Furthermore, if an additional image is available, e.g. with high resolution structural information, a suitable prior can promote common features between the two images, commonly referred to as guided reconstruction (Ehrhardt, 2021). In recent years, deep learning approaches have shown state-of-the-art performance in PET image reconstruction, see surveys (Reader et al., 2021; Pain et al., 2022). Existing approaches include post-processing (Kaplan and Zhu, 2019), to synthesise high-dose images from low-dose ones (which is akin to denoising), and deep unrolled optimisation (Mehranian and Reader, 2021; Guazzo and Colarieti-Tosti, 2021). However, these supervised approaches require large volumes of paired data that is often hard to acquire, and their performance may degrade greatly in the presence of distributional shift (Antun et al., 2020; Darestani et al., 2022).

In contrast, generative models require a dataset only of images of the target domain. These can, for example, be high-quality reconstructions acquired from prior scans. The aim of generative modelling is to approximate the image manifold of a given dataset (Bengio et al., 2013). There are a variety of methods for this task, e.g. generative adversarial networks (Goodfellow et al., 2014), variational autoencoders (Kingma and Welling, 2014) and, recently, Score-based Generative Models (SGMs), which aim to generate high-quality samples, sample quickly, and have adequate mode coverage (Xiao et al., 2022). Over recent years, SGMs have become the de facto method for image generation due to the quality and diversity of generated images (Dhariwal and Nichol, 2021). Generative models can be integrated into the reconstruction process as data-driven priors through the learnt image manifold, while remaining independent of the specifics of the forward model, cf. (Dimakis, 2022). For example, a generative model trained on PET brain data would not be appropriate for MRI reconstruction of knees, but would be appropriate for the reconstruction of other PET brain images. In the latter case, by suitably changing the forward model, the generative model could be used across scanners and noise levels.

SGMs have been applied to CT and MR image reconstruction (Song et al., 2022). These reconstructions condition the SGM image generation on measurements, and balance the consistency with measurements versus consistency with the SGM learnt image manifold (Kobler and Pock, 2023). There are different methods to enforce measurement consistency of the reconstructions, which can be broadly classified into gradient based methods (Jalal et al., 2021; Chung et al., 2023a) and projection based methods (Song et al., 2022; Chung and Ye, 2022; Chung et al., 2023b). Recently, denoising diffusion models (discrete variants of SGMs) were used for PET image denoising (Gong et al., 2022). Instead, our work focuses on PET image reconstruction, and we present the following contributions:

  • We develop a novel algorithmic framework building upon SGMs that carefully addresses the challenges inherent to PET. To do so, we modify the conditional sampling method (Chung et al., 2023b; Zhu et al., 2023), recently proposed for inverse problems with Gaussian noise, for PET image reconstruction. This is achieved with a penalised Maximum A Posteriori (MAP) estimator computed with an accelerated algorithm that evaluates subsets of the measurements.

  • We leverage additional MR images to enhance the proposed framework, leading to improved image quality that better agrees with the measured data.

  • We scale the approach to 3D PET reconstruction.

The proposed method is tested on multiple noise levels, radio-tracers, and in both 2D and 3D settings with an SGM trained on patient-realistic BrainWeb data without lesions (Collins et al., 1998). In addition to data without lesions, we test on out-of-distribution (OOD) data with lesions to validate the method's robustness. The rest of the paper is structured as follows. In Section 2 we provide the background on PET reconstruction and SGMs. In particular, we present different methods for using SGMs in image reconstruction. In Section 3 we propose the modifications needed to apply SGMs to PET reconstruction. We describe the experimental setting in Section 4, and present and discuss the results in Section 5. The code is publicly available on GitHub (https://github.com/Imraj-Singh/Score-Based-Generative-Models-for-PET-Image-Reconstruction) and the dataset on Zenodo (https://zenodo.org/records/10509379).

2 Background

2.1 Fundamentals of Positron Emission Tomography Reconstruction

PET measurements are the result of a low-count photon-counting process. The true forward process, from tracer distribution to photon detection, is approximated by the forward model defined in Eq. (1). The likelihood of the measured photon counts, for an unknown tracer distribution, can be modelled by an independent Poisson distribution. One of the first methods developed for estimating the tracer distribution through a Poisson model was maximum likelihood. This selects an image $\mathbf{x}\in\mathbb{R}_{\geq 0}^{n}$ by maximising the Poisson Log-Likelihood (PLL) function, given by

$$
L(\mathbf{y}|\mathbf{x})=\sum_{i=1}^{m}y_{i}\log([\mathbf{A}\mathbf{x}+\bar{\mathbf{b}}]_{i})-[\mathbf{A}\mathbf{x}+\bar{\mathbf{b}}]_{i}-\log(y_{i}!).
\tag{2}
$$

By maximising the PLL, the Maximum Likelihood Estimate (MLE) is obtained. A particularly important algorithm for computing the MLE is Expectation Maximisation (EM) (Shepp and Vardi, 1982). However, due to its slow convergence, acceleration is sought by splitting the PLL into a sum of $n_{\mathrm{sub}}\geq 1$ sub-objectives. This gives rise to the greatly sped-up Ordered Subset Expectation Maximisation (OSEM) (Hudson and Larkin, 1994) algorithm. Because of the ill-conditioning of PET reconstruction, the MLE tends to overfit to measurement noise. To address the ill-conditioning and improve reconstruction quality, it is common practice to regularise the reconstruction problem via the use of an image-based prior. This gives rise to the MAP objective

$$
\Phi(\mathbf{x})=L(\mathbf{y}|\mathbf{x})+\lambda R(\mathbf{x}),
\tag{3}
$$

where $R(\mathbf{x})$ is the log of a chosen image-based prior with penalty strength $\lambda$. Block-Sequential Regularised Expectation Maximisation (BSREM) (Pierro and Yamagishi, 2001; Ahn and Fessler, 2003) is an iterative algorithm, globally convergent under mild assumptions, that applies the subset idea of OSEM to the MAP objective. The objective is split as $\Phi(\mathbf{x})=\sum_{j=1}^{n_{\mathrm{sub}}}\Phi_{j}(\mathbf{x})$, where $\Phi_{j}(\mathbf{x})=L_{j}(\mathbf{y}|\mathbf{x})+\lambda R(\mathbf{x})/n_{\mathrm{sub}}$ is the $j$-th sub-objective and $L_{j}$ is the likelihood of a subset of the measurements. The BSREM update iterations are given by

$$
\mathbf{x}^{i+1}=P_{\mathbf{x}\geq 0}\left[\mathbf{x}^{i}+\alpha_{i}\mathbf{D}(\mathbf{x}^{i})\nabla\Phi_{j}(\mathbf{x}^{i})\right],\quad i\geq 0,
\tag{4}
$$

where $P_{\mathbf{x}\geq 0}[\cdot]$ denotes the non-negativity projection, $i$ is the iteration number, and the index $j=(i \bmod n_{\mathrm{sub}})+1$ cyclically accesses the sub-objectives. The preconditioner is $\mathbf{D}(\mathbf{x}^{i})=\mathrm{diag}\left(\max(\mathbf{x}^{i},\delta)/\mathbf{A}^{\top}\mathbf{1}\right)$, where $\mathbf{1}\in\mathbb{R}^{m}$ denotes the vector with all entries equal to 1, $\delta$ is a small positive constant to ensure positive definiteness, and $\mathbf{A}^{\top}$ is the matrix transpose. The quantity $\mathbf{A}^{\top}\mathbf{1}$ is referred to as the sensitivity image. The step sizes are $\alpha_{i}=\alpha_{0}/(\zeta\lfloor i/n_{\mathrm{sub}}\rfloor+1)$, where $\alpha_{0}=1$ and $\zeta$ is a relaxation coefficient. A common regulariser for PET reconstruction is the Relative Difference Prior (RDP) (Nuyts et al., 2002), see Appendix A.3 for details. The gradient of the RDP is scale-invariant as it is computed using the ratio of voxel values. This partially overcomes the issue with the wide dynamic range observed in emission tomography images, helping to simplify the choice of the penalty strength across noise levels.
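The following is a minimal NumPy sketch of a BSREM-style update as in Eq. (4), reusing the toy forward model above; the absence of a regulariser ($\lambda=0$) and the values of $\delta$, $\alpha_0$ and $\zeta$ are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, n_sub = 64, 128, 4
x_true = rng.uniform(0.0, 4.0, size=n)
A = rng.uniform(0.0, 1.0, size=(m, n)) / n
b_bar = np.full(m, 0.5)
y = rng.poisson(A @ x_true + b_bar)

subsets = np.array_split(np.arange(m), n_sub)   # ordered subsets of detector bins
sens = A.T @ np.ones(m)                         # sensitivity image A^T 1
delta, alpha0, zeta = 1e-4, 1.0, 0.1            # illustrative constants

def grad_pll(x, rows):
    """Gradient of the subset Poisson log-likelihood: A_j^T (y_j / (A_j x + b_j) - 1)."""
    Aj, yj, bj = A[rows], y[rows], b_bar[rows]
    return Aj.T @ (yj / (Aj @ x + bj) - 1.0)

x = np.ones(n)
for i in range(20 * n_sub):
    rows = subsets[i % n_sub]
    alpha = alpha0 / (zeta * (i // n_sub) + 1.0)        # relaxed step size
    precond = np.maximum(x, delta) / sens               # D(x) as in Eq. (4)
    grad = grad_pll(x, rows)                            # add lambda * grad_R(x) / n_sub if regularised
    x = np.maximum(x + alpha * precond * grad, 0.0)     # non-negativity projection
```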

2.2 Score-based Generative Models

SGMs have emerged as a state-of-the-art method for modelling, and sampling from, high-dimensional image distributions (Song et al., 2021c). They reinterpret denoising diffusion probabilistic modelling (Sohl-Dickstein et al., 2015; Ho et al., 2020) and score-matching Langevin dynamics (Song and Ermon, 2019) through the lens of Stochastic Differential Equations (SDE). SGMs are often formulated by prescribing a forward diffusion process defined by an Itô SDE

$$
d\mathbf{x}_{t}=\mathbf{f}(\mathbf{x}_{t},t)\,dt+g(t)\,d\mathbf{w}_{t},\quad \mathbf{x}_{0}\sim p_{0}:=\pi,
\tag{5}
$$

where $\{\mathbf{x}_{t}\}_{t\in[0,T]}$ is a stochastic process indexed by time $t$ and $\pi$ is the image distribution. Each random vector $\mathbf{x}_{t}$ has an associated time-dependent density $p(\mathbf{x}_{t})$. To emphasise that the density is a function of $t$ we write $p_{t}(\mathbf{x}_{t}):=p(\mathbf{x}_{t})$. The multivariate Wiener process $\{\mathbf{w}_{t}\}_{t\geq 0}$ is the standard Brownian motion. Starting at the image distribution $\pi$, the drift function $\mathbf{f}(\cdot,t):\mathbb{R}^{n}\to\mathbb{R}^{n}$ and the diffusion function $g:[0,T]\to\mathbb{R}$ are chosen such that the terminal distribution at $t=T$ approximates the standard Gaussian, $p_{T}\approx\mathcal{N}(0,I)$. Thus, the forward diffusion process maps the image distribution $\pi$ to a simple, tractable distribution. The aim of SGMs is to invert this process, i.e. to start at the Gaussian distribution and go back to the image distribution $\pi$. Under certain conditions on $\mathbf{f}$ and $g$, a reverse diffusion process can be defined (Anderson, 1982)

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})]\,dt+g(t)\,d\bar{\mathbf{w}}_{t},
\tag{6}
$$

that runs backwards in time. The Wiener process $\{\bar{\mathbf{w}}_{t}\}_{t\geq 0}$ is a time-reversed Brownian motion, and the term $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})$ is the score function. Denoising Score Matching (DSM) (Vincent, 2011) provides a methodology for estimating $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})$ by matching the transition densities $p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})$ with a time-conditional neural network $s_{\theta}(\mathbf{x}_{t},t)$, called the score model, parametrised by $\theta$. The resulting optimisation problem is given by

$$
\min_{\theta}\left\{L_{\text{DSM}}(\theta)=\mathbb{E}_{t\sim U[0,T]}\mathbb{E}_{\mathbf{x}_{0}\sim\pi}\mathbb{E}_{\mathbf{x}_{t}\sim p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\left[\omega_{t}\|s_{\theta}(\mathbf{x}_{t},t)-\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})\|_{2}^{2}\right]\right\},
\tag{7}
$$

where $\omega_{t}>0$ are weighting factors balancing the scores at different time steps. For general SDEs, the loss $L_{\text{DSM}}(\theta)$ may still be intractable, since it requires access to the transition density $p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})$. However, for SDEs with an affine linear drift function, $p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})$ is a Gaussian and can thus be given in closed form (Särkkä and Solin, 2019). Throughout the paper, we use $T=1$ and the variance preserving SDE (Ho et al., 2020) given by

$$
d\mathbf{x}_{t}=-\frac{\beta(t)}{2}\mathbf{x}_{t}\,dt+\sqrt{\beta(t)}\,d\mathbf{w}_{t},
\tag{8}
$$

where $\beta(t):[0,1]\to\mathbb{R}_{>0}$ is an increasing function defining the noise schedule. We use $\beta(t)=\beta_{\text{min}}+t(\beta_{\text{max}}-\beta_{\text{min}})$, giving the transition kernel $p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})=\mathcal{N}(\mathbf{x}_{t};\gamma_{t}\mathbf{x}_{0},\nu_{t}^{2}I)$ with coefficients $\gamma_{t},\nu_{t}\in\mathbb{R}$ computed from the drift and diffusion coefficients, see Appendix A.1 for details. Generating samples with the score model $s_{\theta}(\mathbf{x}_{t},t)$ as a surrogate requires solving the reverse SDE (6), with the score model $s_{\theta}(\mathbf{x}_{t},t)$ in place of $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})$

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}s_{\theta}(\mathbf{x}_{t},t)]\,dt+g(t)\,d\bar{\mathbf{w}}_{t}.
\tag{9}
$$

Drawing samples from the resulting generative model thus involves two steps: first, drawing a sample from the terminal distribution $\mathbf{x}_{1}\sim\mathcal{N}(0,I)\approx p_{1}$, and second, initialising the reverse SDE (9) with $\mathbf{x}_{1}$ and simulating backwards in time until $t=0$. The latter can be achieved by Euler-Maruyama schemes or predictor-corrector methods (Song et al., 2021c).
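As an illustration of this unconditional sampling, below is a minimal NumPy sketch of the Euler-Maruyama simulation of the reverse SDE (9) for the variance preserving SDE (8); the noise-schedule values and the score function (here the analytic score of a standard Gaussian, standing in for a trained $s_\theta$) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_min, beta_max, N = 0.1, 20.0, 1000     # illustrative schedule and number of steps

def beta(t):                                # linear noise schedule beta(t)
    return beta_min + t * (beta_max - beta_min)

def score_model(x, t):
    # placeholder for the trained score model s_theta(x_t, t); this is the
    # analytic score of a standard Gaussian, so the sampler is exact for pi = N(0, I)
    return -x

x = rng.standard_normal(64)                 # x_1 ~ N(0, I)
dt = -1.0 / N
for k in range(N, 0, -1):
    t = k / N
    drift = -0.5 * beta(t) * x - beta(t) * score_model(x, t)   # f - g^2 * s_theta
    z = rng.standard_normal(x.shape) if k > 1 else 0.0         # no noise on the last step
    x = x + drift * dt + np.sqrt(beta(t)) * np.sqrt(-dt) * z   # Euler-Maruyama step
```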

2.2.1 Denoising diffusion implicit models

Simulating the reverse SDE can be computationally expensive, as a fine time grid is often necessary to produce realistic samples. Denoising Diffusion Implicit Models (DDIMs) (Song et al., 2021a) were introduced to allow faster sampling, and build upon a result by Tweedie (Efron, 2011) to approximate the expectation $\mathbb{E}[\mathbf{x}_{0}|\mathbf{x}_{t}]$ via the score model $s_{\theta}(\mathbf{x}_{t},t)$ as

$$
\mathbb{E}[\mathbf{x}_{0}|\mathbf{x}_{t}]=\frac{\mathbf{x}_{t}+\nu_{t}^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})}{\gamma_{t}}\approx\frac{\mathbf{x}_{t}+\nu_{t}^{2}s_{\theta}(\mathbf{x}_{t},t)}{\gamma_{t}}:=\hat{\mathbf{x}}_{0}(\mathbf{x}_{t}),
\tag{10}
$$

where the positive scalars $\gamma_{t}$ and $\nu_{t}^{2}$ are the coefficients for the mean and covariance, respectively, defining the transition kernel, cf. Eq. (38) in the Appendix for details. DDIM defines a non-Markovian sampling rule, which uses both the current sample $\mathbf{x}_{t}$ and Tweedie's estimate $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t})$ to create an accelerated sampler. Let $0=t_{k_{1}}\leq\dots\leq t_{k_{N}}=1$ be the time discretisation. The DDIM sampling update can be written as

$$
\begin{split}
\mathbf{x}_{t_{k-1}}&=\gamma_{t_{k-1}}\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})+\text{Noise}(\mathbf{x}_{t_{k}},s_{\theta})+\eta_{t_{k}}\mathbf{z},\quad\mathbf{z}\sim\mathcal{N}(0,I)\\
&\text{with}\quad\text{Noise}(\mathbf{x}_{t_{k}},s_{\theta}):=-\nu_{t_{k}}\sqrt{\nu_{t_{k-1}}^{2}-\eta_{t_{k}}^{2}}\,s_{\theta}(\mathbf{x}_{t_{k}},t_{k}).
\end{split}
\tag{11}
$$

The sampling rule can be split into a denoising step (predicting $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$ using the score model) and a step adding an appropriate amount of noise back. Thus, the sampling mimics an iterative refinement process, as the prediction of the denoised estimate $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$ will be more accurate for smaller $t_{k}$. Different choices of $\eta_{t}$ result in different sampling schemes. We choose $\eta_{t_{k}}=\eta\beta_{t_{k}}$ with a hyperparameter $\eta\in[0,1]$, controlling the amount of stochasticity in the sampling, and $\beta_{t_{k}}=(\nu_{t_{k-1}}/\nu_{t_{k}})\sqrt{1-\gamma_{t_{k}}/\gamma_{t_{k-1}}}$ (Song et al., 2021a).
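A minimal NumPy sketch of the DDIM update in Eqs. (10)-(11) follows; the closed-form VP-SDE expressions for $\gamma_t$ and $\nu_t$, the coarse time grid, and the placeholder score model are assumptions made for illustration (the paper's coefficients are given in Appendix A.1).

```python
import numpy as np

rng = np.random.default_rng(0)
beta_min, beta_max = 0.1, 20.0

def gamma(t):   # mean coefficient of the VP-SDE transition kernel (assumed closed form)
    return np.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)

def nu(t):      # standard deviation of the VP-SDE transition kernel
    return np.sqrt(1.0 - gamma(t) ** 2)

def score_model(x, t):
    return -x   # placeholder for the trained s_theta; exact for pi = N(0, I)

eta, N = 0.2, 100                        # eta in [0, 1] controls the stochasticity
ts = np.linspace(1.0, 0.0, N + 1)        # coarse time grid t_{k_N} = 1, ..., t_{k_1} = 0
x = rng.standard_normal(64)              # x_1 ~ N(0, I)
for t_k, t_km1 in zip(ts[:-1], ts[1:]):
    s = score_model(x, t_k)
    x0_hat = (x + nu(t_k) ** 2 * s) / gamma(t_k)                       # Tweedie's estimate, Eq. (10)
    beta_k = nu(t_km1) / nu(t_k) * np.sqrt(max(1.0 - gamma(t_k) / gamma(t_km1), 0.0))
    eta_k = eta * beta_k
    noise_term = -nu(t_k) * np.sqrt(max(nu(t_km1) ** 2 - eta_k ** 2, 0.0)) * s
    z = rng.standard_normal(x.shape)
    x = gamma(t_km1) * x0_hat + noise_term + eta_k * z                 # DDIM update, Eq. (11)
```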

2.3 Using Score-based Generative Models for Inverse Problems

The goal of the Bayesian framework for inverse problems is to estimate the posterior $p^{\text{post}}(\mathbf{x}|\mathbf{y})$, i.e. the conditional distribution of images $\mathbf{x}$ given noisy measurements $\mathbf{y}$. Using Bayes' theorem the posterior can be factored into

$$
p^{\text{post}}(\mathbf{x}|\mathbf{y})\propto p^{\text{lkhd}}(\mathbf{y}|\mathbf{x})\,\pi(\mathbf{x}),\quad\nabla_{\mathbf{x}}\log p^{\text{post}}(\mathbf{x}|\mathbf{y})=\nabla_{\mathbf{x}}\log p^{\text{lkhd}}(\mathbf{y}|\mathbf{x})+\nabla_{\mathbf{x}}\log\pi(\mathbf{x}),
\tag{12}
$$

where $p^{\text{lkhd}}$ denotes the likelihood and $\pi$ the prior given by the image distribution. We can set up a generative model for the posterior in the same way as for the prior $\pi$ in Section 2.2, by defining a forward SDE which maps the posterior to random noise. To generate a sample from the posterior $p^{\text{post}}(\mathbf{x}|\mathbf{y})$, we can simulate the corresponding reverse SDE

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{y})]\,dt+g(t)\,d\bar{\mathbf{w}}_{t},
\tag{13}
$$

where we need access to the score of the time-dependent posterior $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{y})$. Similar to Eq. (12), we use Bayes' theorem for the score of the posterior and decompose $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{y})$ into a prior and a likelihood term, where the former is approximated with the trained score model

$$
\begin{split}
\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{y})&=\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t})+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})\\
&\approx s_{\theta}(\mathbf{x}_{t},t)+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t}).
\end{split}
\tag{14}
$$

Substituting the above approximation into (13), we obtain

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}(s_{\theta}(\mathbf{x}_{t},t)+\nabla_{\mathbf{x}_{t}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t}))]\,dt+g(t)\,d\bar{\mathbf{w}}_{t}.
\tag{15}
$$

We can recover approximate samples from the posterior $p^{\text{post}}(\mathbf{x}|\mathbf{y})$ by simulating the reverse SDE (15). Through repeated simulation of the reverse SDE with varying noise initialisations, we can estimate moments of the posterior distribution. As is common practice in the field (Song et al., 2022; Chung and Ye, 2022; Jalal et al., 2021), we use one sample for the reconstruction, due to the computational cost of repeatedly solving the reverse SDE. In addition to the score model $s_{\theta}(\mathbf{x}_{t},t)$, we need the score of the time-dependent likelihood $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})$. At the start of the forward SDE (at $t=0$), it is equal to the true likelihood $p^{\text{lkhd}}$. However, for $t>0$ the score $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})$ is intractable to compute exactly and different approximations have been proposed. In (Jalal et al., 2021; Ramzi et al., 2020), this term was approximated with the likelihood $p^{\text{lkhd}}$ evaluated at the noisy sample $\mathbf{x}_{t}$, with a time-dependent penalty strength $\lambda_{t}$

$$
\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})\approx\lambda_{t}\nabla_{\mathbf{x}}\log p^{\text{lkhd}}(\mathbf{y}|\mathbf{x}_{t}).
\tag{16}
$$

We refer to Eq. (16) as the Naive approximation. Diffusion Posterior Sampling (DPS) (Chung et al., 2023a) uses Tweedie's formula to obtain $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t})\approx\mathbb{E}[\mathbf{x}_{0}|\mathbf{x}_{t}]$ and approximates $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})$ by

$$
\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})\approx\nabla_{\mathbf{x}}\log p^{\text{lkhd}}(\mathbf{y}|\hat{\mathbf{x}}_{0}(\mathbf{x}_{t})),
\tag{17}
$$

where $\nabla_{\mathbf{x}}$ denotes taking the derivative in $\mathbf{x}_{t}$ (instead of $\hat{\mathbf{x}}_{0}$). It was shown that this approximation leads to improved performance for several image reconstruction tasks (Chung et al., 2023a). However, DPS comes with a higher computational cost, due to the need to back-propagate the gradient through the score model.

Recently, several works proposed modifying the DDIM sampling rule in Eq. (11) for conditional generation (Zhu et al., 2023; Chung et al., 2023b). These methods generally consist of three steps: (1) estimating the denoised image $\mathbf{x}_{0}$ using Tweedie's estimate $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$; (2) updating $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$ with a data consistency step using the measurements $\mathbf{y}$; and (3) adding the noise back, according to the DDIM update rule, to obtain a sample for the next time step $t_{k-1}$. Importantly, with this approach there is no need to estimate the gradient of the time-dependent likelihood $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})$, as data consistency is only enforced on Tweedie's estimate at $t=0$. These conditional DDIM samplers differ most in the implementation of the data consistency update. Decomposed Diffusion Sampling (DDS) (Chung et al., 2023b) proposes to align Tweedie's estimate with the measurements by running $p$ steps of a Conjugate Gradient (CG) scheme, minimising the negative log-likelihood, at each sampling step. Let $\text{CG}^{(p)}(\hat{\mathbf{x}}_{0})$ denote the $p$-th CG update initialised with $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$. This can be seen as an approximation to the conditional expectation, i.e. $\mathbb{E}[\mathbf{x}_{0}|\mathbf{x}_{t},\mathbf{y}]\approx\text{CG}^{(p)}(\hat{\mathbf{x}}_{0})$ (Ravula et al., 2023). Using this approximation, the update step for DDS can be written as

$$
\mathbf{x}_{t_{k-1}}=\gamma_{t_{k-1}}\text{CG}^{(p)}(\hat{\mathbf{x}}_{0})+\text{Noise}(\mathbf{x}_{t_{k}},s_{\theta})+\eta_{t_{k}}\mathbf{z},\quad\text{with }\mathbf{z}\sim\mathcal{N}(0,I),
\tag{18}
$$

where the introduction of the conditional expectation offers the possibility to explore different approximations specific to PET image reconstruction.
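The following sketch illustrates the three-step conditional DDIM structure with a CG data consistency step in the spirit of DDS (Eq. (18)), for a toy linear problem with Gaussian noise; the toy operator, the CG routine on the normal equations, and the placeholder score model are illustrative assumptions and do not correspond to the PET setting addressed in Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p = 64, 128, 3
A = rng.normal(size=(m, n)) / np.sqrt(m)       # toy linear forward operator (Gaussian-noise setting)
x_true = rng.uniform(0.0, 1.0, size=n)
y = A @ x_true + 0.01 * rng.standard_normal(m)

beta_min, beta_max = 0.1, 20.0
gamma = lambda t: np.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)
nu = lambda t: np.sqrt(1.0 - gamma(t) ** 2)
score_model = lambda x, t: -x                  # placeholder for a trained s_theta

def cg_steps(x, p):
    """p conjugate-gradient steps on the normal equations A^T A x = A^T y (least-squares data fit)."""
    AtA, Aty = A.T @ A, A.T @ y
    r = Aty - AtA @ x
    d = r.copy()
    for _ in range(p):
        Ad = AtA @ d
        alpha = (r @ r) / (d @ Ad + 1e-12)
        x = x + alpha * d
        r_new = r - alpha * Ad
        d = r_new + ((r_new @ r_new) / (r @ r + 1e-12)) * d
        r = r_new
    return x

eta, N = 0.2, 100
ts = np.linspace(1.0, 0.0, N + 1)
x = rng.standard_normal(n)                     # x_1 ~ N(0, I)
for t_k, t_km1 in zip(ts[:-1], ts[1:]):
    s = score_model(x, t_k)
    x0_hat = (x + nu(t_k) ** 2 * s) / gamma(t_k)       # (1) Tweedie's estimate
    x0_dc = cg_steps(x0_hat, p)                        # (2) data consistency, CG^(p)(x0_hat)
    beta_k = nu(t_km1) / nu(t_k) * np.sqrt(max(1.0 - gamma(t_k) / gamma(t_km1), 0.0))
    eta_k = eta * beta_k
    noise_term = -nu(t_k) * np.sqrt(max(nu(t_km1) ** 2 - eta_k ** 2, 0.0)) * s
    x = gamma(t_km1) * x0_dc + noise_term + eta_k * rng.standard_normal(n)   # (3) DDIM step, Eq. (18)
```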

3 PET-specific Adaptations for SGMs

To apply SGMs to PET reconstruction, several key components of the pipeline in Section 2.2 have to be modified to incorporate PET-specific constraints. Namely, we introduce a measurement-based normalisation of the input to the score model, and explain how to apply a score model trained on 2D slices for 3D reconstruction. Additionally, we adapt the sampling methods from Section 2.3 to incorporate the Poisson noise model. Finally, we demonstrate that the SGM framework allows for the incorporation of additional information, e.g. MR images, by using classifier-free guidance (Ho and Salimans, 2022). The overall adaptation steps are summarised in Fig. 1.

Figure 1: Schematic illustration of the modifications to the training and sampling steps of SGMs. Training: measurement-based normalisation of the SGM (Section 3.1) and incorporation of additional MR information (Section 3.4). Sampling: scaling to 3D reconstruction (Section 3.2) and modification of the sampling rules to include the PLL (Section 3.3).

3.1 Measurement-based Normalisation

The intensity of the unknown tracer distribution in emission tomography can vary significantly across different scans, resulting in a high dynamic range that poses challenges for deep learning approaches. Neural networks may exhibit bias toward intensity levels that appear more frequently in the training set. Consequently, the network might struggle to handle new images with unseen intensity levels, leading to instability in the learning and evaluation process (Tran et al., 2020). For SGMs the intensity range of images must be predefined to ensure that the forward diffusion process converges to a standard Gaussian distribution and to stabilise the sampling process (Lou and Ermon, 2023). Input normalisation is a standard deep learning methodology to deal with intensity shifts and normalise the inputs to the network. In a similar vein, we propose a PET-specific normalisation method to ensure that the score model $s_{\theta}(\mathbf{x}_{t},t)$ is able to estimate the score function of images with arbitrary intensity values. Namely, we normalise each training image $\mathbf{x}_{0}$ to ensure that voxel intensities are within a certain range. To do this we introduce a training normalisation factor $c_{\text{train}}$ that, when applied, ensures the average emission per emission voxel (a voxel with non-zero intensity value) is 1. This is computed as

$$
c_{\text{train}}=c(\mathbf{x}_{0}):=\frac{\sum_{j=1}^{n}[\mathbf{x}_{0}]_{j}}{\#\{j:[\mathbf{x}_{0}]_{j}>0\}},
\tag{19}
$$

where the numerator captures the total emission in the image, and the denominator is the number of emission voxels. The normalisation factor is incorporated into the DSM training objective function by rescaling the initial image, yielding the objective

$$
\mathbb{E}_{t\sim U[0,T]}\mathbb{E}_{\mathbf{x}_{0}\sim\pi}\mathbb{E}_{\mathbf{z}\sim\mathcal{N}(0,I)}\mathbb{E}_{c\sim U[\frac{c_{\text{train}}}{2},\frac{3c_{\text{train}}}{2}]}\left[\omega_{t}\left\|s_{\theta}(\tilde{\mathbf{x}}_{t},t)-\nabla_{\mathbf{x}}\log p_{t}(\tilde{\mathbf{x}}_{t}|\mathbf{x}_{0}/c)\right\|_{2}^{2}\right],
\tag{20}
$$

with $\tilde{\mathbf{x}}_{t}=\gamma_{t}\mathbf{x}_{0}/c+\nu_{t}\mathbf{z}$. Compared with Eq. (7), the scale factor $c\sim U[\frac{c_{\text{train}}}{2},\frac{3c_{\text{train}}}{2}]$ is used to encourage the score model to be more robust with respect to misestimation of the normalisation constant during sampling.
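A minimal PyTorch sketch of one step of the normalisation-aware DSM objective in Eqs. (19)-(20) follows; the tiny MLP score network, the $\omega_t=\nu_t^2$ weighting, and the closed-form VP-SDE coefficients are assumptions made for illustration, not the paper's training configuration.

```python
import torch

torch.manual_seed(0)
beta_min, beta_max, n = 0.1, 20.0, 64

def gamma_nu(t):
    # assumed closed-form VP-SDE transition-kernel coefficients (cf. Appendix A.1)
    gamma = torch.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)
    return gamma, torch.sqrt(1.0 - gamma**2)

# toy stand-in for the score model s_theta(x_t, t): an MLP on the concatenation [x_t, t]
score_model = torch.nn.Sequential(
    torch.nn.Linear(n + 1, 128), torch.nn.SiLU(), torch.nn.Linear(128, n))
opt = torch.optim.Adam(score_model.parameters(), lr=1e-4)

x0 = torch.rand(16, n) * 5.0                                   # toy batch of non-negative "images"
c_train = x0.sum(dim=1, keepdim=True) / (x0 > 0).sum(dim=1, keepdim=True)   # Eq. (19)

t = torch.rand(x0.shape[0], 1) * (1.0 - 1e-3) + 1e-3           # t ~ U(0, 1], avoiding nu_t = 0
c = c_train * (0.5 + torch.rand_like(c_train))                 # c ~ U[c_train/2, 3 c_train/2]
gamma_t, nu_t = gamma_nu(t)
z = torch.randn_like(x0)
x_tilde = gamma_t * x0 / c + nu_t * z                          # perturbed, normalised image
target = -z / nu_t                                             # score of p_t(x_tilde | x0/c)
pred = score_model(torch.cat([x_tilde, t], dim=1))
loss = ((nu_t**2) * (pred - target) ** 2).mean()               # omega_t = nu_t^2 (a common choice)
opt.zero_grad(); loss.backward(); opt.step()
```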

An analogue of Eq. (19) is unavailable during the sampling, and thus a surrogate is required. This is obtained through an approximate reconstruction, computed using a single epoch of OSEM from a constant non-negative initialisation. The resulting sampling normalisation factor is given by

$$
c_{\text{OSEM}}=\frac{\sum_{j=1}^{n}[\mathbf{x}_{\text{OSEM}}]_{j}}{\#\{j:[\mathbf{x}_{\text{OSEM}}]_{j}>Q_{0.01}\}},
\tag{21}
$$

where $Q_{0.01}$ is the 1st percentile of the $\mathbf{x}_{\text{OSEM}}$ values. This threshold is heuristically chosen to ensure that noise and reconstruction artefacts do not cause an over-estimation of the number of emission voxels. In the reconstruction process the normalisation constant $c_{\text{OSEM}}$ is applied as a scaling factor within the time-dependent likelihood, giving

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}(s_{\theta}(\mathbf{x}_{t},t)+\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|c_{\text{OSEM}}\mathbf{x}_{t}))]\,dt+g(t)\,d\bar{\mathbf{w}}_{t}.
\tag{22}
$$

At the final time step $t=0$, the output $\mathbf{x}$ is rescaled by $c_{\text{OSEM}}$ to recover the correct intensity level.
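A short NumPy sketch of the sampling-time surrogate in Eq. (21); the random non-negative volume standing in for a one-epoch OSEM reconstruction is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# stand-in for x_OSEM, which in the paper is obtained from a single epoch of OSEM
x_osem = np.clip(rng.normal(1.0, 0.6, size=100_000), 0.0, None)

q001 = np.quantile(x_osem, 0.01)                            # 1st-percentile threshold Q_0.01
c_osem = x_osem.sum() / np.count_nonzero(x_osem > q001)     # Eq. (21)
```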

3.2 Scaling to 3D Reconstruction

While some SGM studies deal with fully 3D image generation (Pinaya et al., 2022), the majority of work is restricted to 2D images. This is largely due to the fact that the learning of full 3D volume distributions is computationally expensive and requires access to many training volumes. Therefore, we propose to train the score model on 2D axial slices and use a specific decomposition of the conditional sampling rules to apply the model for 3D reconstruction. Upon simulating the conditional reverse SDE in Eq. (15) using the Euler-Maruyama approach, we arrive at the iteration rule

$$
\tilde{\mathbf{x}}_{t_{k-1}}=\mathbf{x}_{t_{k}}+\big[\mathbf{f}(\mathbf{x}_{t_{k}},t_{k})-g(t_{k})^{2}s_{\theta}(\mathbf{x}_{t_{k}},t_{k})\big]\Delta t+g(t_{k})\sqrt{|\Delta t|}\,\mathbf{z},\quad\mathbf{z}\sim\mathcal{N}(0,I),
\tag{23}
$$
$$
\mathbf{x}_{t_{k-1}}=\tilde{\mathbf{x}}_{t_{k-1}}-g(t_{k})^{2}\nabla_{\mathbf{x}}\log p_{t_{k}}(\mathbf{y}|\mathbf{x}_{t_{k}})\,\Delta t,
\tag{24}
$$

using an equidistant time discretisation $0=t_{k_{1}}\leq\dots\leq t_{k_{N}}=1$ for $N\in\mathbb{N}$, with time step $\Delta t=-1/N$. We split the Euler-Maruyama update into two equations to highlight the influence of the score model and of the measurements $\mathbf{y}$. First, Eq. (23) is the Euler-Maruyama discretisation of the unconditional reverse SDE, see Eq. (9). This update is independent of the measurements $\mathbf{y}$ and can be interpreted as a prior update, increasing the likelihood of $\mathbf{x}_{t_{k}}$ under the SGM. The second step, Eq. (24), is a data consistency update, pushing the current iterate to better fit the measurements. Notably, this step is fully independent of the score model. This strategy was developed for 3D reconstruction, focusing on sparse-view CT and MRI (Chung et al., 2023c). It was proposed to apply the prior update in Eq. (23) to all slices in the volume independently and to use the 3D forward model in the data consistency step. Further, a regulariser in the direction orthogonal to the slices was introduced to improve the consistency of neighbouring slices. However, applying this approach to the Euler-Maruyama discretisation results in slow sampling times, as a small time step $|\Delta t|$ is necessary. To accelerate the sampling of high-quality samples, we propose to use the DDS update in Eq. (18), which uses a similar decomposition into independent score model updates on axial slices and 3D data consistency updates. Additionally, we accelerate the data consistency updates by splitting the measurement data into ordered subsets and applying the forward model block-sequentially. The details are explained below.

3.3 Modifications of Sampling Methods

The sampling schemes and approximations in Section 2.3 were originally proposed for inverse problems with Gaussian noise. The work on DPS (Chung et al., 2023a) also considers inverse problems with Poisson noise, but utilises a Gaussian approximation of the Poisson noise, which is known to be unsuitable for PET reconstruction in the low photon count regime (Bertero et al., 2009; Hohage and Werner, 2016). To apply the Naive approximation or the DPS approach to Poisson inverse problems, one could simply replace the Gaussian log-likelihood with the PLL in Eq. (16) and Eq. (17). However, the PLL and its gradient are only defined for non-negative values. Therefore, we have to introduce a non-negativity projection into the sampling to ensure that the gradient of the PLL can be evaluated. In the context of guided diffusion, it was proposed to project the iterates $\mathbf{x}_{t_{k}}$ to a specified domain after each sampling step (Li et al., 2022; Saharia et al., 2022). In our case this would require thresholding all negative values. However, this creates a mismatch between the forward and reverse SDEs. It was observed that this mismatch results in artefacts in the reconstructions and may even lead to divergence of the sampling (Lou and Ermon, 2023). In our experiments, thresholding all negative values of $\mathbf{x}_{t_{k}}$ leads to divergence of the sampling process. Therefore, we propose to only threshold the input to the PLL, i.e. with $L$ being the PLL, see Eq. (2), for the PET-Naive approximation we use

$$
\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})\approx\lambda_{t}^{\text{Naive}}\nabla_{\mathbf{x}}L(\mathbf{y}|c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\mathbf{x}_{t}]),
\tag{25}
$$

and likewise for PET-DPS

$$
\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})\approx\lambda_{t}^{\text{DPS}}\nabla_{\mathbf{x}}L(\mathbf{y}|c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\hat{\mathbf{x}}_{0}(\mathbf{x}_{t})]).
\tag{26}
$$

Note that this leads to a perturbed likelihood gradient that is not computed at the true iterate $\mathbf{x}_{t}$, but only at its projection. In order to reconstruct the PET image we have to solve the reverse SDE using the specific approximation (PET-Naive or PET-DPS) as the likelihood term. This usually requires around 1000 sampling steps to produce an appropriate reconstruction and results in impractically long reconstruction times for 3D volumes.
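The structure of the two approximations can be sketched with PyTorch autograd as follows; the toy system matrix, the placeholder score network, the placeholder VP-SDE coefficients, and the unit normalisation constant are assumptions, and only the form of Eqs. (25)-(26) (projection of the PLL input, and back-propagation through the Tweedie estimate for PET-DPS) is intended to be illustrative.

```python
import torch

torch.manual_seed(0)
n, m = 64, 128
A = torch.rand(m, n) / n                          # toy stand-in for the PET system matrix
b_bar = torch.full((m,), 0.5)
y = torch.poisson(A @ (torch.rand(n) * 4.0) + b_bar)
c_osem = 1.0                                      # assumed precomputed sampling normalisation factor

score_model = torch.nn.Sequential(
    torch.nn.Linear(n, 128), torch.nn.SiLU(), torch.nn.Linear(128, n))

def pll(x):
    """Poisson log-likelihood, Eq. (2), up to the constant log(y!) term."""
    y_bar = A @ x + b_bar
    return (y * torch.log(y_bar) - y_bar).sum()

x_t = torch.randn(n, requires_grad=True)
gamma_t, nu_t = 0.7, 0.71                         # placeholder VP-SDE coefficients at the current t

# PET-Naive, Eq. (25): PLL gradient at the projected iterate itself
grad_naive = torch.autograd.grad(pll(c_osem * torch.relu(x_t)), x_t)[0]

# PET-DPS, Eq. (26): PLL gradient at the projected Tweedie estimate,
# back-propagated through the score model
x0_hat = (x_t + nu_t**2 * score_model(x_t)) / gamma_t
grad_dps = torch.autograd.grad(pll(c_osem * torch.relu(x0_hat)), x_t)[0]
```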

To reduce the reconstruction times we propose to modify the conditional DDIM sampling rule, similar to the DDS framework (Zhu et al., 2023; Chung et al., 2023b) discussed in Section 2.3; we call the resulting method PET-DDS. This circumvents the use of $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x}_{t})$, instead enforcing data consistency on Tweedie's estimate $\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})$. For PET reconstruction we propose to implement this data consistency with a MAP objective, leading to the PET-DDS update

$$
\mathbf{x}^{0}_{t_{k}}=\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}}),
\tag{27}
$$
$$
\mathbf{x}^{i+1}_{t_{k}}=P_{\mathbf{x}\geq 0}\left[\mathbf{x}^{i}_{t_{k}}+\mathbf{D}(\mathbf{x}^{i}_{t_{k}})\nabla_{\mathbf{x}}\Phi_{j}(\mathbf{x}^{i}_{t_{k}})\right],\quad i=0,\dots,p-1,
\tag{28}
$$
$$
\mathbf{x}_{t_{k-1}}=\gamma_{t_{k-1}}\mathbf{x}^{p}_{t_{k}}+\text{Noise}(\mathbf{x}_{t_{k}},s_{\theta})+\eta_{t_{k}}\mathbf{z},\quad\mathbf{z}\sim\mathcal{N}(0,I),
\tag{29}
$$

where $\mathbf{D}(\mathbf{x})=\mathrm{diag}\left\{\max(\mathbf{x},10^{-4})/\mathbf{A}^{\top}\mathbf{1}\right\}$ is the preconditioner and the sub-objective is

$$
\Phi_{j}(\mathbf{x}^{i})=L_{j}(\mathbf{y}|c_{\text{OSEM}}\mathbf{x}^{i})+\big(\lambda^{\text{RDP}}R_{z}(\mathbf{x}^{i})-\lambda^{\text{DDS}}\|\mathbf{x}^{i}-\hat{\mathbf{x}}_{0}\|_{2}^{2}\big)/n_{\mathrm{sub}}.
\tag{30}
$$

The sub-objective index $j=j(i)$ is given by $j=(p(N-k)+i \bmod n_{\mathrm{sub}})+1$, which cyclically accesses the sub-objectives. The RDP used for 3D data, $R_{z}$, is applied in the $z$-direction, perpendicular to the axial slices, see Appendix A.3 for more details. The prior in Eq. (30) consists of two components: one anchoring to Tweedie's estimate, $\|\mathbf{x}-\hat{\mathbf{x}}_{0}\|_{2}^{2}$, and the other the RDP in the $z$-direction, $R_{z}(\mathbf{x})$. The components have associated penalty strengths $\lambda^{\text{DDS}}$ and $\lambda^{\text{RDP}}$, respectively.

In a PET-DDS update we first independently compute Tweedie's estimate based on $\mathbf{x}_{t_{k}}$ for each axial slice (Eq. (27)). Tweedie's estimate $\hat{\mathbf{x}}_{0}$ impacts the reconstruction in two ways: first through the Tikhonov regulariser scaled with $\lambda^{\text{DDS}}$, and second as the initial value for the projected gradient scheme in Eq. (28). By running $p$ steps of this scheme, consistency is balanced between the PLL on the measurements, the RDP in the $z$-direction, and Tweedie's estimate (Eq. (30)). To speed up the computation of the objective gradient, the objective is split into sub-objectives and the gradient of the log-likelihood is evaluated using only subsets of the measurements $\mathbf{y}$, similar to the BSREM update in Eq. (4). The subsets are partitioned in a staggered configuration and are ordered with a Herman-Meyer order (Herman and Meyer, 1993). Eq. (29) is the DDIM update applied to the conditioned Tweedie estimate $\mathbf{x}^{p}_{t_{k}}$, where the score update is again applied independently to each axial slice; here the notation $\text{Noise}(\mathbf{x}_{t_{k}},s_{\theta})$ is overloaded. The DDIM update gives $\mathbf{x}_{t_{k-1}}$, and these PET-DDS updates repeat until $t_{0}=0$, giving the reconstruction $\hat{\mathbf{x}}$.

Table 1: Summary of the different sampling schemes proposed for PET.

Method | Sampling type | Data consistency with $L$ in (2) | Algorithm
PET-Naive | Euler-Maruyama (23) | $L(\mathbf{y}|c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\mathbf{x}])$, Eq. (25) | Algo. 1
PET-DPS | Euler-Maruyama (23) | $L(\mathbf{y}|c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\hat{\mathbf{x}}_{0}(\mathbf{x})])$, Eq. (26) | Algo. 2
PET-DDS | DDIM (11) | $L_{j}(\mathbf{y}|c_{\text{OSEM}}\mathbf{x})$, Eqs. (28) and (30) | Algo. 3

In Table 1, we list the details and differences of the proposed score-based schemes for PET.
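To make the structure of Eqs. (27)-(29) explicit, the following is a minimal 2D NumPy sketch of the PET-DDS loop; the toy system matrix, the placeholder score model, the closed-form VP-SDE coefficients, the penalty strengths, and the omission of the $z$-direction RDP (which only applies in 3D) are all illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, n_sub, p = 64, 128, 4, 3
A = rng.uniform(0.0, 1.0, size=(m, n)) / n        # toy stand-in for the PET system matrix
b_bar = np.full(m, 0.5)
y = rng.poisson(A @ rng.uniform(0.0, 4.0, n) + b_bar)
subsets = np.array_split(np.arange(m), n_sub)
sens = A.T @ np.ones(m)                           # sensitivity image A^T 1
c_osem, lam_dds = 1.0, 1.0                        # normalisation factor and anchor strength (placeholders)

beta_min, beta_max = 0.1, 20.0
gamma = lambda t: np.exp(-0.25 * t**2 * (beta_max - beta_min) - 0.5 * t * beta_min)
nu = lambda t: np.sqrt(1.0 - gamma(t) ** 2)
score_model = lambda x, t: -x                     # placeholder for the trained s_theta

def grad_sub_objective(x, x0_hat, rows):
    """Gradient of Phi_j in Eq. (30): subset PLL plus Tweedie anchor (z-direction RDP omitted in 2D)."""
    Aj, yj, bj = A[rows], y[rows], b_bar[rows]
    grad_pll = c_osem * Aj.T @ (yj / (Aj @ (c_osem * np.maximum(x, 0.0)) + bj) - 1.0)
    return grad_pll - 2.0 * lam_dds * (x - x0_hat) / n_sub

eta, N = 0.2, 100
ts = np.linspace(1.0, 0.0, N + 1)
x = rng.standard_normal(n)
sub_idx = 0
for t_k, t_km1 in zip(ts[:-1], ts[1:]):
    s = score_model(x, t_k)
    x0_hat = (x + nu(t_k) ** 2 * s) / gamma(t_k)                        # Eq. (27)
    xi = x0_hat.copy()
    for _ in range(p):                                                  # Eq. (28): p preconditioned steps
        rows = subsets[sub_idx % n_sub]; sub_idx += 1
        precond = np.maximum(xi, 1e-4) / sens                           # D(x) with delta = 1e-4
        xi = np.maximum(xi + precond * grad_sub_objective(xi, x0_hat, rows), 0.0)
    beta_k = nu(t_km1) / nu(t_k) * np.sqrt(max(1.0 - gamma(t_k) / gamma(t_km1), 0.0))
    eta_k = eta * beta_k
    noise_term = -nu(t_k) * np.sqrt(max(nu(t_km1) ** 2 - eta_k ** 2, 0.0)) * s
    x = gamma(t_km1) * xi + noise_term + eta_k * rng.standard_normal(n)  # Eq. (29)
x_rec = c_osem * np.maximum(x, 0.0)               # rescale the final output by c_OSEM
```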

3.4 MR Image Guided Reconstruction

In recent years, several regularisation methods have been proposed which leverage the availability of additional MR images to improve PET image reconstruction (Ehrhardt et al., 2016; Bai et al., 2013; Somayajula et al., 2011). These studies often encode anatomical features of the MR image as edges or level sets and build hand-crafted regularisers based on these encoded features. This is commonly referred to as guided reconstruction (Ehrhardt, 2021), where the MR image is first reconstructed and is then used in the PET reconstruction pipeline. The SGM approach can be modified for guided reconstruction. In this setting we can use Bayes’ theorem to express the posterior

$$
\nabla_{\mathbf{x}}\log p^{\text{post}}(\mathbf{x}|\mathbf{y},\mathbf{x}_{\text{MR}})=\nabla_{\mathbf{x}}\log p^{\text{lkhd}}(\mathbf{y}|\mathbf{x})+\nabla_{\mathbf{x}}\log\pi(\mathbf{x}|\mathbf{x}_{\text{MR}}),
\tag{31}
$$

assuming that both $\mathbf{y}$ and $\mathbf{x}_{\text{MR}}$ are conditionally independent given $\mathbf{x}$. Here, the likelihood $p^{\text{lkhd}}(\mathbf{y}|\mathbf{x})$ is given by the Poisson noise model and $\pi(\mathbf{x}|\mathbf{x}_{\text{MR}})$ is a prior conditioned on the MR image $\mathbf{x}_{\text{MR}}$, which will be learned via a score model. Using this decomposition, the reverse SDE, given the MR image, can be written as

$$
d\mathbf{x}_{t}=[\mathbf{f}(\mathbf{x}_{t},t)-g(t)^{2}\left(\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x})+\nabla_{\mathbf{x}}\log\pi(\mathbf{x}|\mathbf{x}_{\text{MR}})\right)]\,dt+g(t)\,d\bar{\mathbf{w}}_{t}.
\tag{32}
$$

We can use PET-Naive or PET-DPS to approximate the score of the time-dependent likelihood $\nabla_{\mathbf{x}}\log p_{t}(\mathbf{y}|\mathbf{x})$. However, we have to train a score model, conditioned on the MR image, to estimate the conditional score function

$$
s_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}})\approx\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}|\mathbf{x}_{\text{MR}}),
\tag{33}
$$

where $\mathbf{x}_{\text{MR}}$ is an additional input to the score model. This method was recently proposed and applied to PET image denoising (Gong et al., 2022). To train such a score model we need a paired dataset $\{(\mathbf{x}^{i},\mathbf{x}_{\text{MR}}^{i})\}_{i=1}^{m}$ of PET images and corresponding MR images. In contrast, using the Classifier-Free Guidance (CFG) framework (Ho and Salimans, 2022), we only need a partly paired dataset, i.e. besides paired data $\{(\mathbf{x}^{i},\mathbf{x}_{\text{MR}}^{i})\}_{i=1}^{m_{1}}$ we can also make use of unpaired data $\{\mathbf{x}^{i}\}_{i=1}^{m_{2}}$. In particular, CFG trains both a conditional and an unconditional score model simultaneously and utilises their combination during the sampling process. CFG uses a zero image $\mathbf{0}$ to distinguish between the conditional and unconditional score model

$$
s_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}})\approx\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{x}_{\text{MR}})\quad\text{and}\quad s_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}}=\mathbf{0})\approx\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}),
\tag{34}
$$

yielding a conditional DSM objective

$$
\mathbb{E}_{t\sim U[0,T]}\mathbb{E}_{\mathbf{x}_{0},\mathbf{x}_{\text{MR}}\sim\pi}\mathbb{E}_{\mathbf{x}_{t}\sim p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})}\mathbb{E}_{\rho\sim B(q)}\left[\omega_{t}\|s_{\theta}(\mathbf{x}_{t},t;\rho\,\mathbf{x}_{\text{MR}})-\nabla_{\mathbf{x}}\log p_{t}(\mathbf{x}_{t}|\mathbf{x}_{0})\|_{2}^{2}\right],
\tag{35}
$$

where $B(q)$ is a Bernoulli distribution with parameter $q$. Thus, if the additional MR input is set to zero, the conditional DSM loss matches the unconditional DSM loss defined in Eq. (7). After training, CFG defines a combined score model

$$
\tilde{s}_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}})=(1+w)\,s_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}})-w\,s_{\theta}(\mathbf{x}_{t};t,\mathbf{0}),
\tag{36}
$$

as a linear combination with $w$ as the guidance strength. This combined score model $\tilde{s}_{\theta}(\mathbf{x}_{t};t,\mathbf{x}_{\text{MR}})$ can then be used with any of the presented sampling methods.
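A minimal PyTorch sketch of classifier-free guidance as in Eqs. (34)-(36) follows; the concatenation-based conditioning, the tiny MLP, and the particular value of $q$ are assumptions for illustration, not the paper's network or hyperparameters.

```python
import torch

torch.manual_seed(0)
n, q, w = 64, 0.9, 1.0          # image size, Bernoulli keep probability, guidance strength

# stand-in conditional score model s_theta(x_t; t, x_MR), conditioned by concatenation
net = torch.nn.Sequential(
    torch.nn.Linear(2 * n + 1, 128), torch.nn.SiLU(), torch.nn.Linear(128, n))

def s_theta(x_t, t, x_mr):
    return net(torch.cat([x_t, x_mr, t], dim=1))

# training: with probability 1 - q the MR image is replaced by a zero image, so the same
# network learns both the conditional and the unconditional score, cf. Eqs. (34)-(35)
x_t = torch.randn(8, n); t = torch.rand(8, 1); x_mr = torch.rand(8, n)
rho = (torch.rand(8, 1) < q).float()               # rho ~ B(q)
pred = s_theta(x_t, t, rho * x_mr)                 # plugged into the DSM loss of Eq. (35)

# sampling: classifier-free-guided score, Eq. (36)
s_cond = s_theta(x_t, t, x_mr)
s_uncond = s_theta(x_t, t, torch.zeros_like(x_mr))
s_guided = (1 + w) * s_cond - w * s_uncond
```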

4 Experimental Setup

4.1 Dataset and Evaluation Metrics

We use the BrainWeb dataset consisting of 20 patient-realistic volumes (Aubert-Broche et al., 2006). The simulated tracer was 18F-Fluorodeoxyglucose (FDG), and the volumes were further perturbed by three realisations of random distortions (Schramm, 2021). 19 of the 20 volumes were used for training. Axial slices with no non-zero intensity were removed, resulting in a training dataset of 4569 slices. We conducted two simulations for the evaluation: an image simulation and a measurement simulation. The image simulation utilised the segmentation mask of the held-out volume, subject 04. In 2D, an FDG tracer was simulated by perturbing the contrast, applying a bias field to grey matter, blurring, and adding noise as per Schramm (2021). In 3D, FDG and Amyloid tracers were independently simulated, and blurring and noise were added. The Amyloid tracer was included to provide a further OOD evaluation set. To further diversify the evaluation sets, simulated lesions were included in both 3D simulations and added to an additional 2D evaluation set. Lesions were simulated as local regions of ellipsoidal hyper-intensity of random size and location within soft tissue, and allowed local evaluation metrics to be computed. The measurement simulation is described below separately for the 2D and 3D cases.

For 2D evaluation, resolution modelling, attenuation, sensitivity, and background contamination were modelled and subsequently included in the forward model using ParallelProj (Schramm, 2022). We use 20 equidistant axial slices from the simulation of subject 04. The noise level of the simulated measurements was set by re-scaling forward-projected ground truth images, where the scale ensured that the total counts divided by the emission volume was 2.5 or 10. These rescaled measurements are the clean measurements, which were then corrupted with Poisson noise and constant background contamination. In addition, 10 noise realisations were obtained. Herein, we refer to the noise levels as 2.5 and 10, where the total true counts averaged over the evaluation dataset were 122 808 and 491 232, respectively.

For the 3D evaluation, measurements of subject 04 were simulated with a Siemens Biograph mMR scanner geometry (Karlberg et al., 2016). Measurements with detector sensitivities and attenuation were simulated and included in the forward model using SIRF and STIR (Ovtchinnikov et al., 2020; Thielemans et al., 2012). The noise level was equivalent to 40 million counts without background, and 5 noisy realisations were obtained. Unless otherwise specified, the projector and measurements were split into 28 ordered subsets in the experiments below.

We evaluate the performance between reconstructions and ground truths using two global metrics: Peak-Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) (Wang et al., 2004). Moreover, we compute two local quality scores over a Region of Interest (ROI). First, to quantify the detectability of lesions, we compute the Contrast Recovery Coefficient (CRC). Second, the noise in reconstructions is estimated over background ROIs using Standard Deviation (STD), which computes standard deviation across realisations and then averages over ROI voxels (Tong et al., 2010). In 2D, we evaluate the reconstruction consistency by computing the Kullback-Leibler Divergence (KLDIV) between measurements $\mathbf{y}$ and estimated measurements $\bar{\mathbf{y}} = \mathbf{A}\hat{\mathbf{x}} + \bar{\mathbf{b}}$, where $\hat{\mathbf{x}}$ denotes the reconstruction. Furthermore, we include the “mean KL” between the noisy and clean measurements across the 2D evaluation dataset. More information about quality metrics can be found in Appendix A.4.

We present tables of the best performing methods with optimal penalty strengths, as well as qualitative figures of reconstructed images. Furthermore, to allow direct comparison between methods, we give sensitivity plots of PSNR, SSIM, KLDIV or CRC vs. STD. Since STD gives an estimate of the noise in the image, these plots show the effect of varying the penalty strength on reconstruction quality or data-consistency. A lower STD typically corresponds to lower data-fidelity (a higher prior strength), and the converse is true for a higher STD. In practice, with a generative model as a prior, higher penalty strengths do not necessarily lead to a lower STD, as there may be multiple reconstructions with high likelihood under the model. Variations in STD are further exacerbated by the approximate nature and stochasticity of SGM sampling.

4.2 Comparison Methods

In the 2D setting, we compare against two established supervised learning methods used in medical image reconstruction: the UNet post-processing FBPConvNet (Jin et al., 2017) and unrolled iterative learned primal dual (Adler and Öktem, 2018). We modify both models for PET reconstruction; the post-processing method is referred to as PET-UNet, and the unrolled method as PET-LPD. Additionally we compare against a state-of-the-art SGM approach for PET image denoising (Gong et al., 2022), referred to as Naive (OSEM). This denoising approach replaces the likelihood on the measurements with a likelihood modelled as a Gaussian centred at the noisy reconstruction. Therefore, Naive (OSEM) is able to use the same pre-trained score model as our proposed PET-Naive, PET-DPS, and PET-DDS methods.

For 3D evaluation, Deep Image Prior (DIP) reconstruction was included as an unsupervised comparison method, with a 3D network architecture well established in the literature (Gong et al., 2019; Ote et al., 2023; Singh et al., 2023). For comparison, converged MAP solutions with an RDP regulariser were also computed; the BSREM algorithm was used with a range of penalty strengths, cf. the PET background in Sect. 2.1.

Further details on all comparison methods can be found in Appendix A.3.

5 Numerical Experiments

The first set of experiments investigates the performance of the SGM methods (Naive (OSEM), PET-Naive, PET-DPS, PET-DDS) against one another and against established supervised methods (PET-UNet, PET-LPD). This is done in 2D at two noise levels, with and without lesions. In the second set of experiments we present results with MR image guidance. The last set of experiments investigates the best performing SGM method (PET-DDS) on 3D reconstruction, and provides a comparison against classical MAP and state-of-the-art DIP reconstructions with lesions and two simulated tracers. For all SGM results we make use of a single score model trained on the dataset of axial BrainWeb slices described in Section 4.1. Details about the training process and network architecture can be found in Appendix A.1. Further results can be found in Appendix B. All results were computed with a single NVIDIA GeForce RTX 3090.

5.1 2D Reconstruction

The aim of the 2D experiments is to benchmark the SGM and supervised methods, and to analyse the stability of the SGM methods with respect to the choice of the penalty strengths $\lambda_t^{\text{Naive}}$, $\lambda_t^{\text{DPS}}$ and $\lambda^{\text{DDS}}$. The penalty strengths for PET-Naive and PET-DPS depend on the time step $t$; details about their specific choice can be found in Appendix A.2.

5.1.1 Reconstruction without Lesion

The results in Fig. 2 show that the performance of the four SGM methods varies greatly for data at noise level 2.5 with no lesions. PET-DPS is the best performing method, consistently giving high PSNR and SSIM and low KLDIV values. However, it is also computationally the most expensive, requiring 1000 steps with back-propagation through the score model. PET-DDS performs competitively with a much lower computational overhead of 100 steps without score-model back-propagation. Naive (OSEM) performs well with regards to PSNR, but performs poorly in terms of data-consistency (KLDIV) and SSIM. As Naive (OSEM) computes the likelihood on an early-stopped OSEM image, increasing data-consistency ensures the reconstruction approaches the OSEM image. The maximum achievable likelihood of Naive (OSEM) does not give a KLDIV lower than the “mean KL”; hence it is not deemed a strong surrogate for the true likelihood computed on the measurements. The PET-Naive reconstructions have substantially higher STD values. This is attributed to instability when computing the PLL gradient, due to the non-negativity projection being applied directly to $\mathbf{x}_t$.

Figure 2: Results for BrainWeb without lesions with noise level 2.5 for different penalty parameters. Standard deviation is across reconstructions from different realisations of measurements. The points represent different values of the parameter $\lambda$; the markers ✙ and ◆ denote the smallest and largest value of $\lambda$, respectively.
Table 2: The mean quality score and standard error using the best hyperparameters for each method for BrainWeb without lesions at noise levels 2.5 (out-of-distribution) and 10 (in-distribution). The penalty strength used for each SGM method is denoted by $\lambda$. The best SGM is highlighted in grey, and the overall best metric is underlined. Supervised methods are trained on data with noise levels 5, 10 and 50.
| Noise Level | Model | Method | PSNR, $\lambda$ | SSIM, $\lambda$ |
|---|---|---|---|---|
| 2.5 | Score-based generative | Naive (OSEM) | 22.38 ± 0.82, 0.527 | 0.770 ± 0.02, 3.08 |
| | | PET-Naive | 21.52 ± 0.84, 12.0 | 0.781 ± 0.01, 12.0 |
| | | PET-DPS | 22.80 ± 0.81, 650 | 0.818 ± 0.01, 750 |
| | | PET-DDS | 22.46 ± 0.82, 0.25 | 0.789 ± 0.02, 0.2 |
| | Supervised | PET-LPD | 23.06 ± 0.85, N/A | 0.806 ± 0.01, N/A |
| | | PET-UNet | 22.80 ± 0.82, N/A | 0.798 ± 0.01, N/A |
| 10 | Score-based generative | Naive (OSEM) | 23.40 ± 0.84, 0.2 | 0.793 ± 0.02, 0.9 |
| | | PET-Naive | 22.81 ± 0.87, 10 | 0.815 ± 0.01, 10 |
| | | PET-DPS | 23.70 ± 0.83, 400 | 0.850 ± 0.01, 400 |
| | | PET-DDS | 23.55 ± 0.77, 0.025 | 0.849 ± 0.01, 0.025 |
| | Supervised | PET-LPD | 24.74 ± 0.91, N/A | 0.861 ± 0.01, N/A |
| | | PET-UNet | 24.52 ± 0.85, N/A | 0.868 ± 0.01, N/A |

In Table 2 we show quantitative results for the optimal penalty strength choice for each metric, and comparisons against PET-UNet and PET-LPD. These supervised methods are trained on data with noise levels of 5, 10 and 50 without lesions. Using noise levels 2.5 and 10 in evaluation allows investigating the effect of OOD noise levels on supervised methods. PET-LPD is the best performing method, giving the best SSIM at noise level 10 and the best PSNR at both noise levels. Between noise levels 10 and 2.5, PET-LPD observes a drop of 6.7% and 6.6% for PSNR and SSIM, whereas PET-DPS exhibits a drop of 3.4% and 3.8%, respectively. PET-DPS performs competitively across both noise levels and metrics, and gives the best SSIM value at noise level 2.5. The competitive performance and the reduced drop in quality metrics with increasing noise provide evidence that PET-DPS is more robust to different noise levels. This may be attributed to the unsupervised nature of SGM methods: as they are not trained on data of given noise levels, they are less affected by distributional differences in noise levels between evaluation and training. However, a more comprehensive study over a wider range of noise levels is needed to robustly support this observation. Interestingly, the standard deviation of the different samples is non-negligible in terms of PSNR, but the SSIM is nearly independent of the samples. The larger standard deviation for the PSNR is due to the different dynamic range, e.g., the maximum intensity varies between 11.89 and 24.15 for different slices at noise level 2.5. We leave a full investigation of uncertainty quantification using SGM methods to future work. The supervised methods, i.e., PET-LPD and PET-UNet, perform better than score-based models for both noise levels. Note that this margin is larger for the in-distribution case, i.e., noise level 10.

In the table, we also observe a pronounced change in the selected $\lambda$ values between noise levels 2.5 and 10 for PET-DDS, especially when compared with the other methods. This is attributed to the following: the BSREM-like data consistency updates are run for only a small number of steps $p$ (see Appendix A.2 for the heuristic rule used to determine $p$), so they do not yield a converged MAP estimate, and the resulting implicit regularisation depends on how quickly the iterates converge towards the noisy data. Since iterative reconstructions at noise level 2.5 fit to noise more quickly, much stronger regularisation is needed than at noise level 10.

Table 3: The computing time of a single reconstruction, averaged over 5 reconstructions.
| Method | PET-Naive | PET-DPS | PET-DDS |
|---|---|---|---|
| Time (s) | 41.52 | 43.64 | 3.90 |

In Table 3, we compare the computing time for a single reconstruction. PET-Naive and PET-DPS are largely comparable in terms of inference efficiency, and both are about ten times slower than PET-DDS. The difference in computing times is attributed to PET-DDS requiring far fewer time steps through the use of the accelerated DDIM sampling method.

5.1.2 2D Reconstruction with Lesion

As the score model was trained on data without lesions, testing on data with simulated hot lesions gives an insight into generalisability to OOD data. The quantitative results in Fig. 3 and Table 4 are consistent with those for data with no lesions in Fig. 2. CRC was computed to quantify the detectability of hot lesions. The CRC results indicate that PET-DDS is better at resolving lesions than the other SGM methods. Further, Fig. 3 shows a clear trade-off between reconstruction quality in terms of PSNR and SSIM and visibility of lesions: lower regularisation results in better performance in terms of CRC. Results for noise level 10 are shown in Appendix B.1.

Figure 3: Results for BrainWeb with lesions with noise level 2.5 for different penalty parameters. Standard deviation is across reconstructions from different realisations of measurements. The points represent different values of the parameter $\lambda$; the markers ✙ and ◆ denote the smallest and largest numerical value of $\lambda$, respectively.

Comparing the results between noise levels 2.5 and 10 in Table 4, we observe that SGMs increase CRC values compared to the supervised methods. SGMs also compare favourably with regards to PSNR and SSIM. CRC is a local metric that is more relevant than PSNR or SSIM in a clinical setting, as it quantifies the detectability of lesions. Therefore, it is of greater interest to improve this local metric rather than the global metrics. From this perspective, SGMs outperform the supervised methods, and the best-performing SGM methods are PET-DPS and PET-DDS. Given the observed performance and the lower computational overhead, PET-DDS is considered the most appropriate method to test for guided reconstruction and in the 3D setting.

Numerically, we observe that PET-DDS performs better than PET-DPS in terms of CRC but worse in terms of PSNR and SSIM. Note that PSNR and SSIM are global quality metrics, evaluating the full reconstructed image relative to the reference, whereas CRC is a local quality metric that focuses on the contrast in a specific ROI. While both global and local measures are desirable for PET reconstructions, they are not always consistent. For example, smoothing may suppress the noise and increase PSNR and SSIM, but it can reduce the contrast, leading to a worse CRC. Likewise, if the method preserves details, it may increase the contrast at the expense of having a higher noise in the image and decreasing PSNR and SSIM. Thus, the results in Table 4 indicate that PET-DDS is more effective in enhancing the contrast of the ROIs, but less effective in suppressing the noise, when compared to PET-DPS.

Table 4: Results using the best hyperparameters for each method for BrainWeb with lesions at noise levels 2.5 and 10. The penalty strength used for each SGM method is denoted by $\lambda$. The best score-based method is highlighted in grey. The overall best score per noise level is underlined.
| Noise Level | Model | Method | PSNR, $\lambda$ | SSIM, $\lambda$ | CRC, $\lambda$ |
|---|---|---|---|---|---|
| 2.5 | Score-based generative | Naive (OSEM) | 27.60 ± 0.87, 0.527 | 0.821 ± 0.02, 1.71 | 0.891 ± 0.02, 50 |
| | | PET-Naive | 26.82 ± 0.90, 12 | 0.817 ± 0.02, 12 | 0.908 ± 0.03, 50 |
| | | PET-DPS | 27.99 ± 0.85, 625 | 0.855 ± 0.01, 650 | 0.886 ± 0.02, 1500 |
| | | PET-DDS | 27.46 ± 0.83, 0.15 | 0.841 ± 0.01, 0.15 | 0.977 ± 0.01, 0.01 |
| | Supervised | PET-LPD | 28.40 ± 0.92, N/A | 0.853 ± 0.01, N/A | 0.865 ± 0.03, N/A |
| | | PET-UNet | 27.74 ± 0.83, N/A | 0.836 ± 0.01, N/A | 0.805 ± 0.03, N/A |
| 10 | Score-based generative | Naive (OSEM) | 28.87 ± 0.93, 0.25 | 0.847 ± 0.01, 0.9 | 0.902 ± 0.02, 4 |
| | | PET-Naive | 28.07 ± 0.94, 10 | 0.845 ± 0.01, 7.5 | 0.911 ± 0.02, 20 |
| | | PET-DPS | 29.01 ± 0.87, 400 | 0.878 ± 0.01, 400 | 0.920 ± 0.02, 550 |
| | | PET-DDS | 28.99 ± 0.88, 0.025 | 0.879 ± 0.01, 0.025 | 1.00 ± 0.01, 0.0* |
| | Supervised | PET-LPD | 30.07 ± 0.96, N/A | 0.894 ± 0.01, N/A | 0.904 ± 0.02, N/A |
| | | PET-UNet | 29.41 ± 0.82, N/A | 0.889 ± 0.01, N/A | 0.865 ± 0.03, N/A |

*Regularised due to denoised score estimate initialisation.

5.1.3 MR Guided Reconstruction

Experiments with and without additional MR image guidance were conducted to illustrate the flexibility of the proposed approach, tested at three guidance strengths $w = 0.25$, $0.5$, and $1.0$, where a guidance strength $w$ closer to zero constitutes more guidance. We use a high-resolution, low-noise simulated T1-weighted MR image. For the data with lesions, the lesions were simulated only in the PET images and not in the MR image, which allows testing a worst-case scenario in which the MR image misrepresents the lesions and clinically important features are present only in the PET image. The results with the best hyper-parameters are given in Table 5. It is observed that there are significant improvements in PSNR (>18%) and SSIM (>13%) with guidance. The results with lesions show that increasing the guidance decreased the CRC values, making the lesions more difficult to detect (see Fig. 13). Conversely, the PSNR and SSIM values on the data with lesions increased as $w$ approached zero (more guidance). This highlights the potential dangers of guidance, as well as the importance of evaluating both local and global quality metrics.

Table 5: Results using the best hyperparameters for the SGM methods for noise level 2.5 with MR image guidance. The penalty strength used for each SGM method is denoted by $\lambda$. The best method by performance metric is highlighted in grey for with/without lesions. The penalty strength is tuned for each method individually.
| Method | without lesions: PSNR, $\lambda$ | without lesions: SSIM, $\lambda$ | with lesions: PSNR, $\lambda$ | with lesions: SSIM, $\lambda$ | with lesions: CRC, $\lambda$ |
|---|---|---|---|---|---|
| DDS (w/o MR) | 22.46, 0.25 | 0.789, 0.2 | 27.46, 0.15 | 0.841, 0.15 | 0.910, 0.01 |
| DDS $w=0.25$ | 30.22, 0.35 | 0.950, 0.35 | 31.21, 0.15 | 0.954, 0.25 | 0.726, 0.0 |
| DDS $w=0.5$ | 29.32, 0.25 | 0.940, 0.25 | 31.12, 0.15 | 0.946, 0.25 | 0.778, 0.0 |
| DDS $w=1.0$ | 26.66, 0.15 | 0.899, 0.15 | 29.31, 0.1 | 0.906, 0.15 | 0.939, 0.0 |

Fig. 4 presents reconstructions without guidance and with guidance of various strengths, for data without and with lesions at noise level 2.5. The reconstructions indicate that MR guidance helps to reconstruct specific anatomical boundaries and structures, e.g., white matter tracts. In Appendix B.2 we give additional qualitative slices with and without lesions, cf. Figs. 12 and 13, and the associated sensitivity plots in Figs. 10 and 11.

Figure 4: Comparison of single-slice reconstructions with PET-DDS, MR-guided vs. unguided, at noise level 2.5, without lesion (top) and with lesion (bottom).

5.2 3D Reconstruction

Full 3D reconstructions were analysed for two tracers with simulated lesions. We evaluate the performance of PET-DDS with additional RDP regularisation in the $z$-direction perpendicular to the axial slices (termed RDPz), and introduce subset-based data consistency updates as in Eq. (28). Acceleration of PET-DDS was obtained through the use of subset-based data consistency updates, see Table 6; 28 subsets were used for the further experiments. We compare against a MAP solution with RDP computed with BSREM, and against DIP with RDP, similar to Singh et al. (2023). In Fig. 5 we show sensitivity plots for the FDG tracer, and in Fig. 7 we plot the axial, coronal and sagittal slices centred on the lesion location. Additionally, sensitivity curves for the Amyloid tracer are given in Fig. 6, and the associated reconstructions are available in Appendix B, see Fig. 16. In Table 6, we present the computing time of 3D PET-DDS+RDPz versus the number of subsets. The results indicate that the computing time decreases with the number of subsets while having minimal effect on the evaluation metrics, indicating a preference for more subsets for better computational efficiency in practice.

Table 6: 3D PET-DDS+RDPz computing time with different numbers of subsets. Quality metrics are computed on the first realisation of the FDG tracer measurements using 3D PET-DDS+RDPz with $\lambda = 158.0$ and $\beta = 21.9$. Best values are highlighted in grey.
| Number of subsets | 1 | 4 | 7 | 14 | 28 | 42 |
|---|---|---|---|---|---|---|
| Reconstruction time (min) | 47.8 | 13.6 | 8.6 | 5.1 | 3.4 | 2.8 |
| PSNR | 25.91 | 25.90 | 25.89 | 25.84 | 25.72 | 25.57 |
| SSIM | 0.927 | 0.927 | 0.927 | 0.925 | 0.922 | 0.919 |
| CRC | 0.990 | 0.990 | 0.990 | 0.985 | 0.987 | 0.988 |

The FDG tracer sensitivity plot in Fig. 5 shows that adding RDPz into PET-DDS improves the SSIM and CRC metrics, while classical RDP provides the highest PSNR values. Since PSNR is computed using the mean squared reconstruction error, the metric is biased toward blurrier reconstructions. This can be observed in the qualitative images in Fig. 7, where RDP gives high PSNR values while the image insets show excessive blurring of the lesion. PET-DDS without RDPz performs worse than with RDPz, since the score model only acts on axial slices and, without RDPz, consistency in the $z$-direction is only ensured through data consistency. Qualitatively, this can be observed in Fig. 7, where coronal and sagittal slices display discontinuities in the $z$-direction whereas the axial slice is smoother. DIP reconstructions give improvements in SSIM and CRC compared to the classical RDP results, but fail to improve PSNR. Results with the OOD Amyloid tracer show milder improvements with PET-DDS, with trends similar to those seen with the FDG tracer.

PET reconstructions by the proposed methods in Fig. 7 closely match the ground truth. However, unlike the RDP and DIP+RDP reconstructions, the results of PET-DDS and PET-DDS+RDPz exhibit regions with much higher intensity values than their neighbouring areas. This inhomogeneity is attributed to the presence of noise in the data, and PET-DDS without RDPz is expected to be more inhomogeneous for non-axial slices, because the score model is only enforced axially.

Figure 5: Results for 3D reconstruction using the FDG tracer for different penalty values. PET-DDS+RDPz uses $\beta = 21.9$ and DIP+RDP uses $\beta = 0.1$. Standard deviation is across reconstructions from different realisations of measurements. For DIP, the points correspond to various numbers of optimisation steps. For the other methods, the points represent different values of the parameter $\lambda$; the markers ✙ and ◆ denote the smallest and largest numerical value of $\lambda$, respectively.
Table 7: Results using the best hyperparameters for each method for 3D BrainWeb data with FDG and Amyloid tracers. The penalty strength used for each SGM method is denoted by $\lambda$. The best performing method is highlighted in grey.
| Tracer | Method | PSNR, $\lambda$ | SSIM, $\lambda$ | CRC, $\lambda$ |
|---|---|---|---|---|
| FDG | RDP | 25.74, 1.81 | 0.911, 2.77 | 0.994, 0.5 |
| | DIP+RDP | 25.26, 9,800 | 0.917, 10,800 | 0.966, 9,500 |
| | PET-DDS | 24.83, 398 | 0.910, 398 | 1.01, 158 |
| | PET-DDS+RDPz | 25.70, 158 | 0.922, 63.1 | 0.996, 158 |
| Amyloid | RDP | 24.15, 2.77 | 0.898, 1.81 | 0.996, 0.5 |
| | DIP+RDP | 24.10, 10,200 | 0.894, 10,800 | 0.964, 9,500 |
| | PET-DDS | 23.08, 1000 | 0.890, 398 | 1.009, 10 |
| | PET-DDS+RDPz | 24.15, 398 | 0.906, 158 | 0.999, 10 |
Figure 6: Results for 3D reconstruction using the Amyloid tracer for different penalty values. PET-DDS+RDPz uses $\beta = 21.9$ and DIP+RDP uses $\beta = 0.1$. Standard deviation is across reconstructions from different realisations of measurements. For DIP, the points correspond to various numbers of optimisation steps. For the other methods, the points represent different values of the parameter $\lambda$; the markers ✙ and ◆ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 7: 3D reconstruction for the different methods with FDG tracer, and metrics computed on the inset lesion.

6 Conclusion

In this work we adapt SGMs for PET image reconstruction by incorporating PET-specific constraints, e.g. Poisson noise and non-negativity, into several popular sampling techniques. We further introduce a measurement-based normalisation technique to improve generalisability to different intensity values by stabilising the dynamic range encountered by the score model. In future work, reflected SGMs, recently proposed by Lou and Ermon (2023), could be leveraged to introduce non-negativity into the sampling procedure in a more principled manner. This work provides a first investigation of the generalisation capabilities by training the score model on patient-realistic slices without lesions and testing on slices with lesions. However, further work is needed to comprehensively evaluate the generalisation performance on more diverse datasets of in-vivo data and to investigate the biases of SGMs, which is vitally important for clinical adoption. The proposed SGM sampling methods can produce multiple samples from the posterior $p(\mathbf{x}|\mathbf{y})$, which could be used for empirical uncertainty estimation; this is left for future work. This work also proposes guided SGM reconstruction with an additional MR guidance image using CFG. The preliminary results are promising and further validation is required; a clinically pertinent direction is to investigate robustness to misregistration of the MR image. Furthermore, guidance could be extended to a joint PET-MRI reconstruction. Recently, Levac et al. (2023) used similar ideas for a joint reconstruction of multi-contrast MR images.


Acknowledgments

I.R.D. Singh and R. Barbano are supported by the EPSRC-funded UCL Centre for Doctoral Training in Intelligent, Integrated Imaging in Healthcare (i4Health) (EP/S021930/1) and the Department of Health’s NIHR-funded Biomedical Research Centre at University College London Hospitals. A. Denker acknowledges the support by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - Project number 281474342/GRK2224/2. Ž. Kereta was supported by the UK EPSRC grant EP/X010740/1. B. Jin and S. Arridge are supported by the UK EPSRC EP/V026259/1, and B. Jin is also supported by a start-up fund from The Chinese University of Hong Kong. Software used in this project is partially maintained by CCP SyneRBI (EPSRC EP/T026693/1). P. Maass acknowledges support by DFG-NSFC project M-0187 of the Sino-German Center mobility programme. The authors thank Georg Schramm for help with ParallelProj.


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

We declare we don’t have conflicts of interest.

References

  • Adler and Öktem (2018) Jonas Adler and Ozan Öktem. Learned primal-dual reconstruction. IEEE Transactions on Medical Imaging, 37(6):1322–1332, 2018.
  • Ahn and Fessler (2003) Sangtae Ahn and Jeffrey A Fessler. A globally convergent image reconstruction for emission tomography using relaxed ordered subsets algorithms. IEEE Transactions on Medical Imaging, 22(5):623–626, 2003.
  • Anderson (1982) Brian D O Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  • Antun et al. (2020) Vegard Antun, Francesco Renna, Clarice Poon, Ben Adcock, and Anders C Hansen. On instabilities of deep learning in image reconstruction and the potential costs of ai. Proceedings of the National Academy of Sciences, 117(48):30088–30095, 2020.
  • Aubert-Broche et al. (2006) Bérengère Aubert-Broche, M Griffin, G Bruce Pike, Alan C Evans, and D Louis Collins. Twenty new digital brain phantoms for creation of validation image data bases. IEEE Transactions on Medical Imaging, 25(11):1410–1416, 2006.
  • Bai et al. (2013) Bing Bai, Quanzheng Li, and Richard M Leahy. Magnetic resonance-guided positron emission tomography image reconstruction. In Seminars in Nuclear Medicine, volume 43, pages 30–44. Elsevier, 2013.
  • Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  • Bertero et al. (2009) M. Bertero, P. Boccacci, G. Desiderà, and G. Vicidomini. Image deblurring with Poisson data: from cells to galaxies. Inverse Problems, 25(12):123006, 2009. URL https://doi.org/10.1088/0266-5611/25/12/123006.
  • Chung and Ye (2022) Hyungjin Chung and Jong Chul Ye. Score-based diffusion models for accelerated MRI. Medical Image Analysis, 80:102479, 2022.
  • Chung et al. (2023a) Hyungjin Chung, Jeongsol Kim, Michael T McCann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems. In Proceedings of the Eleventh International Conference on Learning Representations. ICLR, 2023a.
  • Chung et al. (2023b) Hyungjin Chung, Suhyeon Lee, and Jong Chul Ye. Fast diffusion sampler for inverse problems by geometric decomposition. CoRR, abs/2303.05754, 2023b.
  • Chung et al. (2023c) Hyungjin Chung, Dohoon Ryu, Michael T McCann, Marc L Klasky, and Jong Chul Ye. Solving 3D inverse problems using pre-trained 2D diffusion models. In Proceedings of the Thirty Seventh IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22542–22551. CVPR, 2023c.
  • Collins et al. (1998) D Louis Collins, Alex P Zijdenbos, Vasken Kollokian, John G Sled, Noor Jehan Kabani, Colin J Holmes, and Alan C Evans. Design and construction of a realistic digital brain phantom. IEEE Transactions on Medical Imaging, 17(3):463–468, 1998.
  • Darestani et al. (2022) Mohammad Zalbagi Darestani, Jiayu Liu, and Reinhard Heckel. Test-time training can close the natural distribution shift performance gap in deep learning based compressed sensing. In International Conference on Machine Learning, pages 4754–4776. PMLR, 2022.
  • Dhariwal and Nichol (2021) Prafulla Dhariwal and Alexander Q Nichol. Diffusion models beat GANs on image synthesis. In Proceedings of the Thirty Fourth Conference on Advances in Neural Information Processing Systems, pages 8780–8794. NeurIPS, 2021.
  • Dimakis (2022) Alexandros G Dimakis. Deep generative models and inverse problems. In Mathematical Aspects of Deep Learning, page 400–421. Cambridge University Press, Cambridge, UK, 2022.
  • Efron (2011) Bradley Efron. Tweedie’s formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.
  • Ehrhardt (2021) Matthias J Ehrhardt. Multi-modality imaging with structure-promoting regularizers. In Ke Chen, Carola-Bibiane Schönlieb, Xue-Cheng Tai, and Laurent Younces, editors, Handbook of Mathematical Models and Algorithms in Computer Vision and Imaging: Mathematical Imaging and Vision, pages 1–38. Springer International Publishing, Cham, Switzerland, 2021. ISBN 978-3-030-03009-4.
  • Ehrhardt et al. (2016) Matthias J Ehrhardt, Pawel Markiewicz, Maria Liljeroth, Anna Barnes, Ville Kolehmainen, John S Duncan, Luis Pizarro, David Atkinson, Brian F Hutton, Sébastien Ourselin, Kris Thielemans, and Simon R Arridge. PET reconstruction with an anatomical MRI prior using parallel level sets. IEEE Transactions on Medical Imaging, 35(9):2189–2199, 2016.
  • Engl et al. (1996) Heinz W Engl, Martin Hanke, and Andreas Neubauer. Regularization of inverse problems. Mathematics and its Applications. Kluwer Academic Publishers Group, Dordrecht, 1996.
  • Gong et al. (2019) Kuang Gong, Ciprian Catana, Jinyi Qi, and Quanzheng Li. PET image reconstruction using deep image prior. IEEE Transactions on Medical Imaging, 38(7):1655–1665, 2019.
  • Gong et al. (2022) Kuang Gong, Keith A. Johnson, Georges El Fakhri, Quanzheng Li, and Tinsu Pan. PET image denoising based on denoising diffusion probabilistic models. CoRR, abs/2209.06167, 2022.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C Courville, and Yoshua Bengio. Generative adversarial nets. In Proceedings of the Second Conference on Advances in Neural Information Processing Systems, pages 2672–2680. NeurIPS, 2014.
  • Guazzo and Colarieti-Tosti (2021) Alessandro Guazzo and Massimiliano Colarieti-Tosti. Learned primal dual reconstruction for PET. Journal of Imaging, 7(12), 2021.
  • Herman and Meyer (1993) Gabor T Herman and Lorraine B Meyer. Algebraic reconstruction techniques can be made computationally efficient (positron emission tomography application). IEEE Transactions on Medical Imaging, 12(3):600–609, 1993.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. CoRR, abs/2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Proceedings of the Thirty Third Conference on Advances in Neural Information Processing Systems. NeurIPS, 2020.
  • Hohage and Werner (2016) Thorsten Hohage and Frank Werner. Inverse problems with Poisson data: statistical regularization theory, applications and algorithms. Inverse Problems, 32(9):093001, 2016. URL https://doi.org/10.1088/0266-5611/32/9/093001.
  • Hudson and Larkin (1994) H Malcolm Hudson and Richard S Larkin. Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging, 13(4):601–609, 1994.
  • Ito and Jin (2015) Kazufumi Ito and Bangti Jin. Inverse problems: Tikhonov theory and algorithms. World Scientific, Hackensack, NJ, 2015. ISBN 978-981-4596-19-0.
  • Jalal et al. (2021) Ajil Jalal, Marius Arvinte, Giannis Daras, Eric Price, Alexandros G. Dimakis, and Jonathan I. Tamir. Robust compressed sensing MRI with deep generative priors. In Proceedings of the Thirty Fourth Conference on Advances in Neural Information Processing Systems, pages 14938–14954. NeurIPS, 2021.
  • Jin et al. (2017) Kyong H Jin, Michael T McCann, Emmanuel Froustey, and Michael Unser. Deep convolutional neural network for inverse problems in imaging. IEEE Transactions on Image Processing, 26(9):4509–4522, 2017.
  • Kaplan and Zhu (2019) Sydney Kaplan and Yang-Ming Zhu. Full-dose PET image estimation from low-dose PET image using deep learning: a pilot study. Journal of Digital Imaging, 32(5):773–778, 2019.
  • Karlberg et al. (2016) Anna M Karlberg, Oddbjørn Sæther, Live Eikenes, and Pål Erik Goa. Quantitative comparison of PET performance—siemens biograph mCT and mMR. European Journal of Nuclear Medicine and Molecular Imaging: Physics, 3(1), 2016.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In Proceedings of the Second International Conference on Learning Representations. ICLR, 2014.
  • Kobler and Pock (2023) Erich Kobler and Thomas Pock. Learning gradually non-convex image priors using score matching. CoRR, abs/2302.10502, 2023.
  • Leuschner et al. (2021) Johannes Leuschner, Maximilian Schmidt, Daniel Otero Baguer, David Erzmann, and Mateus Baltazar. DIVal library, 2021. URL https://doi.org/10.5281/zenodo.4428220.
  • Levac et al. (2023) Brett Levac, Ajil Jalal, Kannan Ramchandran, and Jonathan I Tamir. MRI reconstruction with side information using diffusion models. arXiv, 2023. URL https://arxiv.org/abs/2303.14795.
  • Li et al. (2022) Xiang Li, John Thickstun, Ishaan Gulrajani, Percy Liang, and Tatsunori B Hashimoto. Diffusion-LM improves controllable text generation. In Proceedings to the Thirty Fifth Conference on Advances in Neural Information Processing Systems, pages 4328–4343. NeurIPS, 2022.
  • Lou and Ermon (2023) Aaron Lou and Stefano Ermon. Reflected diffusion models. In Proceedings of the Fortieth International Conference on Machine Learning, volume 202, pages 22675–22701. PMLR, 2023.
  • Mehranian and Reader (2021) Abolfazl Mehranian and Andrew J Reader. Model-based deep learning PET image reconstruction using forward–backward splitting expectation–maximization. IEEE Transactions on Radiation and Plasma Medical Sciences, 5(1):54–64, 2021.
  • Nuyts et al. (2002) Johan Nuyts, Dirk Bequé, Patrick Dupont, and Luc Mortelmans. A concave prior penalizing relative differences for maximum-a-posteriori reconstruction in emission tomography. IEEE Transactions on Nuclear Science, 49(1):56–60, 2002.
  • Ote et al. (2023) Kibo Ote, Fumio Hashimoto, Yuya Onishi, Takashi Isobe, and Yasuomi Ouchi. List-mode PET image reconstruction using deep image prior. IEEE Transactions on Medical Imaging, 42(6):1822–1834, 2023.
  • Ovtchinnikov et al. (2020) Evgueni Ovtchinnikov, Richard Brown, Christoph Kolbitsch, Edoardo Pasca, Casper O da Costa-Luis, Ashley Gillman, Benjamin A Thomas, Nikos Efthimiou, Johannes Mayer, Palak Wadhwa, Matthias J Ehrhardt, Sam Ellis, Jakob S Jørgensen, Julian C Matthews, Claudia Prieto, Andrew J Reader, Charalampos Tsoumpas, Martin J Turner, and Kris Thielemans. SIRF: synergistic image reconstruction framework. Computer Physics Communications, 249:107087, 2020.
  • Pain et al. (2022) Cameron D Pain, Gary F Egan, and Zhaolin Chen. Deep learning-based image reconstruction and post-processing methods in positron emission tomography for low-dose imaging and resolution enhancement. European Journal of Nuclear Medicine and Molecular Imaging, 49(9):3098–3118, 2022.
  • Pierro and Yamagishi (2001) Alvaro R De Pierro and Michel E B Yamagishi. Fast EM-like methods for maximum ‘a posteriori’ estimates in emission tomography. IEEE Transactions on Medical Imaging, 20(4):280–288, 2001.
  • Pinaya et al. (2022) Walter H L Pinaya, Petru-Daniel Tudosiu, Jessica Dafflon, Pedro F Da Costa, Virginia Fernandez, Parashkev Nachev, Sébastien Ourselin, and M Jorge Cardoso. Brain imaging generation with latent diffusion models. In Proceedings of the Twenty Fifth Conference on Medical Image Computing and Computer Assisted Intervention, pages 117–126. MICCAI, 2022.
  • Qi and Leahy (2006) Jinyi Qi and Richard M Leahy. Iterative reconstruction techniques in emission computed tomography. Physics in Medicine & Biology, 51(15):R541, 2006.
  • Ramzi et al. (2020) Zaccharie Ramzi, Benjamin Remy, François Lanusse, Jean-Luc Starck, and Philippe Ciuciu. Denoising score-matching for uncertainty quantification in inverse problems. CoRR, abs/2011.08698, 2020.
  • Ravula et al. (2023) Sriram Ravula, Brett Levac, Ajil Jalal, Jonathan I Tamir, and Alexandros G Dimakis. Optimizing sampling patterns for compressed sensing MRI with diffusion generative models. CoRR, abs/2306.03284, 2023.
  • Reader et al. (2021) Andrew J Reader, Guillaume Corda, Abolfazl Mehranian, Casper da Costa-Luis, Sam Ellis, and Julia A Schnabel. Deep learning for PET image reconstruction. IEEE Transactions on Radiation and Plasma Medical Sciences, 5(1):1–25, 2021.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Eighteenth Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. MICCAI, 2015.
  • Rudin et al. (1992) Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: Nonlinear Phenomena, 60(1-4):259–268, 1992.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. In Proceedings of the Thirty Fifth Conference on Advances in Neural Information Processing Systems, pages 36479–36494. NeurIPS, 2022.
  • Schramm (2021) Georg Schramm. Simulated brainweb PET/MR data sets for denoising and deblurring, 2021. URL https://zenodo.org/record/4897350.
  • Schramm (2022) Georg Schramm. PARALLELPROJ – an open-source framework for fast calculation of projections in tomography, 2022. URL https://arxiv.org/abs/2212.12519.
  • Shepp and Vardi (1982) Lawrence A Shepp and Yehuda Vardi. Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging, 1(2):113–122, 1982.
  • Singh et al. (2023) Imraj RD Singh, Riccardo Barbano, Željko Kereta, Bangti Jin, Kris Thielemans, and Simon Arridge. 3D PET-DIP reconstruction with relative difference prior using a SIRF-based objective. In Proceedings to the Seventeenth International Meeting on Fully Three-Dimensional Image Reconstruction in Radiology and Nuclear Medicine. Fully3D, 2023.
  • Singh et al. (in-press) Imraj RD Singh, Alexander Denker, Bangti Jin, Kris Thielemans, and Simon Arridge. Investigating intensity normalisation for PET reconstruction with supervised deep learning. In Proceedings of 2023 IEEE Nuclear Science Symposium and Medical Imaging Conference, together with the International Symposium on Room-Temperature Semiconductor X-Ray and Gamma-Ray Detectors, in-press. URL https://discovery.ucl.ac.uk/id/eprint/10181467.
  • Sohl-Dickstein et al. (2015) Jascha Sohl-Dickstein, Eric A Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the Thirty Second International Conference on Machine Learning, pages 2256–2265, 2015.
  • Somayajula et al. (2011) Sangeetha Somayajula, Christos Panagiotou, Anand Rangarajan, Quanzheng Li, Simon R Arridge, and Richard M Leahy. PET image reconstruction using information theoretic anatomical priors. IEEE Transactions on Medical Imaging, 30(3):537–549, 2011.
  • Song et al. (2021a) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In Proceedings of the Ninth International Conference on Learning Representations. ICLR, 2021a.
  • Song and Ermon (2019) Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Proceedings of the Thirty Second Conference on Advances in Neural Information Processing Systems, pages 11895–11907. NeurIPS, 2019.
  • Song et al. (2021b) Yang Song, Conor Durkan, Iain Murray, and Stefano Ermon. Maximum likelihood training of score-based diffusion models. In Proceedings of the Thirty Fourth Conference on Advances in Neural Information Processing Systems, pages 1415–1428. NeurIPS, 2021b.
  • Song et al. (2021c) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In Proceedings of the Ninth International Conference on Learning Representations. ICLR, 2021c.
  • Song et al. (2022) Yang Song, Liyue Shen, Lei Xing, and Stefano Ermon. Solving inverse problems in medical imaging with score-based generative models. In Proceedings of the Tenth International Conference on Learning Representations. ICLR, 2022.
  • Särkkä and Solin (2019) Simo Särkkä and Arno Solin. Applied Stochastic Differential Equations. Institute of Mathematical Statistics Textbooks. Cambridge University Press, Cambridge, 2019. ISBN 978-131-6649-46-6.
  • Thielemans et al. (2012) Kris Thielemans, Charalampos Tsoumpas, Sanida Mustafovic, Tobias Beisel, Pablo Aguiar, Nikolaos Dikaios, and Matthew W Jacobson. STIR: software for tomographic image reconstruction release 2. Physics in Medicine & Biology, 57(4):867, 2012.
  • Tong et al. (2010) Shan Tong, Adam M Alessio, and Paul E Kinahan. Noise and signal properties in PSF-based fully 3D PET image reconstruction: an experimental evaluation. Physics in Medicine & Biology, 55(5):1453, 2010.
  • Tran et al. (2020) Dustin Tran, Jasper Snoek, and Balaji Lakshminarayanan. Practical uncertainty estimation and out-of-distribution robustness in deep learning. NeurIPS Tutorial, Google Brain, 2020. URL https://neurips.cc/media/neurips-2020/Slides/16649.pdf.
  • Ulyanov et al. (2018) Dmitry Ulyanov, Andrea Vedaldi, and Victor S Lempitsky. Deep image prior. In Proceedings of the Thirty First IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9446–9454. CVPR, 2018.
  • Vincent (2011) Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Xiao et al. (2022) Zhisheng Xiao, Karsten Kreis, and Arash Vahdat. Tackling the generative learning trilemma with denoising diffusion GANs. In Proceedings of the Tenth International Conference on Learning Representations. ICLR, 2022.
  • Zhu et al. (2023) Yuanzhi Zhu, Kai Zhang, Jingyun Liang, Jiezhang Cao, Bihan Wen, Radu Timofte, and Luc Van Gool. Denoising diffusion models for plug-and-play image restoration. CoRR, abs/2305.08995, 2023.

A Appendix

A.1 Training of Score-based Models

Forward SDE

In our experiments, we make use of the variance preserving SDE (Ho et al., 2020)

$$d\mathbf{x}_{t} = -\frac{\beta(t)}{2}\mathbf{x}_{t}\,dt + \sqrt{\beta(t)}\,d\mathbf{w}, \qquad (37)$$

where we employ the linear schedule $\beta(t) = \beta_{\text{min}} + t(\beta_{\text{max}} - \beta_{\text{min}})$ with $\beta_{\text{min}} = 0.1$, $\beta_{\text{max}} = 10$ and terminal time $T = 1$. The coefficients were chosen such that the terminal distribution approximates a Gaussian, i.e. $p_1(\mathbf{x}) \approx \mathcal{N}(0, I)$. We also tested the Variance Exploding (VE) SDE (Song et al., 2021c); it was found that the VE-SDE was more unstable than the VP-SDE for PET image reconstruction. The transition kernel for the variance preserving SDE is Gaussian, i.e. $p_t(\mathbf{x}_t | \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \gamma_t \mathbf{x}_0, \nu_t^2 I)$, with coefficients

$$\gamma_{t} = \exp\left(-\frac{1}{2}\int_{0}^{t}\beta(s)\,ds\right), \qquad \nu_{t}^{2} = 1 - \exp\left(-\int_{0}^{t}\beta(s)\,ds\right). \qquad (38)$$

Using this closed form expression for the transition kernel, the denoising score matching loss can be rewritten as

$$L_{\text{DSM}}(\theta) = \mathbb{E}_{t\sim U[0,1]}\,\mathbb{E}_{\mathbf{x}_{0}\sim\pi}\,\mathbb{E}_{\mathbf{z}\sim\mathcal{N}(0,I)}\left[\omega_{t}\left\|s_{\theta}(\mathbf{x}_{t}, t) + \frac{1}{\nu_{t}}\mathbf{z}\right\|_{2}^{2}\right], \qquad (39)$$

with $\mathbf{x}_{t} = \gamma_{t}\mathbf{x}_{0} + \nu_{t}\mathbf{z}$. The weighting $\omega_{t}$ is chosen as $\omega_{t} = \nu_{t}^{2}$ to approximate maximum likelihood training (Song et al., 2021b).
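The sketch below illustrates one evaluation of this training objective in PyTorch, combining the transition-kernel coefficients of Eq. (38) with the loss of Eq. (39) and the weighting $\omega_t = \nu_t^2$; the score-model interface and the lower bound on $t$ are illustrative assumptions.

```python
import torch

def vp_coefficients(t, beta_min=0.1, beta_max=10.0):
    # gamma_t and nu_t of the VP-SDE transition kernel, Eq. (38), for the
    # linear schedule beta(t) = beta_min + t * (beta_max - beta_min).
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    gamma_t = torch.exp(-0.5 * integral)
    nu_t = torch.sqrt(1.0 - torch.exp(-integral))
    return gamma_t, nu_t

def dsm_loss(score_model, x0, t_min=1e-3):
    # One evaluation of the DSM objective of Eq. (39) with omega_t = nu_t^2.
    # t is sampled away from 0 to avoid dividing by nu_t ~ 0.
    t = t_min + (1.0 - t_min) * torch.rand(x0.shape[0], device=x0.device)
    gamma_t, nu_t = vp_coefficients(t)
    shape = (-1,) + (1,) * (x0.dim() - 1)
    z = torch.randn_like(x0)
    x_t = gamma_t.view(shape) * x0 + nu_t.view(shape) * z   # sample p_t(x_t | x_0)
    residual = score_model(x_t, t) + z / nu_t.view(shape)
    return torch.mean(nu_t.view(shape) ** 2 * residual ** 2)
```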

Model Architecture

We use the architecture proposed by Dhariwal and Nichol (2021) (available at https://github.com/openai/guided-diffusion). The architecture is based on the popular U-Net architecture (Ronneberger et al., 2015), consisting of an encoder implemented as a stack of residual blocks and downsampling operations and a decoder of residual blocks and upsampling operations. At the lowest resolution ($8\times 8$), additional global attention layers are used. To incorporate the timestep into each residual block, the authors use adaptive group normalisation (AdaGN) layers, defined as $\text{AdaGN}(h, e) = e_s\,\text{GroupNorm}(h) + e_b$, where $h$ are intermediate features and $e = [e_s, e_b]$ is the encoded time step. The specific implementation and the choice of our hyperparameters can be found in our GitHub repository. For the MR-guided model we supply the clean MR image as an additional channel to the input of the network.
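A minimal PyTorch sketch of such an AdaGN layer is given below; it is a simplification of the reference implementation linked above, and the channel, embedding and group sizes are placeholders.

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    # AdaGN(h, e) = e_s * GroupNorm(h) + e_b, with the scale/shift pair
    # (e_s, e_b) predicted from the encoded time step.
    def __init__(self, channels, emb_dim, groups=32):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels)
        self.proj = nn.Linear(emb_dim, 2 * channels)  # predicts [e_s, e_b]

    def forward(self, h, emb):
        e_s, e_b = self.proj(emb).chunk(2, dim=-1)    # each of shape (B, C)
        return e_s[:, :, None, None] * self.norm(h) + e_b[:, :, None, None]
```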

A.2 Experimental Details

The sampling methods presented in Section 3 use different penalty strengths, either to scale the likelihood term for PET-Naive and PET-DPS or to set the strength of the additional Tikhonov regularisation for PET-DDS. For PET-Naive it is recommended to choose $\lambda_t^{\text{naive}}$ such that the penalty is zero at the start of sampling and increases as $t \to 0$ (Jalal et al., 2021). We use $\lambda_t = \lambda(1-t)$ in all our experiments. For the PET-DPS approach, Chung et al. (2023a) define the sampling iteration as

$$\begin{aligned}
\tilde{\mathbf{x}}_{t_{k-1}} &= \mathbf{x}_{t_{k}} + \left[\mathbf{f}(\mathbf{x}_{t_{k}}, t_{k}) - g(t_{k})^{2}\, s_{\theta}(\mathbf{x}_{t_{k}}, t_{k})\right]\Delta t + g(t_{k})\sqrt{|\Delta t|}\,\mathbf{z}, \quad \mathbf{z}\sim\mathcal{N}(0, I),\\
\mathbf{x}_{t_{k-1}} &= \tilde{\mathbf{x}}_{t_{k-1}} - \lambda_{t_{k}}^{\text{DPS}}\,\nabla_{\mathbf{x}} L(\mathbf{y}\,|\,\hat{\mathbf{x}}_{0}(\mathbf{x}_{t_{k}})).
\end{aligned} \qquad (40)$$

This is equivalent to the classical Euler–Maruyama scheme when $\lambda_{t_k}^{\text{DPS}}$ is chosen such that it incorporates the step size $\Delta t$ and the diffusion coefficient $g(t_k)^2$. We follow Chung et al. (2023a) and define $\lambda_t^{\text{DPS}} = \lambda / D_{KL}(A\hat{\mathbf{x}}_0 \,\|\, \mathbf{y})$. For PET-DDS a constant penalty $\lambda^{\text{DDS}}$, without time dependency, is used. Heuristically, the number of iterations used for the data-consistency projection was adjusted such that the results with $\lambda^{\text{DDS}} = 0$ overfit to noise; the penalty strength $\lambda^{\text{DDS}}$ was then increased to regularise the reconstruction. In 2D, the number of projection steps for PET-DDS was set to 4 for noise level 2.5 and 15 for noise level 10. In 3D, the number of projection steps for PET-DDS was set to 5 for both tracers. For guided reconstruction, 10 steps for noise level 2.5 and 20 steps for noise level 10 were used for PET-DDS.
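The penalty schedules described above can be summarised in the following sketch; the tensor interfaces and function names are assumptions, and the KL form used to normalise the PET-DPS penalty follows Eq. (46).

```python
import torch

def lambda_naive(t, lam):
    # PET-Naive penalty schedule: zero at the start of sampling (t = T = 1)
    # and increasing towards t = 0, lambda_t = lambda * (1 - t).
    return lam * (1.0 - t)

def lambda_dps(lam, y_est, y, eps=1e-8):
    # PET-DPS penalty, lambda_t = lambda / D_KL(A x0_hat || y): the fixed
    # strength lambda is normalised by the current data discrepancy, here
    # measured with the same KL form as Eq. (46). y_est denotes the
    # forward-projected denoised estimate (plus background, if modelled).
    kl = torch.sum(y_est * torch.log((y_est + eps) / (y + eps)) - y_est + y)
    return lam / (kl + eps)

# PET-DDS simply uses a constant, time-independent penalty lambda_DDS.
```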

A.3 Baseline Methods

Classical Methods

Relative Difference Prior (RDP) is a common penalty for PET reconstruction (Nuyts et al., 2002), defined by

$$R(\mathbf{x}) = -\sum_{j=1}^{n}\sum_{k\in N_{j}} \frac{(x_{j}-x_{k})^{2}}{x_{j}+x_{k}+\xi\,|x_{j}-x_{k}|},$$

where $N_j$ is a pre-defined neighbourhood around $x_j$, typically $3\times 3$ in 2D or $3\times 3\times 3$ in 3D. $R_z(\mathbf{x})$ is a variant of RDP whereby the neighbourhood is defined along the $z$-direction in 3D, i.e. $3\times 1\times 1$. A Neumann boundary condition was used where neighbourhoods extended outside the domain. The tunable parameter $\xi > 0$ controls the degree of edge preservation ($\xi = 1$, in line with clinical practice), and the gradient is given by

$$\frac{\partial R(\mathbf{x})}{\partial x_{j}} = \sum_{k\in N_{j}} -\frac{(r_{jk}-1)\left(\xi\,|r_{jk}-1|+r_{jk}+3\right)}{\left(r_{jk}+1+\xi\,|r_{jk}-1|\right)^{2}}, \quad \text{with } r_{jk} := \frac{x_{j}}{x_{k}}. \qquad (41)$$

The penalisation is scale-invariant since the gradient is computed using the ratio of voxel values $r_{jk}$. This partially overcomes the issue of the wide dynamic range observed in emission tomography images. For the BSREM algorithm, the convergence criterion was based on the change of voxel values between iterates; specifically, the algorithm was terminated when the mean change across non-zero voxel values was less than 0.01%. The relaxation coefficient was set to $\zeta = 0.1$.
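For illustration, a simplified NumPy sketch of $R(\mathbf{x})$ is given below, restricted to the four axis-aligned neighbours of a 2D image (rather than the full $3\times 3$ neighbourhood) with replicate (Neumann) boundaries.

```python
import numpy as np

def rdp(x, xi=1.0, eps=1e-12):
    # R(x) summed over the 4 axis-aligned neighbours of each pixel of a 2D
    # image, with replicate (Neumann) boundary handling.
    total = 0.0
    for axis in (0, 1):
        for shift in (-1, 1):
            neighbour_idx = np.clip(np.arange(x.shape[axis]) + shift,
                                    0, x.shape[axis] - 1)
            xk = np.take(x, neighbour_idx, axis=axis)
            diff = x - xk
            total -= np.sum(diff ** 2 / (x + xk + xi * np.abs(diff) + eps))
    return total
```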

PET Image Denoising with SGM

In PET image denoising, the goal is to sample from the posterior $p(\mathbf{x}|\mathbf{x}_{\text{noisy}})$ of the true image $\mathbf{x}$ given an initial (low-count) reconstruction $\mathbf{x}_{\text{noisy}}$. This differs from PET reconstruction, where the goal is to sample from the posterior $p^{\text{post}}(\mathbf{x}|\mathbf{y})$ conditioned on the measurements $\mathbf{y}$. In this framework the denoising likelihood is modelled as Gaussian, i.e.

$$p(\mathbf{x}_{\text{noisy}}\,|\,\mathbf{x}) = \mathcal{N}(\mathbf{x}_{\text{noisy}}; \mathbf{x}, \sigma_{d}^{2} I), \qquad (42)$$

with the noise level $\sigma_d$ to be specified. Using the Naive approximation, we get the following reverse SDE for the PET denoising likelihood

$$d\mathbf{x}_{t} = \left[\mathbf{f}(\mathbf{x}_{t}, t) - g(t)^{2}\left(s_{\theta}(\mathbf{x}_{t}, t) - \tfrac{1}{\sigma_{d}^{2}}(\mathbf{x}_{\text{noisy}} - \mathbf{x}_{t})\right)\right]dt + g(t)\,d\bar{\mathbf{w}}_{t}. \qquad (43)$$

In our implementation we estimate the initial reconstruction using OSEM with 34 subsets and iterations (i.e. 1 epoch). The same score model $s_\theta(\mathbf{x}_t, t)$ is used for both PET denoising and reconstruction. The noise level $\sigma_d$ is chosen based on a held-out evaluation dataset.

Supervised Learning

We use two popular supervised learning techniques: post-processing and learned iterative methods. For the post-processing method we used a variant of the FBPConvNet (Jin et al., 2017), modified for PET reconstruction: the input to the FBPConvNet was changed to an OSEM reconstruction with 34 subsets and iterations; this variant is denoted by PET-UNet. For the learned iterative method, we adopt Learned Primal Dual (LPD) (Adler and Öktem, 2018), referred to as PET-LPD. For PET-LPD we use the same OSEM reconstruction as initialisation for the primal channels and include the affine forward model with sample-specific attenuation maps. Note that these sample-specific factors were not included in previous implementations of learned iterative methods for PET image reconstruction (Guazzo and Colarieti-Tosti, 2021). Three primal and dual unrolled iterations were used. Both networks were implemented using DIVal (Leuschner et al., 2021) with only minimal changes to the architecture; PET-UNet was a U-Net with 1,783,249 parameters, and PET-LPD used a block of convolutional filters for each primal and dual network with a total of 132,300 parameters. Both networks were trained using the dataset in Section 4.1 without lesions and with noise levels of 5, 10, and 50. The dataset was split into training and evaluation sets, and training was terminated when over-fitting was observed. Additionally, data-corrected mean normalisation was included to promote generalisability between noise levels. The code for these supervised learning models is publicly available at https://github.com/Imraj-Singh/pet_supervised_normalisation; for further details see Singh et al. (in-press).

Deep Image Prior

The Deep Image Prior (DIP) (Ulyanov et al., 2018) is a popular framework for unsupervised image reconstruction, relying only on a single measurement. A common problem with the DIP is its tendency to over-fit to noise, so some regularisation has to be used. We include the RDP in the objective function to alleviate the need for early stopping and to prevent over-fitting to noise. The architecture used was a three-scale U-Net (Ronneberger et al., 2015) with 1,606,899 parameters, with a rectified linear unit on the output to ensure non-negativity. This architecture is minimally changed from previous applications of the DIP to PET (Gong et al., 2019; Ote et al., 2023; Singh et al., 2023). DIP results are computed on reconstructions along the optimisation trajectory, every 100 iterations from iteration 6,600 to 11,600.
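A minimal sketch of such a regularised DIP objective, assuming a Poisson data term with an affine forward model; `net`, `A`, `b`, `rdp` and the hyper-parameters are placeholders, not the paper's exact choices:

```python
import torch

def dip_rdp_reconstruction(net, z, y, b, A, rdp, lam, n_iter=11600, lr=1e-4):
    """Sketch of DIP with an RDP penalty added to a Poisson data term.
    `net` is a U-Net with ReLU output, `A` a linear projector, `b` the background."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(n_iter):
        opt.zero_grad()
        x = net(z)                                            # non-negative image from the network
        y_bar = A(x) + b                                      # expected measurements
        data = (y_bar - y * torch.log(y_bar + 1e-9)).sum()    # Poisson negative log-likelihood (up to a constant)
        loss = data + lam * rdp(x)                            # RDP term discourages over-fitting to noise
        loss.backward()
        opt.step()
    return net(z).detach()
```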

A.4 Evaluation metrics

In addition to the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) (Wang et al., 2004), we compute two local quality scores over a Region of Interest (ROI). For reconstructions with lesions, a Contrast Recovery Coefficient (CRC) was computed to quantify the detectability of these local features. It is computed between lesion $L$ and background $B$ ROIs, with $N_{L}$ and $N_{B}$ elements respectively (we use $L$ to denote the lesion ROI in this section only; in the main manuscript $L$ is the likelihood). Additionally, there are $R$ realisations of the measured data. Given an ROI $Z$, we include subscript indices for realisation and element, $Z_{r,k}$, where $r$ is the realisation index and $k$ is the element index. The average over the elements of the ROI is denoted $\bar{Z}_{r}=\frac{1}{N_{Z}}\sum_{k=1}^{N_{Z}}Z_{r,k}$. The CRC is defined by

$$\mathrm{CRC}:=\frac{1}{R}\sum_{r=1}^{R}\left(\frac{\bar{L}_{r}}{\bar{B}_{r}}-1\right)\Big/\left(\frac{L_{\mathrm{t}}}{B_{\mathrm{t}}}-1\right),\qquad(44)$$

where the subscript $\mathrm{t}$ denotes the ground-truth ROIs. We study the noise over realisations of the measured data using the normalised STD (also referred to as ensemble noise; see Tong et al. (2010), where it is reported to give a true estimate of the noise in the image). We define the average over realisations as $\bar{Z}_{k}=\frac{1}{R}\sum_{r=1}^{R}Z_{r,k}$. The STD is computed on background ROIs and is given by

$$\mathrm{STD}:=\frac{1}{N_{B}}\sum_{k=1}^{N_{B}}\sqrt{\frac{1}{R-1}\sum^{R}_{r=1}\frac{(B_{r,k}-\bar{B}_{k})^{2}}{\bar{B}_{k}}}.\qquad(45)$$

For reconstructions with lesions the background ROI was used; for reconstructions without lesions, the background was taken as the whole emission volume (defined on the reference images). In 2D, $R=10$ noise realisations of the acquisition data were used, and $R=5$ in 3D. To evaluate the consistency of our reconstructions with the measurements, we compute the Kullback-Leibler divergence (KLDIV)

$$\mathrm{KLDIV}:=\sum_{j=1}^{m}\bar{y}_{j}\log\left(\frac{\bar{y}_{j}}{y_{j}}\right)-\bar{y}_{j}+y_{j},\qquad(46)$$

between the measurements $\mathbf{y}$ and the estimated measurements $\bar{\mathbf{y}}=\mathbf{A}\hat{\mathbf{x}}+\bar{\mathbf{b}}$, where $\hat{\mathbf{x}}$ denotes the reconstruction.
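Putting Eqs. (44)–(46) into code, under the assumption that ROIs are stacked into `(R, N)` arrays and that all measured counts are strictly positive (function names are illustrative):

```python
import numpy as np

def crc(lesion, background, lesion_true, background_true):
    """CRC over R realisations (Eq. 44); `lesion`/`background` are (R, N) arrays."""
    true_contrast = lesion_true.mean() / background_true.mean() - 1.0
    return np.mean(lesion.mean(axis=1) / background.mean(axis=1) - 1.0) / true_contrast

def ensemble_std(background):
    """Normalised STD over background ROIs (Eq. 45); `background` is (R, N_B)."""
    mean_k = background.mean(axis=0)                                    # average over realisations, per voxel
    var_k = ((background - mean_k) ** 2).sum(axis=0) / (background.shape[0] - 1)
    return np.mean(np.sqrt(var_k / mean_k))

def kldiv(y, y_est):
    """KL divergence between measured y and estimated y_est = A x_hat + b (Eq. 46).
    Assumes y > 0 in every bin."""
    return np.sum(y_est * np.log(y_est / y) - y_est + y)
```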

B Additional Results

B.1 2D Reconstruction

We show additional sensitivity plots for 2D reconstruction. For noise level 10 these results are presented in Fig. 8 and Fig. 9, without and with lesions respectively. The results are similar to those for noise level 2.5: in Fig. 9 we see a clear trade-off between reconstruction quality in terms of PSNR/SSIM and lesion visibility in terms of CRC. Higher regularisation leads to better PSNR/SSIM scores, and lower regularisation to better recovery of lesions. A high regularisation, i.e. a strong influence of the score model, may lead to a worse reconstruction of the lesions, as the score model was trained on images without lesions.

Figure 8: Results for BrainWeb without lesions with noise level 10 for different penalty parameters. The standard deviation is computed over reconstructions of different noise realisations $\mathbf{y}$. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 9: Results for BrainWeb with lesions with noise level 10 for different penalty parameters. The standard deviation is computed over reconstructions of different noise realisations $\mathbf{y}$. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.

B.2 MR guidance

We show additional results for the MR-guided model. Sensitivity plots without and with lesions are presented in Fig. 10 and Fig. 11. These results support the findings of the paper, as the MR-guided models achieve better reconstruction quality w.r.t. PSNR and SSIM. However, the CRC is similar to that of the unguided model: as the lesions are not visible in the MR image, no additional information about the lesions is introduced through guidance. We show two more reconstruction examples without lesions in Fig. 12 and examples with lesions in Fig. 13.

Figure 10: Results for 2D reconstruction, guided vs. unguided, without lesions for noise level 2.5. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 11: Results for 2D reconstruction, guided vs. unguided, with lesions for noise level 2.5. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 12: Comparisons of PET-DDS MR guided vs. unguided at noise level 2.5 without lesions.
Figure 13: Comparison of PET-DDS MR guided vs. unguided at noise level 2.5 with lesions.

B.3 3D results RDPz sweeps

We show the sensitivity plots for different penalty values of the additional RDP regulariser in the $z$-direction for PET-DDS in Fig. 14 and Fig. 15 for the two different tracers. In addition, we show axial, coronal and sagittal slices of the reconstruction with the Amyloid tracer in Fig. 16.

Figure 14: Results for 3D reconstruction using the FDG tracer for different penalty values. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 15: Results for 3D reconstruction using the Amyloid tracer for different penalty values. The points represent different values of the parameter $\lambda$; the markers ✙ and ⧫ denote the smallest and largest numerical value of $\lambda$, respectively.
Figure 16: 3D reconstructions for the different methods with the Amyloid tracer; metrics are computed on the inset lesion.
Algorithm 1 PET-Naive
Measurements $\mathbf{y}$
Number of steps $N\in\mathbb{N}$
Time discretisation $0=t_{k_1}\leq\dots\leq t_{k_N}=1$
$\mathbf{x}_{t_N}\sim p_1$   ▷ Sample initial noise
for $k=N-1,\dots,1$ do
    $s_\theta\leftarrow s_\theta(\mathbf{x}_{t_{k+1}},t_{k+1})$
    $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$
    $\Delta t\leftarrow t_k-t_{k+1}$
    $\tilde{\mathbf{x}}_{t_k}\leftarrow\mathbf{x}_{t_{k+1}}+\big[\mathbf{f}(\mathbf{x}_{t_{k+1}},t_{k+1})-g(t_{k+1})^2 s_\theta\big]\Delta t+g(t_{k+1})\sqrt{|\Delta t|}\,\mathbf{z}$   ▷ Unconditional score update
    $\mathbf{x}_{t_k}\leftarrow\tilde{\mathbf{x}}_{t_k}-g(t_{k+1})^2\lambda_{t_{k+1}}^{\text{Naive}}\nabla_{\mathbf{x}}L(\mathbf{y}\,|\,c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\mathbf{x}_{t_{k+1}}])\,\Delta t$   ▷ Data consistency step
end for
$\hat{\mathbf{x}}\leftarrow c_{\text{OSEM}}\mathbf{x}_{t_1}$
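For concreteness, a compact Python sketch of the PET-Naive loop above; the score model, SDE coefficients, likelihood gradient and penalty schedule are passed in as placeholder callables, so this is a sketch of the sampling structure rather than the paper's implementation:

```python
import torch

def pet_naive_sampling(score_model, f, g, grad_data, lam, t_grid, c_osem, shape):
    """Sketch of Algorithm 1 (PET-Naive). `grad_data(x, t)` is assumed to return
    grad_x L(y | c_OSEM * P_{x>=0}[x]); `lam(t)` is the penalty schedule."""
    x = torch.randn(shape)                                    # x_{t_N} ~ p_1
    for i in range(len(t_grid) - 2, -1, -1):                  # k = N-1, ..., 1
        t, t_next = t_grid[i], t_grid[i + 1]
        dt = t - t_next                                       # negative step (reverse time)
        s = score_model(x, t_next)
        z = torch.randn_like(x)
        # unconditional score update
        x_tilde = x + (f(x, t_next) - g(t_next) ** 2 * s) * dt \
                  + g(t_next) * abs(dt) ** 0.5 * z
        # data consistency step (ascent on the log-likelihood, since dt < 0)
        x = x_tilde - g(t_next) ** 2 * lam(t_next) * grad_data(x, t_next) * dt
    return c_osem * x
```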
Algorithm 2 PET-DPS
Measurements $\mathbf{y}$
Number of steps $N\in\mathbb{N}$
Time discretisation $0=t_{k_1}\leq\dots\leq t_{k_N}=1$
Transition density $p(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t;\gamma_t\mathbf{x}_0,\nu_t^2\mathbf{I})$
$\mathbf{x}_{t_N}\sim p_1$   ▷ Sample initial noise
for $k=N-1,\dots,1$ do
    $s_\theta\leftarrow s_\theta(\mathbf{x}_{t_{k+1}},t_{k+1})$
    $\hat{\mathbf{x}}_0(\mathbf{x}_{t_{k+1}})\leftarrow\gamma_{t_{k+1}}^{-1}(\mathbf{x}_{t_{k+1}}+\nu_{t_{k+1}}^2 s_\theta)$   ▷ Compute Tweedie estimate
    $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$
    $\Delta t\leftarrow t_k-t_{k+1}$
    $\tilde{\mathbf{x}}_{t_k}\leftarrow\mathbf{x}_{t_{k+1}}+\big[\mathbf{f}(\mathbf{x}_{t_{k+1}},t_{k+1})-g(t_{k+1})^2 s_\theta\big]\Delta t+g(t_{k+1})\sqrt{|\Delta t|}\,\mathbf{z}$   ▷ Unconditional score update
    $\ell\leftarrow L(\mathbf{y}\,|\,c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\hat{\mathbf{x}}_0(\mathbf{x}_{t_{k+1}})])$
    $\mathbf{x}_{t_k}\leftarrow\tilde{\mathbf{x}}_{t_k}-(\lambda_{t_{k+1}}^{\text{DPS}}/\ell)\,\nabla_{\mathbf{x}}L(\mathbf{y}\,|\,c_{\text{OSEM}}P_{\mathbf{x}\geq 0}[\hat{\mathbf{x}}_0(\mathbf{x}_{t_{k+1}})])$   ▷ Data consistency step
end for
$\hat{\mathbf{x}}\leftarrow c_{\text{OSEM}}\mathbf{x}_{t_1}$
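A corresponding sketch of the PET-DPS guidance step, i.e. the Tweedie estimate followed by a gradient step through the score network, scaled by the current data-fidelity value; all names and interfaces are illustrative assumptions:

```python
import torch

def dps_guidance_step(x_next, t_next, x_tilde, score_model, gamma, nu,
                      data_fidelity, lam_dps, c_osem):
    """Sketch of the PET-DPS data consistency step (Algorithm 2).
    `data_fidelity(x)` is assumed to return L(y | x); `gamma`, `nu`, `lam_dps`
    are placeholder schedules matching the transition density above."""
    x_next = x_next.detach().requires_grad_(True)
    s = score_model(x_next, t_next)
    x0_hat = (x_next + nu(t_next) ** 2 * s) / gamma(t_next)      # Tweedie estimate of x_0
    ell = data_fidelity(c_osem * torch.clamp(x0_hat, min=0.0))   # scalar data-fidelity value
    grad = torch.autograd.grad(ell, x_next)[0]                   # gradient w.r.t. x_{t_{k+1}}, through the score network
    return x_tilde - lam_dps(t_next) / ell.detach() * grad       # scaled data consistency update
```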
Algorithm 3 PET-DDS
Measurements $\mathbf{y}$
Number of steps $N\in\mathbb{N}$
Time discretisation $0=t_{k_1}\leq\dots\leq t_{k_N}=1$
Transition density $p(\mathbf{x}_t|\mathbf{x}_0)=\mathcal{N}(\mathbf{x}_t;\gamma_t\mathbf{x}_0,\nu_t^2\mathbf{I})$
Number of inner optimisation steps $p\in\mathbb{N}$, number of subsets $n_{\rm sub}\in\mathbb{N}$
Stochasticity $\{\eta_t\}_{t\geq 0}$
$\mathbf{x}_{t_N}\sim p_1$   ▷ Sample initial noise
for $k=N-1,\dots,1$ do
    $s_\theta\leftarrow s_\theta(\mathbf{x}_{t_{k+1}},t_{k+1})$
    $\hat{\mathbf{x}}_0(\mathbf{x}_{t_{k+1}})^0\leftarrow\gamma_{t_{k+1}}^{-1}(\mathbf{x}_{t_{k+1}}+\nu_{t_{k+1}}^2 s_\theta)$   ▷ Compute Tweedie estimate
    for $i=0,\dots,p-1$ do   ▷ Inner optimisation for data consistency
        $j\leftarrow(p(N-k)+i\bmod n_{\rm sub})+1$
        $\Phi_j(\mathbf{x}^i_{t_{k+1}})\leftarrow L_j(\mathbf{y}\,|\,c_{\text{OSEM}}\mathbf{x}^i_{t_{k+1}})+\big(\lambda^{\text{RDP}}R_z(\mathbf{x}^i_{t_{k+1}})-\lambda^{\text{DDS}}\|\mathbf{x}^i_{t_{k+1}}-\hat{\mathbf{x}}_0(\mathbf{x}_{t_{k+1}})^0\|_2^2\big)/n_{\rm sub}$
        $\mathbf{x}^{i+1}_{t_{k+1}}\leftarrow P_{\mathbf{x}\geq 0}\big[\mathbf{x}^i_{t_{k+1}}+\mathbf{D}(\mathbf{x}^i_{t_{k+1}})\nabla_{\mathbf{x}}\Phi_j(\mathbf{x}^i_{t_{k+1}})\big]$
    end for
    $\mathbf{z}\sim\mathcal{N}(0,\mathbf{I})$
    $\text{Noise}(\mathbf{x}_{t_{k+1}},s_\theta)\leftarrow-\nu_{t_{k+1}}\sqrt{\nu_{t_k}^2-\eta_{t_{k+1}}^2}\,s_\theta$
    $\mathbf{x}_{t_k}\leftarrow\gamma_{t_k}\mathbf{x}^p_{t_{k+1}}+\text{Noise}(\mathbf{x}_{t_{k+1}},s_\theta)+\eta_{t_{k+1}}\mathbf{z}$
end for
$\hat{\mathbf{x}}\leftarrow c_{\text{OSEM}}\mathbf{x}_{t_1}$
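Finally, a sketch of the PET-DDS inner optimisation loop; the subset log-likelihoods, axial RDP penalty `rdp_z` and preconditioner `precond` are placeholders, and the signs follow the objective $\Phi_j$ above:

```python
import torch

def dds_inner_loop(x0_hat, subset_loglik, rdp_z, precond, c_osem,
                   lam_rdp, lam_dds, n_inner, start_subset):
    """Sketch of the PET-DDS inner data-consistency loop (Algorithm 3).
    `subset_loglik[j](x)` is assumed to return L_j(y | x); `precond(x)` a
    positive preconditioner D(x); all names are illustrative."""
    n_sub = len(subset_loglik)
    x = x0_hat.clone()
    for i in range(n_inner):
        j = (start_subset + i) % n_sub                     # cycle through the measurement subsets
        x = x.detach().requires_grad_(True)
        phi = subset_loglik[j](c_osem * x) \
              + (lam_rdp * rdp_z(x) - lam_dds * ((x - x0_hat) ** 2).sum()) / n_sub
        grad = torch.autograd.grad(phi, x)[0]
        x = torch.clamp(x + precond(x) * grad, min=0.0)    # preconditioned ascent, projected to x >= 0
    return x.detach()
```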