Recalibration of Aleatoric and EpistemicRegression Uncertainty in Medical Imaging

Laves, Max-Heinrich; Ihler, Sontje; Fast, Jacob F.; Kahrs, Lüder A.; Ortmaier, Tobias

doi:https://doi.org/10.59275/j.melba.2021-a6fd

1 Introduction

Predictive uncertainty should be considered in any medical imaging task that is approached with deep learning. Well-calibrated uncertainty is of great importance for decision-making and is anticipated to increase patient safety. It allows to robustly reject unreliable predictions or out-of-distribution samples. In this paper, we address the problem of miscalibration of regression uncertainty with application to medical image analysis.

For the task of regression, we aim to estimate a continuous target value $\bm{y}\in\mathbb{R}^{d}$ given an input image $\bm{x}$ . Regression in medical imaging with deep learning has been applied to forensic age estimation from hand CT/MRI (Halabi et al., 2019; Štern et al., 2016), natural landmark localization (Payer et al., 2019), cell detection in histology (Xie et al., 2018), or instrument pose estimation (Gessert et al., 2018). By predicting the coordinates of object boundaries, segmentation can also be performed as a regression task. This has been done for segmentation of pulmonary nodules in CT (Messay et al., 2015), kidneys in ultrasound (Yin et al., 2020), or left ventricles in MRI (Tan et al., 2017). In registration of medical images, a continuous displacement field is predicted for each coordinate of $\bm{x}$ , which has also recently been addressed by CNNs for regression (Dalca et al., 2019).

In medical imaging, it is crucial to consider the predictive uncertainty of deep learning models. Bayesian neural networks (BNN) and their approximation provide mathematical tools for reasoning the uncertainty (Bishop, 2006; Kingma and Welling, 2014). In general, predictive uncertainty can be split into two types: aleatoric and epistemic uncertainty (Tanno et al., 2017; Kendall and Gal, 2017). This distinction was first made in the context of risk management (Hora, 1996). Aleatoric uncertainty arises from the data directly; e. g. sensor noise or motion artifacts. Epistemic uncertainty is caused by uncertainty in the model parameters due to a limited amount of training data (Bishop, 2006). A well-accepted approach to quantify epistemic uncertainty is variational inference with Monte Carlo (MC) dropout, where dropout is used at test time to sample from the approximate posterior (Gal and Ghahramani, 2016).

Refer to caption — Figure 1: Calibration plots and uncertainty calibration error (UCE) for EfficientNet-B4 on BreastPathQ test set. Uncalibrated uncertainty is underestimated and does not correspond well with the model error (left). Uncertainty can be calibrated most effectively with $\sigma$ scaling (right). Solid lines show the mean and shaded areas show standard deviation from 5 repeated runs. Dashed lines denote perfect calibration.

Uncertainty quantification in regression problems in medical imaging has been addressed by prior work. Medical image enhancement with image quality transfer (IQT) has been extended to a Bayesian approach to obtain pixel-wise uncertainty (Tanno et al., 2016). Additionally, CNN-based IQT was used to estimate both aleatoric and epistemic uncertainty in MRI super-resolution (Tanno et al., 2017). Dalca et al. (2019) estimated uncertainty for a deformation field in medical image registration using a probabilistic CNN. Registration uncertainty has also been addressed outside the deep learning community (Luo et al., 2019). Schlemper et al. (2018) used sub-network ensembles to obtain uncertainty estimates in cardiac MRI reconstruction. Aleatoric and epistemic uncertainty was also used in multitask learning for MRI-based radiotherapy planning (Bragman et al., 2018).

Uncertainty obtained by deep BNNs tends to be miscalibrated, i. e. it does not correlate well with the model error (Laves et al., 2019). Fig. 1 shows calibration plots (observed uncertainty vs. expected uncertainty) for uncalibrated and calibrated uncertainty estimates. The predicted uncertainty (taking into account both epistemic and aleatoric uncertainty) is underestimated and does not allow robust detection of uncertain predictions at test time.

Calibration of uncertainty in regression has been addressed in prior work outside medical imaging. In (Kuleshov et al., 2018), inaccurate uncertainties from Bayesian models for regression are recalibrated using a technique inspired by Platt scaling. Given a pre-trained, miscalibrated model $\bm{H}$ , an auxiliary model $\bm{R}:[0,1]^{d}\rightarrow[0,1]^{d}$ is trained, that yields a calibrated regressor $\bm{R}\circ\bm{H}$ . In (Phan et al., 2018), this method was applied to bounding box regression. However, an auxiliary model with enough capacity will always be able to recalibrate, even if the predicted uncertainty is completely uncorrelated with the real uncertainty. Furthermore, Kuleshov et al. (2018) state that calibration via $\bm{R}$ is possible if enough independent and identically distributed (i.i.d.) data is available. In medical imaging, large data sets are usually hard to obtain, which can cause $\bm{R}$ to overfit the calibration set. This downside was addressed in (Levi et al., 2019), which is most related to our work. They proposed to scale the standard deviation of a Gaussian model to recalibrate aleatoric uncertainty. In contrast to our work, they do not take into account epistemic uncertainty, which is an important source of uncertainty, especially when dealing with small data sets in medical imaging.

This paper extends a preliminary version of this work presented at the Medical Imaging with Deep Learning (MIDL) 2020 conference (Laves et al., 2020). We continue this work by providing a new derivation of our definition of perfect calibrtaion, new experimental results, analysis and discussion. Additionally, prediction intervals are computed to further assess the quality of the estimated uncertainty. We find that prediction intervals are estimated too narrow and that recalibration can mitigate this problem.

To the best of our knowledge, calibration of predictive uncertainty for regression tasks in medical imaging has not been addressed. Our main contributions are: (1) We suggest to use $\sigma$ scaling in a separate calibration phase to tackle underestimation of aleatoric and epistemic uncertainty, (2) we propose to use the uncertainty calibration error and prediction intervals to assess the quality of the estimated uncertainty, and (3) we perform extensive experiments on four different data sets to show the effectiveness of the proposed method.

2 Methods

In this section, we discuss estimation of aleatoric and epistemic uncertainty for regression and show why uncertainty is systematically miscalibrated. We propose to use $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty.

2.1 Conditional Log-Likelihood for Regression

We revisit regression under the maximum posterior (MAP) framework to derive direct estimation of heteroscedastic aleatoric uncertainty. That is, the aleatoric uncertainty varies with the input and is not assumed to be constant. The goal of our regression model is to predict a target value $\bm{y}$ given some new input $\bm{x}$ and a training set $\mathcal{D}$ of $m$ inputs $\bm{X}=\{\bm{x}_{1},\ldots,\bm{x}_{m}\}$ and their corresponding (observed) target values $\bm{Y}=\{\bm{y}_{1},\ldots,\bm{y}_{m}\}$ . We assume that $\bm{y}$ has a Gaussian distribution $\mathcal{N}\left(\bm{y};\hat{\bm{y}}(\bm{x}),\hat{\sigma}^{2}(\bm{x})\right)$ with mean equal to $\hat{\bm{y}}(\bm{x})$ and variance $\hat{\sigma}^{2}(\bm{x})$ . A neural network with parameters $\bm{\theta}$

\bm{f}_{\bm{\theta}}\left(\bm{x}\right)=\left[\hat{\bm{y}}(\bm{x}),\hat{\sigma}^{2}(\bm{x})\right],~{}\hat{\bm{y}}\in\mathbb{R}^{d},~{}\hat{\sigma}^{2}\in\mathbb{R},\hat{\sigma}^{2}\geq 0

(1)

outputs these values for a given input (Nix and Weigend, 1994). By assuming a Gaussian prior over the parameters $\bm{\theta}\sim\mathcal{N}(\bm{\theta};\bm{0},\lambda^{-1}\bm{I})$ , MAP estimation becomes maximum-likelihood estimation with added weight decay (Bishop, 2006). With $m$ i.i.d. random samples, the conditional log-likelihood $\log p(\bm{Y}\,|\,\bm{X},\bm{\theta})$ is given by

	$\displaystyle\log p(\bm{Y}\,\|\,\bm{X},\bm{\theta})=$	$\displaystyle\sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}\hat{\sigma}^{(i)}_{\bm{\theta}}}\exp\left\{-\frac{\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\\|}^{2}}{2\big{(}\hat{\sigma}^{(i)}_{\bm{\theta}}\big{)}^{2}}\right\}\right)$		(2)
	$\displaystyle=$	$\displaystyle-\dfrac{m}{2}\log\left(2\pi\right)-\sum_{i=1}^{m}\log\big{(}\hat{\sigma}^{(i)}_{\bm{\theta}}\big{)}+\frac{1}{2\big{(}\hat{\sigma}_{\bm{\theta}}^{(i)}\big{)}^{2}}\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\\|}^{2}~{}.$		(3)

The dependence on $\bm{x}$ has been omitted to simplify the notation. Maximizing the log-likelihood in Eq. (3) w.r.t. $\bm{\theta}$ is equivalent to minimizing the negative log-likelihood (NLL), which leads to the following optimization criterion (with weight decay)

\mathcal{L}_{\mathrm{G}}(\bm{\theta})=\sum_{i=1}^{m}\big{(}\hat{\sigma}^{(i)}_{\bm{\theta}}\big{)}^{-2}\big{\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\|}^{2}+\log\big{(}(\hat{\sigma}_{\bm{\theta}}^{(i)})^{2}\big{)}~{}.

(4)

Here, $\hat{\bm{y}}_{\bm{\theta}}$ and $\hat{\sigma}_{\bm{\theta}}$ are estimated jointly by finding $\bm{\theta}$ that minimizes Eq. (4). This can be achieved using gradient descent in a standard training procedure. In this case, $\hat{\sigma}_{\bm{\theta}}$ captures the uncertainty that is inherent in the data (aleatoric uncertainty). To avoid numerical instability due to potential division by zero, we directly estimate $\log\hat{\sigma}^{2}(\bm{x})$ and implement Eq. (4) in similar practice to Kendall and Gal (2017).

2.2 Biased estimation of $\sigma$

Ignoring their dependence through $\bm{\theta}$ , the solution to Eq. (4) decouples estimation of $\hat{\bm{y}}$ and $\hat{\sigma}$ . In case of a Gaussian likelihood, minimizing Eq. (4) w.r.t. $\hat{\bm{y}}^{(i)}$ yields

\hat{\bm{y}}^{(i)}=\operatorname*{arg\,min}_{\hat{\bm{y}}^{(i)}}\mathcal{L}_{\mathrm{G}}=\bm{y}^{(i)}~{}\operatorname{\forall}i~{}.

(5)

Minimizing (4) w.r.t. $(\hat{\sigma}^{(i)})^{2}$ yields

\big{(}\hat{\sigma}^{(i)}\big{)}^{2}=\operatorname*{arg\,min}_{(\hat{\sigma}^{(i)})^{2}}\mathcal{L}_{\mathrm{G}}=\|\bm{y}^{(i)}-\hat{\bm{y}}^{(i)}\|^{2}~{}\operatorname{\forall}i~{}.

(6)

That is, estimation of $\sigma^{2}$ should perfectly reflect the squared error. However, in Eq. (6) $\sigma^{2}$ is estimated relative to the estimated mean $\hat{\bm{y}}$ and therefore biased. In fact, the maximum likelihood solution systematically underestimates $\sigma^{2}$ , which is a phenomenon of overfitting the training set (Bishop, 2006). The squared error $\|\bm{y}-\hat{\bm{y}}\|^{2}$ will be lower on the training set and $\hat{\sigma}^{2}$ on new samples will be systematically too low (see Fig. 2). This is a problem especially in deep learning, where large models have millions of parameters and tend to overfit. To solve this issue, we introduce a simple learnable scalar parameter $s$ to rescale the biased estimation of $\sigma^{2}$ .

2.3 $\sigma$ Scaling for Aleatoric Uncertainty

We first derive $\sigma$ scaling for aleatoric uncertainty. Using a Gaussian model, we scale the standard deviation $\sigma$ with a scalar value $s$ to recalibrate the probability density function

p\left(\bm{y}|\bm{x};\hat{\bm{y}}(\bm{x}),\hat{\sigma}^{2}(\bm{x})\right)=\mathcal{N}\left(\bm{y};\hat{\bm{y}}(\bm{x}),(s\cdot\hat{\sigma}(\bm{x}))^{2}\right)~{}.

(7)

This results in the following minimization objective:

\mathcal{L}_{\mathrm{G}}(s)=m\log(s)+\tfrac{1}{2}s^{-2}\sum_{i=1}^{m}\big{(}\hat{\sigma}_{\bm{\theta}}^{(i)}\big{)}^{-2}\big{\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\|}^{2}~{}.

(8)

Eq. (8) is optimized w.r.t. $s$ with fixed $\bm{\theta}$ using gradient descent in a separate calibration phase after training to calibrate aleatoric uncertainty measured by $\hat{\sigma}_{\bm{\theta}}^{2}$ . In case of a single scalar, the solution to Eq. (8) can also be written in closed form as

s=\pm\sqrt{\frac{1}{m}\sum_{i=1}^{m}\big{(}\hat{\sigma}_{\bm{\theta}}^{(i)}\big{)}^{-2}\big{\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\|}^{2}}~{}.

(9)

We apply $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty in the next section.

2.4 Well-Calibrated Estimation of Predictive Uncertainty

So far we have assumed a MAP point estimate for $\bm{\theta}$ which does not consider uncertainty in the parameters. To quantify both aleatoric and epistemic uncertainty, we extend $\bm{f}_{\bm{\theta}}$ into a fully Bayesian model under the variational inference framework with Monte Carlo dropout (Gal and Ghahramani, 2016). In MC dropout, the model $\bm{f}_{\tilde{\bm{\theta}}}$ is trained with dropout (Srivastava et al., 2014) and dropout is applied at test time by performing $N$ stochastic forward passes to sample from the approximate Bayesian posterior $\tilde{\bm{\theta}}\sim q(\bm{\theta})$ . Following (Kendall and Gal, 2017), we use MC integration to approximate the predictive variance

\hat{\Sigma}^{2}=\underbrace{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\frac{1}{N}\sum_{n=1}^{N}\hat{\bm{y}}_{n}\right)^{2}}_{\mathrm{epistemic}}+\underbrace{\frac{1}{N}\sum_{n=1}^{N}\hat{\sigma}^{2}_{n}}_{\mathrm{aleatoric}}

(10)

and use $\hat{\Sigma}^{2}$ as a measure of predictive uncertainty. If the neural network has multiple outputs ( $d>1$ ), the predictive variance is calculated per output and the mean across $d$ forms the final uncertainty value. Eq. (10) is an unbiased estimator of the approximate predictive variance (see proof in Appendix C). From Eq. (33) of our proof follows, that $\hat{\Sigma}^{2}$ is expected to equal the true variance $\Sigma=\mathbb{E}[(\hat{\bm{y}}-\bm{y})^{2}]$ . Thus, we define perfect calibration of regression uncertainty as

\mathbb{E}_{\bm{x},\bm{y}}\left[\mathbb{E}[(\hat{\bm{y}}-\bm{y})^{2}]\,\big{|}\,\hat{\Sigma}^{2}=\alpha^{2}\right]=\alpha^{2}\quad\operatorname{\forall}\left\{\alpha^{2}\in\mathbb{R}\,|\,\alpha^{2}\geq 0\right\}~{},

(11)

which extends the definition of (Levi et al., 2019) to both aleatoric and epistemic uncertainty. We expect that additionally accounting for epistemic uncertainty is particularly beneficial for smaller data sets. However, even in deep learning with Bayesian principles, the approximate posterior predictive distribution can overfit on small data sets. In practice, this leads to underestimation of the predictive uncertainty.

One could regularize overfitting by early stopping that prevents large differences between training and test loss, which would circumvent underestimation of $\sigma^{2}$ . However, our experiments show that early stopping is not optimal with regard to accuracy, i.e. the squared error of $\hat{\bm{y}}$ on both training and testing data (see Fig. 2). In contrast, the model with lowest mean error on the validation set underestimates predictive uncertainty considerably. Therefore, we apply $\sigma$ scaling to recalibrate the predictive uncertainty $\hat{\Sigma}^{2}$ . This allows a lower squared error while reducing underestimation of uncertainty as shown experimentally in the following section.

2.5 Expected Uncertainty Calibration Error for Regression

We extend the definition of miscalibrated uncertainty for classification (Laves et al., 2019) to quantify miscalibration of uncertainty in regression

\mathbb{E}_{\hat{\Sigma}^{2}}\left[\big{|}\big{(}\mathbb{E}[(\hat{\bm{y}}-\bm{y})^{2}]\,\big{|}\,\hat{\Sigma}^{2}=\alpha^{2}\big{)}-\alpha^{2}\big{|}\right]\quad\operatorname{\forall}\left\{\alpha^{2}\in\mathbb{R}\,|\,\alpha^{2}\geq 0\right\}~{},

(12)

using the second moment of the error. On finite data sets, this can be approximated with the expected uncertainty calibration error (UCE) for regression. Following (Guo et al., 2017), the uncertainty output $\hat{\Sigma}^{2}$ of a deep model is partitioned into $K$ bins with equal width. A weighted average of the difference between the variance and predictive uncertainty is used:

\mathrm{UCE}:={\sum_{k=1}^{K}}\frac{|B_{k}|}{m}\big{|}{\mathrm{var}}(B_{k})-\mathrm{uncert}(B_{k})\big{|}~{},

(13)

with number of inputs $m$ and set of indices $B_{k}$ of inputs, for which the uncertainty falls into the bin $k$ . The variance per bin is defined as

\mathrm{var}(B_{k}):=\frac{1}{|B_{k}|}\sum_{i\in B_{m}}{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{i,n}-\bm{y}_{i}\right)^{2}}~{},

(14)

with $N$ stochastic forward passes, and the uncertainty per bin is defined as

\mathrm{uncert}(B_{k}):=\frac{1}{|B_{k}|}\sum_{i\in B_{k}}\hat{\Sigma}_{i}^{2}~{}.

(15)

Note that computing the second moment from Eq. (12) also incorporates MC samples, which can introduce some bias in the evaluation. The UCE considers both aleatoric and epistemic uncertainty and is given in % throughout this work. Additionally, we plot $\mathrm{var}(B_{k})$ vs. $\mathrm{uncert}(B_{k})$ to create calibration diagrams.

3 Experiments

We use four data sets and three common deep network architectures to evaluate recalibration with $\sigma$ scaling. The data sets were selected to represent various regression tasks in medical imaging with different dimension $d$ of target value $\bm{y}\in\mathbb{R}^{d}$ :

(1) Estimation of tumor cellularity in histology whole slides of cancerous breast tissue from the BreastPathQ SPIE challenge data set ( $d=1$ ) (Martel et al., 2019). The public data set consists of 2579 images, from which 1379/600/600 are used for training/validation/testing. The ground truth label is a single scalar $y\in[0,1]$ denoting the ratio of tumor cells to non-tumor cells.

(2) Hand CT age regression from the RSNA pediatric bone age data set ( $d=1$ ) (Halabi et al., 2019). The task is to infer a person’s age in months from CT scans of the hand. This data set is the largest used in this paper and has 12,811 images, from which we use 6811/2000/4000 images for training/validation/testing.

(3) Surgical instrument tracking on endoscopic images from the EndoVis endoscopic vision challenge 2015¹¹1endovissub-instrument.grand-challenge.org data set ( $d=2$ ). This data set contains 8,984 video frames from 6 different robot-assisted laparoscopic interventions showing surgical instruments with ground truth pixel coordinates of the instrument’s center point $\bm{y}\in\mathbb{R}^{2}$ . We use 4483/2248/2253 frames for training/validation/testing. As the public data set is only sparsely annotated, we created our own ground truth labels, which can be found in our code repository.

(4) 6DoF needle pose estimation on optical coherence tomography (OCT) scans from our own data set²²2Our OCT pose estimation data set is publicly available at github.com/mlaves/3doct-pose-dataset. This data set contains 5,000 3D-OCT scans with the accompanying needle pose $\bm{y}\in\mathbb{R}^{6}$ , from which we use 3300/850/850 for training/validation/testing. Additional details on this data set can be found in Appendix E.

All outputs are normalized such that $\bm{y}\in[0,1]^{d}$ . The employed network architectures are ResNet-101, DenseNet-201 and EfficientNet-B4 (He et al., 2016; Huang et al., 2017; Tan and Le, 2019). The last linear layer of all networks is replaced by two linear layers predicting $\hat{\bm{y}}$ and $\hat{\sigma}^{2}$ as described in § 2.1. For MC dropout, we use dropout before the last linear layers. Dropout is further added after each of the four layers of stacked residual blocks in ResNet. In DenseNet and EfficientNet, we use the default configuration of dropout during training and testing. The networks are trained until no further decrease in mean squared error (MSE) on the validation set can be observed. More details on the training procedure can be found in Appendix D.

Calibration is performed after training in a separate calibration phase using the validation data set. We plug the predictive uncertainty $\hat{\Sigma}^{2}$ into Eq. (8) and minimize w.r.t. $s$ . Additionally, we compare $\sigma$ scaling to a more powerful auxiliary recalibration model $\bm{R}$ consisting of a two-layer fully-connected network with 16 hidden units and ReLU activations (inspired by (Kuleshov et al., 2018), see § 1).

4 Results

To quantify miscalibration, we use the proposed expected uncertainty calibration error for regression. We visualize (mis-)calibration in Fig. 1 and Fig. 3 using calibration diagrams, which show expected uncertainty vs. observed uncertainty. The discrepancy to the identity function reveals miscalibration. The calibration diagrams clearly show the underestimation of uncertainty for the uncalibrated models. After calibration with both aux and $\sigma$ scaling, the estimated uncertainty better reflects the actual uncertainty. Figures for all configurations are listed in Appendix G.

Tab. 1 reports UCE values of all data set/model combinations on the respective test sets. The negative log-likelihood also measures miscalibration; the values on the test set can be found in Tab. 2 in the appendix. In general, recalibration considerably reduces miscalibration. On the data sets BoneAge, EndoVis and OCT, both scaling methods perform similarly well. However, on the BreastPathQ data set, $\sigma$ scaling clearly outperforms aux scaling in terms of UCE. BreastPathQ is the smallest data set and thus has the smallest calibration set size. We hypothesize that the more powerful auxiliary model $\bm{R}$ overfits the calibration set (see BreastPathQ/DenseNet-201 in Tab. 1), which leads to an increase of UCE on the test set. An ablation study on BreastPathQ for the auxiliary model can be found in Appendix F.

We also compare our approach to Levi et al. (2019) in Tab. 1, which only considers aleatoric uncertainty. The aleatoric uncertainty is well-calibrated if it reflects the bias $\left(\mathbb{E}\left[\hat{\bm{y}}_{n}\right]-\bm{y}\right)^{2}$ , which is given by the squared error between the expectation of the stochastic predictions $\hat{\bm{y}}_{n}$ and the ground truth. Therefore, the UCE for aleatoric-only is computed by $\mathrm{UCE}={\sum_{k=1}^{K}}\frac{|B_{k}|}{m}\big{|}{\mathrm{err}}(B_{k})-\mathrm{uncert}(B_{k})\big{|}~{}$ , where $\mathrm{err}(\cdot)$ is the mean squared error and $\mathrm{uncert}(\cdot)$ is the mean aleatoric uncertainty per bin. Consideration of epistemic uncertainty is especially beneficial on smaller data sets (BreastPathQ), where our approach outperforms Levi et al. (2019). On larger data sets, the benefit diminishes and both approaches are equally calibrated.

Additionally, we report UCE values from a DenseNet ensemble for comparison. In contrast to what is reported by Lakshminarayanan et al. (2017), the deep ensemble tends to be calibrated worse. Only on BoneAge, the ensemble is better calibrated prior to recalibration of the other methods. After recalibration, both approaches outperform the deep ensemble.

Fig. 4 shows the result of intra-training calibration of aleatoric uncertainty. It indicates that the gap between training and test loss is successfully closed.

Table 1: Uncertainty calibration error test set results for different datasets and model architectures (averaged over 5 runs). High UCE values indicate miscalibration. In addition, the resulting

s

for

\sigma

scaling is given. We also report UCE values for an ensemble of DenseNets. Bold font indicates lowest values in each experiment.

			Levi et al.				ours
Data Set	Model	MSE	none	aux	$\sigma$	$s$	none	aux	$\sigma$	$s$	ensemble
	ResNet-101	6.4e-3	0.51	0.35	0.28	2.91	0.49	0.31	0.20	2.37
BreastPathQ	DenseNet-201	7.0e-3	0.21	0.38	0.15	1.62	0.11	0.36	0.15	1.33	0.51
	EfficientNet-B4	6.4e-3	0.49	0.65	0.10	2.30	0.46	0.47	0.17	1.77
	ResNet-101	5.3e-3	0.28	0.07	0.06	1.46	0.28	0.02	0.06	1.40
BoneAge	DenseNet-201	3.5e-3	0.31	0.05	0.05	2.98	0.31	0.05	0.05	2.54	0.09
	EfficientNet-B4	3.5e-3	0.30	0.05	0.10	4.83	0.30	0.03	0.12	3.98
	ResNet-101	4.0e-4	0.04	0.10	0.09	6.07	0.04	0.04	0.04	3.50
EndoVis	DenseNet-201	1.1e-3	0.09	0.05	0.05	3.24	0.04	0.04	0.04	2.57	0.08
	EfficientNet-B4	8.9e-4	0.06	0.05	0.06	2.25	0.06	0.04	0.04	1.79
	ResNet-101	2.0e-3	0.17	0.02	0.02	2.74	0.17	0.01	0.02	2.14
OCT	DenseNet-201	1.3e-3	0.08	0.01	0.02	1.60	0.04	0.03	0.02	1.26	0.67
	EfficientNet-B4	1.4e-3	0.12	0.01	0.01	2.65	0.12	0.01	0.01	1.94

4.1 Posterior Prediction Intervals

In addition to the calibration diagrams, we compute prediction intervals from the uncalibrated and calibrated posterior predictive distribution. Well-calibrated prediction intervals provide a reliable measure of precision of the estimated target value. In Bayesian inference, prediction intervals define an interval within which the true target value $\bm{y}^{\ast}$ of a new, unobserved input $\bm{x}^{\ast}$ is expected to fall with a specific probability (Heskes, 1997; Held and Sabanés Bové, 2014). This is also referred to as the credible interval of the posterior predictive distribution. For $\gamma\in(0,1)$ , a $\gamma\cdot 100\,\%$ prediction interval is defined through $z_{l}$ and $z_{u}$ such that

\int_{z_{l}}^{z_{u}}p(\bm{y}^{\ast}\,|\,\bm{x}^{\ast},\mathcal{D})\,\mathrm{d}\bm{y}^{\ast}=\gamma~{},

(16)

with posterior predictive distribution $p(\bm{y}^{\ast}\,|\,\bm{x}^{\ast},\mathcal{D})$ . We compute the 50 %, 90 %, 95 %, and 99 % prediction interval using the root of the predictive variance from Eq. (10); that is, the $\hat{\bm{y}}\pm z\hat{\Sigma}$ intervals with $z\in\{\Phi(0.5),\Phi(0.9),\Phi(0.95),\Phi(0.99)\}$ (estimated interval), with probit function $\Phi(p)=\sqrt{2}\mathrm{erf}^{-1}(p)$ and $\mathrm{erf}(p)$ is the Gaussian error function. This assumes that the posterior predictive distribution is Gaussian, which is not generally the case. To assess the calibration of the posterior prediction interval, we compute the percentage of how many of the ground truth values of the test set actually fall within the respective intervals (observed interval). In Fig. 6, selected plots of observed vs. estimated prediction intervals are shown. A complete list of prediction intervals can be found in Appendix G.1.

In general, the uncalibrated prediction intervals are estimated to be too narrow, which is a direct consequence of the underestimated predictive variance. For example, the uncalibrated 90 % interval on DenseNet-201/BoneAge actually only contains approx. 50 % of the ground truth values. On this data set, the prediction intervals are considerably improved after recalibration (Fig. 6 left). If a network is already well-calibrated, recalibration can lead to overestimation of the lower prediction intervals (Fig. 6 right). However, in all cases, the 99 % prediction interval contains approx. 99 % of the ground truth test set values after recalibration. This is not the case without the proposed calibration methods. Fig. 5 shows a practical example of the $\hat{\bm{y}}\pm\hat{\Sigma}$ prediction region from the EndoVis test set. Even though the posterior predictive distribution is not necessarily Gaussian, the calibrated results fit the prediction intervals well. This is especially the case for BoneAge, which is the largest data set used in this paper.

4.2 Detection of Out-of-Distribution Data and Unreliable Predictions

Deep neural networks only yield reliable predictions for data which follow the same distribution as the training data. A shift in distribution could occur when a model trained on CT data from a specific CT device is applied to data from another manufacturer’s CT device, for example. This could potentially lead to wrong predictions with low uncertainty, which we tackle with recalibration. To create a moderate distribution shift, we preprocess images from the BoneAge data set using Contrast Limited Adapative Histogram Equalization (CLAHE) (Pizer et al., 1987) with a clip-limit of 0.03 and report histograms of the uncertainties (see Fig. 7). Additionally, a severe distribution shift is created by presenting images from the BreastPathQ data set to the models trained on BoneAge. Lakshminarayanan et al. (2017) state that deep ensembles provide better-calibrated uncertainty than Bayesian neural networks with MC dropout variational inference. We therefore train an ensemble of 5 randomly initialized DenseNet-201 and compare Bayesian uncertainty with $\sigma$ scaling to ensemble uncertainty under distribution shift. The results with $\sigma$ scaling are comparable to those from a deep ensemble for a moderate shift, but without the need to train multiple models on the same data set. A severe shift leads to only slightly increased uncertainties from the calibrated MC dropout model, while the deep ensemble is more sensitive.

Additionally, we apply the well-calibrated models to detect and reject uncertain predictions, as crucial decisions in medical practice should only be made on the basis of reliable predictions. An uncertainty threshold $\Sigma_{\mathrm{max}}^{2}$ is defined and all predictions from the test set are rejected where $\hat{\Sigma}^{2}>\Sigma_{\mathrm{max}}^{2}$ (see Fig. 8). From this, a decrease in overall MSE is expected. We additionally compare rejection on the basis of $\sigma$ scaled uncertainty to uncertainty from the aforementioned ensemble. In case of $\sigma$ scaling, the test set MSE decreases monotonically as a function of the uncertainty threshold, whereas the ensemble initially shows an increasing MSE (see Fig. 8).

5 Discussion & Conclusion

In this paper, well-calibrated predictive uncertainty in medical imaging obtained by variational inference with deep Bayesian models is discussed. Both aux and $\sigma$ scaling calibration methods considerably reduce miscalibration of predictive uncertainty in terms of UCE. If the deep model is already well-calibrated, $\sigma$ scaling does not negatively affect the calibration, which results in $s\rightarrow 1$ . More complex calibration methods such as aux scaling have to be used with caution, as they can overfit the data set used for calibration. If the calibration set is sufficiently large, they can outperform simple scaling. However, models trained on large data sets are generally better calibrated and the benefit diminishes. Compared to the work of Levi et al. (2019), accounting for epistemic uncertainty is particularly beneficial for smaller data sets, which is helpful in medical practice where access to large labeled data sets is less common and is associated with great costs.

Posterior prediction intervals provide another insight into the calibration of deep models. After recalibration, the 99 % posterior prediction intervals correctly contain approx. 99 % of the ground truth test set values. In some cases, lower prediction intervals are estimated to be too wide after calibration. This is especially the case for smaller data sets and we conjecture that small calibration sets may not contain enough i.i.d. data for calibrating lower prediction intervals and that the assumption of a Gaussian predictive distribution is too strong in this case. On the smallest data set BreastPathQ, aux scaling seems to perform better in terms of prediction intervals, but not in terms of UCE.

Well-calibrated uncertainties from MC dropout are able to detect a moderate shift in the data distribution. However, deep ensembles perform better under a severe distribution shift. BNNs with calibrated uncertainty by $\sigma$ scaling outperform ensemble uncertainty in the rejection task, which we attribute to the generally poorer calibration of ensembles on in-distribution data.

$\sigma$ scaling is simple to implement, does not change the predictive mean $\hat{\bm{y}}$ , and therefore guarantees to conserve the model’s accuracy. It is preferable to regularization (e. g. early stopping) or more complex recalibration methods in calibrated uncertainty estimation with Bayesian deep learning. The disconnection between training and test NLL can successfully be closed, which creates highly accurate models with reliable uncertainty estimates. However, there are many factors (e. g. network capacity, weight decay, dropout configuration) influencing the uncertainty that have not been discussed here and will be addressed in future work.

Acknowledgments

We thank Vincent Modes and Mark Wielitzka for their insightful comments. This research has received funding from the European Union as being part of the ERDF OPhonLas project.

References

Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN 978-0-387-31073-2.
Bragman et al. (2018) Felix JS Bragman, Ryutaro Tanno, Zach Eaton-Rosen, Wenqi Li, David J Hawkes, Sebastien Ourselin, Daniel C Alexander, Jamie R McClelland, and M Jorge Cardoso. Uncertainty in multitask learning: joint representations for probabilistic mr-only radiotherapy planning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–11, 2018.
Dalca et al. (2019) Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med Image Anal, 57:226–236, 2019. doi: https://doi.org/10.1016/j.media.2019.07.006.
Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.
Gessert et al. (2018) Nils Gessert, Matthias Schlüter, and Alexander Schlaefer. A deep learning approach for pose estimation from volumetric OCT data. Med Image Anal, 46:162–179, 2018. doi: 10.1016/j.media.2018.03.002.
Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.
Halabi et al. (2019) Safwan S. Halabi, Luciano M. Prevedello, Jayashree Kalpathy-Cramer, Artem B. Mamonov, Alexander Bilbily, Mark Cicero, Ian Pan, Lucas Araújo Pereira, Rafael Teixeira Sousa, Nitamar Abdala, Felipe Campos Kitamura, Hans H. Thodberg, Leon Chen, George Shih, Katherine Andriole, Marc D. Kohli, Bradley J. Erickson, and Adam E. Flanders. The RSNA pediatric bone age machine learning challenge. Radiol, 290(2):498–503, 2019. doi: 10.1148/radiol.2018180736.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
Held and Sabanés Bové (2014) Leonhard Held and Daniel Sabanés Bové. Applied Statistical Inference. Springer, 1 edition, 2014. ISBN 978-3-642-37887-4. doi: 10.1007/978-3-642-37887-4.
Heskes (1997) Tom Heskes. Practical confidence and prediction intervals. In NeurIPS, pages 176–182, 1997.
Hora (1996) Stephen C Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2-3):217–223, 1996.
Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.
Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.
Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
Kuleshov et al. (2018) Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In ICML, volume 80, pages 2796–2804, 2018.
Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pages 6402–6413, 2017.
Laves et al. (2017) Max-Heinrich Laves, Andreas Schoob, Lüder A. Kahrs, Tom Pfeiffer, Robert Huber, and Tobias Ortmaier. Feature tracking for automated volume of interest stabilization on 4D-OCT images. In SPIE Medical Imaging, volume 10135, pages 256–262, 2017. doi: 10.1117/12.2255090.
Laves et al. (2019) Max-Heinrich Laves, Sontje Ihler, Karl-Philipp Kortmann, and Tobias Ortmaier. Well-calibrated model uncertainty with temperature scaling for dropout variational inference. In Bayesian Deep Learning Workshop (NeurIPS), 2019. arXiv:1909.13550.
Laves et al. (2020) Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Lüder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. In Medical Imaging with Deep Learning, 2020.
Levi et al. (2019) Dan Levi, Liran Gispan, Niv Giladi, and Ethan Fetaya. Evaluating and calibrating uncertainty prediction in regression tasks. In arXiv, 2019. arXiv:1905.11659.
Luo et al. (2019) Jie Luo, Alireza Sedghi, Karteek Popuri, Dana Cobzas, Miaomiao Zhang, Frank Preiswerk, Matthew Toews, Alexandra Golby, Masashi Sugiyama, William M Wells, et al. On the applicability of registration uncertainty. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 410–419, 2019.
Martel et al. (2019) A. L. Martel, S. Nofech-Mozes, S. Salama, S. Akbar, and M. Peikari. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data set]. The Cancer Imaging Archive, 2019. doi: 10.7937/TCIA.2019.4YIBTJNO.
Messay et al. (2015) Temesguen Messay, Russell C. Hardie, and Timothy R. Tuinstra. Segmentation of pulmonary nodules in computed tomography using a regression neural network approach and its application to the lung image database consortium and image database resource initiative dataset. Med Image Anal, 22(1):48–62, 2015. doi: 10.1016/j.media.2015.02.002.
Nix and Weigend (1994) David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of IEEE International Conference on Neural Networks, volume 1, pages 55–60, 1994.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.
Payer et al. (2019) Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med Image Anal, 54:207–219, 2019. doi: 10.1016/j.media.2019.03.007.
Phan et al. (2018) Buu Phan, Rick Salay, Krzysztof Czarnecki, Vahdat Abdelzad, Taylor Denouden, and Sachin Vernekar. Calibrating uncertainties in object localization task. In Bayesian Deep Learning Workshop (NeurIPS), 2018. arXiv:1811.11210.
Pizer et al. (1987) Stephen M Pizer, E Philip Amburn, John D Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3):355–368, 1987.
Schlemper et al. (2018) Jo Schlemper, Guang Yang, Pedro Ferreira, Andrew Scott, Laura-Ann McGill, Zohya Khalique, Margarita Gorodezky, Malte Roehl, Jennifer Keegan, Dudley Pennell, et al. Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac mri. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 295–303, 2018.
Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.
Štern et al. (2016) Darko Štern, Christian Payer, Vincent Lepetit, and Martin Urschler. Automated age estimation from hand mri volumes using deep learning. In MICCAI, pages 194–202, 2016.
Tan et al. (2017) Li Kuo Tan, Yih Miin Liew, Einly Lim, and Robert A. McLaughlin. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med Image Anal, 39:78–86, 2017. doi: 10.1016/j.media.2017.04.002.
Tan and Le (2019) Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.
Tanno et al. (2016) Ryutaro Tanno, Aurobrata Ghosh, Francesco Grussu, Enrico Kaden, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–273, 2016.
Tanno et al. (2017) Ryutaro Tanno, Daniel E Worrall, Aurobrata Ghosh, Enrico Kaden, Stamatios N Sotiropoulos, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer with cnns: exploring uncertainty in dmri super-resolution. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 611–619, 2017.
Xie et al. (2018) Yuanpu Xie, Fuyong Xing, Xiaoshuang Shi, Xiangfei Kong, Hai Su, and Lin Yang. Efficient and robust cell detection: A structured regression approach. Med Image Anal, 44:245–254, 2018. doi: 10.1016/j.media.2017.07.003.
Yin et al. (2020) Shi Yin, Qinmu Peng, Hongming Li, Zhengqiang Zhang, Xinge You, Katherine Fischer, Susan L. Furth, Gregory E. Tasian, and Yong Fan. Automatic kidney segmentation in ultrasound images using subsequent boundary distance regression and pixelwise classification networks. Med Image Anal, 60:101602, 2020. doi: 10.1016/j.media.2019.101602.

A Laplacian Model

Using $\mathsf{Laplace}(\hat{\bm{y}}(\bm{x}),\hat{\sigma}(\bm{x}))$ as model, the conditional log-likelihood is given by

\displaystyle\log p(\bm{Y}\,|\,\bm{X},\bm{\theta})=

\displaystyle\sum_{i=1}^{m}\log\left(\frac{1}{2\hat{\sigma}^{(i)}_{\bm{\theta}}}\exp\left\{-\frac{\big{|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{|}}{\hat{\sigma}^{(i)}_{\bm{\theta}}}\right\}\right)~{},

(17)

which results in the following minimization criterion:

\mathcal{L}_{\mathrm{L}}(\bm{\theta})=\sum_{i=1}^{m}\frac{1}{\hat{\sigma}^{(i)}_{\bm{\theta}}}\big{|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{|}+\log\big{(}\hat{\sigma}_{\bm{\theta}}^{(i)}\big{)}~{}.

(18)

Using $\mathcal{L}_{\mathrm{L}}(\bm{\theta})$ instead of $\mathcal{L}_{\mathrm{G}}(\bm{\theta})$ results in applying an L1 metric on the predictive mean. In some cases, this led to better results. However, we have not conducted extensive experiments with it and leave it to future work.

B Derivation of $\sigma$ Scaling

See § 2.3. Using a Gaussian model, we scale the standard deviation $\sigma$ with a scalar value $s$ to calibrate the probability density function

p\left(\bm{y}|\bm{x};\hat{\bm{y}}(x),\hat{\sigma}^{2}(x)\right)=\mathcal{N}\left(\bm{y};\hat{\bm{y}}(x),(s\cdot\hat{\sigma}(x))^{2}\right)~{}.

(19)

The conditional log-likelihood is given by

	$\displaystyle\log p(\bm{Y}\,\|\,\bm{X},\bm{\theta})$	$\displaystyle=\sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}s\hat{\sigma}_{\theta}^{(i)}}\exp\left(\frac{\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}^{(i)}_{\bm{\theta}}\big{\\|}^{2}}{2\left(s\hat{\sigma}_{\bm{\theta}}^{(i)}\right)^{2}}\right)\right)$		(20)
		$\displaystyle=-\dfrac{m}{2}\log\left(2\pi\right)-\sum_{i=1}^{m}\log\left(s\hat{\sigma}^{(i)}_{\bm{\theta}}\right)+\frac{1}{2}\left(s\hat{\sigma}_{\bm{\theta}}^{(i)}\right)^{-2}\cdot\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\\|}^{2}$		(21)

This results in the following optimization objective (ignoring constants):

\mathcal{L}_{\mathrm{G}}(s)=m\log(s)+\tfrac{1}{2}s^{-2}\sum_{i=1}^{m}(\hat{\sigma}_{\bm{\theta}}^{(i)})^{-2}\big{\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\|}^{2}~{}.

(22)

Using a Laplacian model, the optimization criterion follows as

\mathcal{L}_{\mathrm{L}}(s)=m\log(s)+s^{-1}\sum_{i=1}^{m}\frac{1}{\hat{\sigma}_{\bm{\theta}}^{(i)}}\left|\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\right|~{}.

(23)

Eq. (22) and (23) are optimized w.r.t. $s$ with fixed $\bm{\theta}$ using gradient descent in a separate calibration phase after training. The solution to Eq. (22) can also be written in closed form as

s_{\mathrm{G}}=\pm\sqrt{\frac{1}{m}\sum_{i=1}^{m}\big{(}\hat{\sigma}_{\bm{\theta}}^{(i)}\big{)}^{-2}\big{\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\|}^{2}}

(24)

and the solution to Eq. (23) follows as

s_{\mathrm{L}}=\frac{1}{m}\sum_{i=1}^{m}\frac{1}{\hat{\sigma}_{\bm{\theta}}^{(i)}}\left|\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\right|~{},

(25)

respectively. We apply $\sigma$ scaling to jointly calibrate aleatoric and epistemic uncertainty as described in § 2.4.

C Unbiased Estimator of the Approximate Predictive Variance

We show that the expectation of the predictive sample variance from MC dropout, as given in (Kendall and Gal, 2017), equals the true variance of the approximate posterior predictive distribution.

Proposition 1

Given $N$ MC dropout samples $\bm{f}_{\bm{\theta}_{n}}=[\hat{\bm{y}}_{n},\hat{\sigma}^{2}_{n}]$ from our approximate predictive distribution $p(\bm{y}^{\ast}|\bm{x}^{\ast},\mathcal{D})=\mathcal{N}(\bm{y}^{\ast};\bm{y},\Sigma^{2})$ , the predictive sample variance

\hat{\Sigma}^{2}=\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\frac{1}{N}\sum_{n=1}^{N}\hat{\bm{y}}_{n}\right)^{2}+\frac{1}{N}\sum_{n=1}^{N}\hat{\sigma}^{2}_{n}

(26)

is an unbiased estimator of the approximate predictive variance.

Proof

$\displaystyle\mathbb{E}\left[\hat{\Sigma}^{2}\right]$	$\displaystyle=\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\frac{1}{N}\sum_{n=1}^{N}\hat{\bm{y}}_{n}\right)^{2}+\frac{1}{N}\sum_{n=1}^{N}\hat{\sigma}^{2}_{n}\right]$		(27)
	$\displaystyle=\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\frac{1}{N}\sum_{n=1}^{N}\hat{\bm{y}}_{n}\right)^{2}\right]+\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\hat{\sigma}^{2}_{n}\right]$		(28)
	$\displaystyle\mathrm{with}\quad\frac{1}{N}\sum_{n=1}^{N}\hat{\bm{y}}_{n}=\bar{\bm{y}}\quad\mathrm{follows}$		(29)
	$\displaystyle=\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\bar{\bm{y}}\right)^{2}\right]+\hat{\sigma}^{2}$		(30)
	$\displaystyle=\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\bar{\bm{y}}\right)^{2}+\bar{\bm{y}}^{2}-\bar{\bm{y}}^{2}+\bm{y}^{2}-\bm{y}^{2}+2\bar{\bm{y}}\bm{y}-2\bar{\bm{y}}\bm{y}\right]+\hat{\sigma}^{2}$		(31)
	$\displaystyle=\mathbb{E}\left[\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\bm{y}}_{n}-\bm{y}\right)^{2}-\left(\bar{\bm{y}}-\bm{y}\right)^{2}\right]+\hat{\sigma}^{2}$		(32)
	$\displaystyle=\mathbb{E}\left[\left(\hat{\bm{y}}-\bm{y}\right)^{2}\right]-\mathbb{E}\left[\left(\bar{\bm{y}}-\bm{y}\right)^{2}\right]+\hat{\sigma}^{2}$	$\displaystyle=\Sigma^{2}-\hat{\sigma}^{2}+\hat{\sigma}^{2}$	(33)
$\displaystyle\mathbb{E}\left[\hat{\Sigma}^{2}\right]$	$\displaystyle=\Sigma^{2}$		(34)

Note that the predicted heteroscedastic aleatoric uncertainty $\hat{\sigma}^{2}$ equals the bias $\mathbb{E}[(\bar{\bm{y}}-\bm{y})^{2}]$ in Eq. (33) when the aleatoric uncertainty is perfectly calibrated, thus $\mathbb{E}[(\bar{\bm{y}}-\bm{y})^{2}]=\hat{\sigma}^{2}$ .

D Training Procedure

The model implementations from PyTorch 1.3 (Paszke et al., 2019) are used and trained with the following settings:

•
training for 500 epochs with batch size of 16
•
Adam optimizer with initial learn rate of $3\cdot 10^{-4}$ and weight decay with $\lambda=10^{-7}$
•
reduce-on-plateau learn rate scheduler (patience of 20 epochs) with factor of 0.1
•
in MC dropout, $N=25$ forward passes were performed with dropout with $p=0.5$ used for ResNet (as described in (Gal and Ghahramani, 2016)). In DenseNet ( $p=0.2$ ) and EfficientNet ( $p=0.4$ ) standard dropout $p$ of the architecture is used.
•
Additional validation and test sets are used if provided by the data sets; otherwise, a train/validation/test split of approx. 50% / 25% / 25% is used
•
Source code for all experiments is available at github.com/mlaves/well-calibrated-regression-uncertainty

E 3D OCT Needle Pose Data Set

Our data set was created by attaching a surgical needle to a high-precision six-axis hexapod robot (H-826, Physik Instrumente GmbH & Co. KG, Germany) and observing the needle tip with 3D optical coherence tomography (OCS1300SS, Thorlabs Inc., USA). The data set consists of 5,000 OCT acquisitions with $(64\times 64\times 512)$ voxels, covering a volume of approx. $(3\times 3\times 3)$ $\mathrm{mm}^{3}$ . Each acquisition is taken at a different robot configuration and labeled with the corresponding 6DoF pose $\bm{y}\in\mathbb{R}^{6}$ . To process the volumetric data with CNNs for planar images, we calculate 3 planar projections along the spatial dimensions using the $\operatorname*{arg\,max}$ operator, scale them to equal size and stack them together as three-channel image (see Fig. 9). A similar approach was presented in (Laves et al., 2017) and (Gessert et al., 2018). The data are characterized by a high amount of speckle noise, which is a typical phenomenon in optical coherence tomography. The data set is publicly available at github.com/mlaves/3doct-pose-dataset.

F Ablation Study on Auxiliary Model Scaling

Here, we investigate the overfitting behavior of aux scaling by reducing the number of hidden layer units $h$ of the two-layer auxiliary model with ReLU activations. Aux scaling is more powerful than $\sigma$ scaling, which can lead to overfitting the calibration set. Fig. 10 shows calibration diagrams for the auxiliary model ablations. Reducing $h$ leads to a minor calibration improvement, but at $h=2$ , the model outputs a constant uncertainty, which is close to the overall mean of the observed uncertainty. A single-layer single-unit model without bias would be equivalent to $\sigma$ scaling.

G Additional Results and Calibration Diagrams

All test set runs have been repeated 5 times. Solid lines denote mean and shaded areas denote standard deviation calculated from the repeated runs.

Table 2: Negative log-likelihood test set results for different datasets and model architectures (averaged over 5 runs). High NLL values indicate miscalibration. We also report NLL values for an ensemble of DenseNets. Bold font indicates lowest values in each experiment.

			Levi et al.			ours
Data Set	Model	MSE	none	aux	$\sigma$	none	aux	$\sigma$	ensemble
	ResNet-101	6.4e-3	-0.78	-5.06	-5.06	-2.89	-5.17	-5.16
BreastPathQ	DenseNet-201	7.0e-3	-5.16	-5.84	-5.70	-5.67	-6.03	-5.78	0.11
	EfficientNet-B4	6.4e-3	-3.11	-5.99	-5.53	-4.73	-6.16	-5.62
	ResNet-101	5.3e-3	-3.90	-4.34	-4.34	-3.99	-4.34	-4.34
BoneAge	DenseNet-201	3.5e-3	1.74	-4.70	-4.69	-0.75	-4.70	-4.69	0.07
	EfficientNet-B4	3.5e-3	13.61	-4.74	-4.67	6.40	-4.75	-4.64
	ResNet-101	4.0e-4	-0.53	-6.32	-6.33	-3.85	-6.76	-6.72
EndoVis	DenseNet-201	1.1e-3	-0.72	-6.10	-5.99	-4.94	-6.05	-6.04	0.04
	EfficientNet-B4	8.9e-4	-5.10	-6.06	-6.07	-5.94	-6.17	-6.17
	ResNet-101	2.0e-3	-1.08	-5.24	-5.24	-3.38	-5.24	-5.24
OCT	DenseNet-201	1.3e-3	-5.05	-5.61	-5.61	-5.51	-5.62	-5.61	0.10
	EfficientNet-B4	1.4e-3	-1.72	-5.58	-5.57	-4.25	-5.58	-5.57

G.1 Additional Prediction Intervals

[bib.bib1] Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. ISBN 978-0-387-31073-2.

[bib.bib2] Bragman et al. (2018) Felix JS Bragman, Ryutaro Tanno, Zach Eaton-Rosen, Wenqi Li, David J Hawkes, Sebastien Ourselin, Daniel C Alexander, Jamie R McClelland, and M Jorge Cardoso. Uncertainty in multitask learning: joint representations for probabilistic mr-only radiotherapy planning. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 3–11, 2018.

[bib.bib3] Dalca et al. (2019) Adrian V. Dalca, Guha Balakrishnan, John Guttag, and Mert R. Sabuncu. Unsupervised learning of probabilistic diffeomorphic registration for images and surfaces. Med Image Anal, 57:226–236, 2019. doi: https://doi.org/10.1016/j.media.2019.07.006.

[bib.bib4] Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In ICML, pages 1050–1059, 2016.

[bib.bib5] Gessert et al. (2018) Nils Gessert, Matthias Schlüter, and Alexander Schlaefer. A deep learning approach for pose estimation from volumetric OCT data. Med Image Anal, 46:162–179, 2018. doi: 10.1016/j.media.2018.03.002.

[bib.bib6] Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In ICML, pages 1321–1330, 2017.

[bib.bib7] Halabi et al. (2019) Safwan S. Halabi, Luciano M. Prevedello, Jayashree Kalpathy-Cramer, Artem B. Mamonov, Alexander Bilbily, Mark Cicero, Ian Pan, Lucas Araújo Pereira, Rafael Teixeira Sousa, Nitamar Abdala, Felipe Campos Kitamura, Hans H. Thodberg, Leon Chen, George Shih, Katherine Andriole, Marc D. Kohli, Bradley J. Erickson, and Adam E. Flanders. The RSNA pediatric bone age machine learning challenge. Radiol, 290(2):498–503, 2019. doi: 10.1148/radiol.2018180736.

[bib.bib8] He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

[bib.bib9] Held and Sabanés Bové (2014) Leonhard Held and Daniel Sabanés Bové. Applied Statistical Inference. Springer, 1 edition, 2014. ISBN 978-3-642-37887-4. doi: 10.1007/978-3-642-37887-4.

[bib.bib10] Heskes (1997) Tom Heskes. Practical confidence and prediction intervals. In NeurIPS, pages 176–182, 1997.

[bib.bib11] Hora (1996) Stephen C Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2-3):217–223, 1996.

[bib.bib12] Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, pages 4700–4708, 2017.

[bib.bib13] Kendall and Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In NeurIPS, pages 5574–5584, 2017.

[bib.bib14] Kingma and Welling (2014) Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.

[bib.bib15] Kuleshov et al. (2018) Volodymyr Kuleshov, Nathan Fenner, and Stefano Ermon. Accurate uncertainties for deep learning using calibrated regression. In ICML, volume 80, pages 2796–2804, 2018.

[bib.bib16] Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NeurIPS, pages 6402–6413, 2017.

[bib.bib17] Laves et al. (2017) Max-Heinrich Laves, Andreas Schoob, Lüder A. Kahrs, Tom Pfeiffer, Robert Huber, and Tobias Ortmaier. Feature tracking for automated volume of interest stabilization on 4D-OCT images. In SPIE Medical Imaging, volume 10135, pages 256–262, 2017. doi: 10.1117/12.2255090.

[bib.bib18] Laves et al. (2019) Max-Heinrich Laves, Sontje Ihler, Karl-Philipp Kortmann, and Tobias Ortmaier. Well-calibrated model uncertainty with temperature scaling for dropout variational inference. In Bayesian Deep Learning Workshop (NeurIPS), 2019. arXiv:1909.13550.

[bib.bib19] Laves et al. (2020) Max-Heinrich Laves, Sontje Ihler, Jacob F Fast, Lüder A Kahrs, and Tobias Ortmaier. Well-calibrated regression uncertainty in medical imaging with deep learning. In Medical Imaging with Deep Learning, 2020.

[bib.bib20] Levi et al. (2019) Dan Levi, Liran Gispan, Niv Giladi, and Ethan Fetaya. Evaluating and calibrating uncertainty prediction in regression tasks. In arXiv, 2019. arXiv:1905.11659.

[bib.bib21] Luo et al. (2019) Jie Luo, Alireza Sedghi, Karteek Popuri, Dana Cobzas, Miaomiao Zhang, Frank Preiswerk, Matthew Toews, Alexandra Golby, Masashi Sugiyama, William M Wells, et al. On the applicability of registration uncertainty. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 410–419, 2019.

[bib.bib22] Martel et al. (2019) A. L. Martel, S. Nofech-Mozes, S. Salama, S. Akbar, and M. Peikari. Assessment of residual breast cancer cellularity after neoadjuvant chemotherapy using digital pathology [data set]. The Cancer Imaging Archive, 2019. doi: 10.7937/TCIA.2019.4YIBTJNO.

[bib.bib23] Messay et al. (2015) Temesguen Messay, Russell C. Hardie, and Timothy R. Tuinstra. Segmentation of pulmonary nodules in computed tomography using a regression neural network approach and its application to the lung image database consortium and image database resource initiative dataset. Med Image Anal, 22(1):48–62, 2015. doi: 10.1016/j.media.2015.02.002.

[bib.bib24] Nix and Weigend (1994) David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of IEEE International Conference on Neural Networks, volume 1, pages 55–60, 1994.

[bib.bib25] Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035, 2019.

[bib.bib26] Payer et al. (2019) Christian Payer, Darko Štern, Horst Bischof, and Martin Urschler. Integrating spatial configuration into heatmap regression based CNNs for landmark localization. Med Image Anal, 54:207–219, 2019. doi: 10.1016/j.media.2019.03.007.

[bib.bib27] Phan et al. (2018) Buu Phan, Rick Salay, Krzysztof Czarnecki, Vahdat Abdelzad, Taylor Denouden, and Sachin Vernekar. Calibrating uncertainties in object localization task. In Bayesian Deep Learning Workshop (NeurIPS), 2018. arXiv:1811.11210.

[bib.bib28] Pizer et al. (1987) Stephen M Pizer, E Philip Amburn, John D Austin, Robert Cromartie, Ari Geselowitz, Trey Greer, Bart ter Haar Romeny, John B Zimmerman, and Karel Zuiderveld. Adaptive histogram equalization and its variations. Computer vision, graphics, and image processing, 39(3):355–368, 1987.

[bib.bib29] Schlemper et al. (2018) Jo Schlemper, Guang Yang, Pedro Ferreira, Andrew Scott, Laura-Ann McGill, Zohya Khalique, Margarita Gorodezky, Malte Roehl, Jennifer Keegan, Dudley Pennell, et al. Stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac mri. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 295–303, 2018.

[bib.bib30] Srivastava et al. (2014) Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. JMLR, 15:1929–1958, 2014.

[bib.bib31] Štern et al. (2016) Darko Štern, Christian Payer, Vincent Lepetit, and Martin Urschler. Automated age estimation from hand mri volumes using deep learning. In MICCAI, pages 194–202, 2016.

[bib.bib32] Tan et al. (2017) Li Kuo Tan, Yih Miin Liew, Einly Lim, and Robert A. McLaughlin. Convolutional neural network regression for short-axis left ventricle segmentation in cardiac cine MR sequences. Med Image Anal, 39:78–86, 2017. doi: 10.1016/j.media.2017.04.002.

[bib.bib33] Tan and Le (2019) Mingxing Tan and Quoc V Le. EfficientNet: Rethinking model scaling for convolutional neural networks. In ICML, 2019.

[bib.bib34] Tanno et al. (2016) Ryutaro Tanno, Aurobrata Ghosh, Francesco Grussu, Enrico Kaden, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 265–273, 2016.

[bib.bib35] Tanno et al. (2017) Ryutaro Tanno, Daniel E Worrall, Aurobrata Ghosh, Enrico Kaden, Stamatios N Sotiropoulos, Antonio Criminisi, and Daniel C Alexander. Bayesian image quality transfer with cnns: exploring uncertainty in dmri super-resolution. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 611–619, 2017.

[bib.bib36] Xie et al. (2018) Yuanpu Xie, Fuyong Xing, Xiaoshuang Shi, Xiangfei Kong, Hai Su, and Lin Yang. Efficient and robust cell detection: A structured regression approach. Med Image Anal, 44:245–254, 2018. doi: 10.1016/j.media.2017.07.003.

[bib.bib37] Yin et al. (2020) Shi Yin, Qinmu Peng, Hongming Li, Zhengqiang Zhang, Xinge You, Katherine Fischer, Susan L. Furth, Gregory E. Tasian, and Yong Fan. Automatic kidney segmentation in ultrasound images using subsequent boundary distance regression and pixelwise classification networks. Med Image Anal, 60:101602, 2020. doi: 10.1016/j.media.2019.101602.

	$\displaystyle\log p(\bm{Y}\,\|\,\bm{X},\bm{\theta})$	$\displaystyle=\sum_{i=1}^{m}\log\left(\frac{1}{\sqrt{2\pi}s\hat{\sigma}_{\theta}^{(i)}}\exp\left(\frac{\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}^{(i)}_{\bm{\theta}}\big{\\|}^{2}}{2\left(s\hat{\sigma}_{\bm{\theta}}^{(i)}\right)^{2}}\right)\right)$		(20)
		$\displaystyle=-\dfrac{m}{2}\log\left(2\pi\right)-\sum_{i=1}^{m}\log\left(s\hat{\sigma}^{(i)}_{\bm{\theta}}\right)+\frac{1}{2}\left(s\hat{\sigma}_{\bm{\theta}}^{(i)}\right)^{-2}\cdot\big{\\|}\bm{y}^{(i)}-\hat{\bm{y}}_{\bm{\theta}}^{(i)}\big{\\|}^{2}$		(21)