Learning normal appearance for fetal anomaly screening: Application to the unsupervised detection of Hypoplastic Left Heart Syndrome

Elisa Chotzoglou1, Thomas Day2, Jeremy Tan1, Jacqueline Matthew2, David Lloyd2, Reza Razavi2, John Simpson2, Bernhard Kainz1
1: Imperial College London, UK, 2: King’s College London, UK
Publication date: 2021/10/22
https://doi.org/10.59275/j.melba.2021-g4dg

Abstract

Congenital heart disease is considered one of the most common groups of congenital malformations, affecting 6-11 per 1000 newborns. In this work, an automated framework for the detection of cardiac anomalies during ultrasound screening is proposed and evaluated on the example of Hypoplastic Left Heart Syndrome (HLHS), a sub-category of congenital heart disease. We propose an unsupervised approach that learns healthy anatomy exclusively from clinically confirmed normal control patients. We evaluate a number of known anomaly detection frameworks together with a new model architecture based on the α-GAN network and find evidence that the proposed model performs significantly better than the state-of-the-art in image-based anomaly detection, yielding an average AUC of 0.81 and better robustness to initialisation compared to previous works.

Keywords

fetal screening · detection · unsupervised learning




1 Introduction

A key requirement for building automated pathology detection systems with machine learning in medical imaging is the availability and accessibility of a sufficient amount of data to train supervised discriminative models that yield accurate results. This is a problem in medical imaging applications, where data accessibility is limited by regulatory constraints and economic considerations. To build truly useful diagnostic systems, supervised machine learning methods would require a large amount of data and manual labelling effort for every possible disease to minimise false predictions. This is unrealistic because there are thousands of diseases, some represented by only a few patients ever recorded. Thus, learning representations from healthy anatomy and using anomaly detection to flag unusual image features for further investigation defines a more reasonable paradigm for medicine, especially in high-throughput settings like population screening, e.g. fetal ultrasound imaging. However, anomaly detection suffers from the great variability of healthy anatomical structures between individuals within patient populations, as well as from the many, often subtle, variants of pathologies. Many medical imaging datasets, e.g. volunteer studies like UK Biobank (Petersen et al., 2013), consist of images from predominantly healthy subjects with only a small proportion of abnormal cases. Thus, an anomaly detection approach or 'normative' learning paradigm is also reasonable from a practical point of view for applications like quality control within massive data lakes.

In this work, we formulate the detection of congenital heart disease as an anomaly detection task for fetal screening with ultrasound imaging. We utilise normal control data to learn the normative feature distribution which characterises healthy hearts and distinguishes them from fetuses with hypoplastic left heart syndrome (HLHS). We chose this test pathology because of our access to a well labelled image database from this domain. Theoretically, our method could be evaluated on any congenital heart disease that is visible in the four-chamber view of the heart (NHS, 2015).

Contribution: To the best of our knowledge, we propose the first working unsupervised anomaly detection approach for fetal ultrasound screening that uses only normal samples during training. Previous approaches rely on supervised (Arnaout et al., 2020) discrimination of known diseases, which makes them prone to errors when confronted with unseen classes. Our method extends the α-GAN architecture with attention mechanisms, and we propose an anomaly score based on the reconstruction and localisation capabilities of the model. We evaluate our method on a selected congenital heart disease, which is overlooked in 30-40% of clinical screening examinations (Chew et al., 2007), and compare it to other state-of-the-art methods for image-based anomaly detection. We show evidence that the proposed method outperforms state-of-the-art models and achieves promising results for the unsupervised detection of pathologies in fetal ultrasound screening.

2 Background and Related Work

2.1 Pathological Diseases of the Fetal Heart

Congenital heart disease (CHD) is the most common group of congenital malformations (Bennasar et al., 2010; Yeo et al., 2018; van Velzen et al., 2016). CHD is a defect in the structure of the heart or great vessels that is present at birth. Approximately 6-11 per 1000 newborns are affected, and 20-30% of these heart defects require surgery within the first year of life (Yeo et al., 2018). The most common approach to detecting the disease is the standard anomaly ultrasound scan at approximately 20 weeks of gestation (e.g. 18+0 to 20+6 weeks in the UK). In contemporary screening pathways, i.e., 2D ultrasound at GA 12 and 24, the prenatal detection rate of CHD is in the range of 39-59% (Pinto et al., 2012; van Velzen et al., 2016). In (Yeo et al., 2018), algorithmic support has been used to find diagnostically informative fetal cardiac views. With this aid, clinical experts have been shown to discriminate healthy controls from CHD cases with 98% sensitivity and 90% specificity in 4D ultrasound. However, 4D ultrasound is not commonly used during fetal screening, and in the proposed teleradiology setup all images still have to be assessed manually by highly experienced experts to achieve such a high performance.

In this work we focus on a subtype of CHD, Hypoplastic Left Heart Syndrome (HLHS). Examples of HLHS in comparison with healthy fetal hearts are presented in Figure 1. HLHS is rare, but is one of the most prominent pathologies in our cohort. In HLHS the four chamber view is usually grossly abnormal, allowing the identification of CHD (although not necessarily a detailed diagnosis) from a single image plane. A condition that is identifiable on a single view plane provides a clear case study for our proposed method. If HLHS is identified during pregnancy, provisions for the appropriate timing and location of delivery can be made, allowing immediate treatment of the affected infant to be instigated after birth. Postnatal palliative surgery is possible for HLHS, and the antenatal diagnosis of CHD in general has been shown to result in a reduced mortality compared to those infants diagnosed with CHD only after birth (Holland et al., ). However, the detection of this pathology during routine screening still remains challenging. Screening scans are performed by front-line-of-care sonographers with varying degrees of experience and the examination is influenced by factors such as fetal motion and the small size of the fetal heart.

Figure 1: Examples of four-chamber views of the fetal heart. A shows a normal fetal heart, with the normal sized LV (left ventricle) marked (dashed white arrow). B and C show two examples of fetal HLHS (hypoplastic left heart syndrome), with the hypoplastic LV marked (solid white arrow). Example B represents the mitral stenosis / aortic atresia subtype, with a severely hypoplastic, globular LV. Example C represents the mitral atresia / aortic atresia subtype, with a slit-like LV that is difficult to identify. * marks the right ventricle in each case.

2.2 One-class anomaly detection methods in Medical Imaging

One-class classification is a special case of classification in which training data is available for only a single class. The main goal is to learn a representation or a classifier (or a combination of both) in order to distinguish and recognise out-of-distribution samples during inference. Discriminative as well as generative deep learning methods have been proposed, for example one-class CNNs (Oza and Patel, 2019) and Deep SVDD (Ruff et al., 2018). Usually these methods utilise loss functions similar to those of OC-SVM (Schölkopf et al., 2001) and SVDD (Tax and Duin, 2004), or use regularisation techniques to make conventional neural networks compatible with one-class classification models (Perera et al., 2021). Generative models are mostly based on autoencoders or generative adversarial networks. In this work we mainly focus on the application of generative adversarial networks for anomaly detection in medical imaging.

Generative adversarial networks for anomaly detection were first proposed by (Schlegl et al., 2017) in the form of AnoGAN, a deep convolutional generative adversarial network inspired by DCGAN (Radford et al., 2016). During the training phase, only healthy samples are used. The approach consists of two models: a generator, which generates an image from random noise, and a discriminator, which classifies samples as real or fake, as is common in GANs. More specifically, the generator learns the mapping from uniformly distributed input noise sampled from the latent space to the 2D image space of healthy data. The output of the discriminator is a single value, interpreted as the probability that an image is real rather than produced by the generator network. In their work, a residual loss is introduced, defined as the $l_1$ norm between the real image and the generated image; this enforces visual similarity between the initial image and the generated one. Furthermore, in order to cope with GAN instability, instead of optimising the parameters of the generator by maximising the discriminator's output on generated examples, the generator is forced to generate data whose intermediate feature representation in the discriminator ($D_H$) is similar to that of real images. This is defined as the $l_1$ norm between the intermediate feature representations of the discriminator given the real image and the generated image as input, respectively. In AnoGAN, the anomaly score is defined as the loss function at the last iteration, i.e., the residual error plus the discrimination error. AnoGAN has been tested on a high-resolution SD-OCT dataset. For evaluation purposes, the authors report receiver operating characteristic (ROC) curves of the corresponding image-level anomaly detection performance. Based on their results, using the residual loss alone already yields good anomaly detection results, and the combination with the discriminator loss improves the overall performance slightly. During testing, an iterative search in the latent space is used to find the latent vector that best reconstructs the real test image. This is a time-consuming procedure, and the optimisation process can get stuck in local minima.

Similar to AnoGAN, a faster approach, f-AnoGAN, has been proposed in (Schlegl et al., 2019). In this work, the authors train a GAN on normal images; however, instead of the DCGAN model a Wasserstein GAN (WGAN) (Arjovsky et al., 2017; Gulrajani et al., 2017) is used. Initially, the WGAN is trained in order to learn a non-linear mapping from the latent space to the image space; generator and discriminator are optimised simultaneously, and samples that follow the data distribution are generated by the generator from input noise sampled from the latent space. Then an encoder (convolutional autoencoder) is trained to learn a mapping from image space to latent space. For training the encoder, different approaches are followed: training the encoder with generated images (z-to-z approach, ziz), training the encoder with real images (an image-to-image mapping approach, izi), and training a discriminator-guided izi encoder ($izi_f$). As anomaly score, the image reconstruction residual plus the residual of the discriminator's feature representation ($D_H$) is used. The method is evaluated on optical coherence tomography imaging data of the retina. Both (Schlegl et al., 2017) and (Schlegl et al., 2019) use image patches for training and are modular methods which are not trained in an end-to-end fashion.

Another GAN-based method applied to OCT data has been proposed by (Zhou et al., 2020), in which the authors propose a Sparsity-constrained Generative Adversarial Network (Sparse-GAN), a network based on an Image-to-Image GAN (Isola et al., 2017). Sparse-GAN consists of a generator, following the same approach as in (Isola et al., 2017), and a discriminator. Features in the latent space are constrained using a Sparsity Regularizer Net. The model is optimised with a reconstruction loss combined with an adversarial loss. The anomaly score is computed in the latent space rather than in image space. Furthermore, an Anomaly Activation Map (AAM) is proposed to visualise lesions.

Subsequently, AnoVAEGAN (Baur et al., 2018) has been proposed, in which the authors combine a spatial variational autoencoder with a discriminator and apply it to high-resolution MRI images for unsupervised lesion segmentation. AnoVAEGAN uses the variational autoencoder to model the normal data distribution, so that the model fully reconstructs healthy data while it is expected to fail to reconstruct abnormal samples. The discriminator classifies the inputs as real or reconstructed data. As anomaly score, the $l_1$ norm between the original image and the reconstructed image is used.

In contrast to the reconstruction-based anomaly detection methods discussed above, adGAN, an alternative GAN-based framework, is proposed in (Shen et al., 2020). The authors introduce two key components: fake-pool generation and a concentration loss. adGAN follows the structure of WGAN and consists of a generator and a discriminator. The WGAN is first trained with gradient penalty using healthy images only, and after a number of iterations a pool of fake images is collected from the current generator. A discriminator is then retrained using the initial set of healthy data as well as the generated images in the fake pool, with a concentration loss function. The concentration loss combines the traditional WGAN loss with a concentration term which aims to decrease the within-class distance of normal data. The output of the discriminator is used as anomaly score. The method is applied to skin lesion and brain lesion detection. Two other methods that utilise discriminator outputs as anomaly score, although not tested on medical imaging, are ALOCC (Sabokrou et al., 2018) and fenceGAN (Ngo et al., 2019). In ALOCC (Sabokrou et al., 2018), the discriminator's probabilistic output is utilised as abnormality score: an encoder-decoder is used for reconstruction, while the discriminator tries to differentiate the reconstructed images from the original ones. An extension of the ALOCC algorithm is the Old is Gold (OGN) algorithm presented in (Zaheer et al., 2020). After training a framework similar to ALOCC, the authors fine-tune the network using two different types of fake images, namely low-quality reconstructions and pseudo-anomaly images, in order to boost the ability of the discriminator to differentiate normal from abnormal images.

In (Ngo et al., 2019) the authors propose an encirclement loss that places the generated images at the boundary of the distribution and then use the discriminator in order to distinguish anomalous images. They propose this loss with the idea that a conventional GAN objective encourages the distribution of generated images to overlap with real images.

In (Gong et al., 2020) an approach based on the ALOCC algorithm is proposed for the detection of fetal congenital heart disease. However, during training both normal and abnormal samples are available, which is one of the key differences from our approach, where only healthy subjects are utilised. Furthermore, in addition to the encoder-decoder and discriminator networks used in ALOCC, they use two additional noise models of the same architecture whose input is an image plus Gaussian noise ($\tilde{x}$), in order to make their encoder-decoder networks more robust to distortions. In (Perera et al., 2019) a one-class generative adversarial network (OCGAN) is proposed for anomaly detection. OCGAN consists of two discriminators (a visual and a latent discriminator), a reconstruction network (denoising autoencoder) and a classifier. The latent discriminator learns to discriminate encodings of real images from latent codes randomly sampled from $\mathcal{U}(-1,1)$, while the visual discriminator distinguishes real from fake images. The classifier is trained using a binary cross-entropy loss and learns to recognise real images from fake images. Finally, in (Pidhorskyi et al., 2018) a probabilistic framework is proposed which is based on a model similar to the α-GAN. The latent space is forced to be similar to a standard normal distribution through an extra discriminator network, called the latent discriminator, similar to (Rosca et al., 2017). A parameterised data manifold is defined (using an adversarial autoencoder) which captures the underlying structure of the inlier distribution (normal data), and a test sample is considered abnormal if its probability with respect to the inlier distribution is below a threshold. The probability is factorised with respect to local coordinates of the manifold tangent space.
A summary of the key features for the works above is given in Table 1.
To establish consistency between the different related works we define $x$ as a test image, $\hat{x}$ as a reconstructed image, $D$ as a discriminator network ($D_H$ denoting an (intermediate) feature representation of the discriminator), $E$ as an encoder network (image space $\rightarrow$ latent space), $De$ as a decoder network (latent space back to image space), $G$ as a generator network (whose input is a noise vector), $z$ as the latent space representation and $\lambda$ as a fixed weighting factor.

Table 1: One-class anomaly detection using Generative Adversarial Networks

Reference | Approach | Anomaly score | Dataset
AnoGAN (Schlegl et al., 2017) | reconstruction & discrimination score | $(1-\lambda)\|x-G(z)\|+\lambda\|D_{H}(x)-D_{H}(G(z))\|$ | OCT
f-AnoGAN (Schlegl et al., 2019) | reconstruction & discrimination score | $\|x-G(E(x))\|^{2}+\lambda\|D_{H}(x)-D_{H}(G(E(x)))\|^{2}$ | OCT
Sparse-GAN (Zhou et al., 2020) | reconstruction error | $\|E(x)-E(De(E(x)))\|_{2}$ | OCT
AnoVAEGAN (Baur et al., 2018) | reconstruction error | $\|x-De(E(x))\|_{1}$ | Brain
adGAN (Shen et al., 2020) | discriminator score | $D(x)$ | Digits/skin/brain
ALOCC (Sabokrou et al., 2018) | discriminator score | $D(De(E(x)))$ | Generic images/video*
fenceGAN (Ngo et al., 2019) | discriminator score | $D(x)$ | Generic images*
OGN (Zaheer et al., 2020) | discriminator score | $D(De(E(x)))$ | Generic images/video*
OCGAN (Perera et al., 2019) | discriminator/reconstruction score | $D(De(E(x)))$ / $\|x-De(E(x))\|^{2}$ | Generic images*
GPND (Pidhorskyi et al., 2018) | probabilistic score | $p_{x}(x)$ | Generic images*

* The application field of these works, as described in the original papers, is not medical imaging.

3 Methods

In order to detect anomalies in fetal ultrasound data, we build an end-to-end model which takes the whole image as input and produces an anomaly score together with an attention map in an unsupervised way.

To achieve this, we build a GAN-based model in which the discriminator networks learn the salient features of the fetal images (i.e., the heart area) during training. We use an auto-encoding generative adversarial network (α-GAN), which makes use of discriminator information in order to predict the anomaly score. The α-GAN (Rosca et al., 2017; Kwon et al., 2019) is a fusion of generative adversarial learning (GAN) and a variational autoencoder: it can be considered an autoencoding GAN, combining the reconstruction power of an autoencoder with the sampling power of generative adversarial networks. It aims to overcome the GAN training instabilities that lead to mode collapse while, at the same time, exploiting the advantages of variational autoencoders and producing less blurry images. In the α-GAN, two discriminators focus on the data and latent space, respectively. An overview of the proposed architecture is given in Figure 2.

Figure 2: Our proposed GAN-based model.

Input: fetal ultrasound image $x$, parameter $\lambda$, number of epochs $N$
Output: models $E, G, D, LD$

1: for epoch = 1 to $N$ do
2:    Update $E, G$ using Eqs. 1, 2, 3 ($D$, $LD$ are fixed):
3:       $L_E \leftarrow \lambda\|x-\hat{x}\|_1 + LD(\hat{z}, 1)$
4:       $L_G \leftarrow \lambda\|x-\hat{x}\|_1 + D(\hat{x}, 1) + D(\tilde{x}, 1)$
5:       $L_{E,G} \leftarrow L_E + L_G$
6:    Update $D$ using Eq. 4 ($E$, $G$, $LD$ are fixed):
7:       $L_D \leftarrow D(x, 1) + D(\hat{x}, 0) + D(\tilde{x}, 0)$
8:    Update $LD$ using Eq. 5 ($E$, $G$, $D$ are fixed):
9:       $L_{LD} \leftarrow LD(\hat{z}, 0) + LD(z, 1)$
10: end for

* $L_{\{\cdot\}}$ denotes the loss function of each network; $E, G, D, LD$ denote the outputs of the corresponding networks; the targets $1/0$ correspond to real/fake.
Algorithm 1: Training procedure of the proposed method.

We assume a generating process of real fetal cardiac images $x$ as $x \sim p^{*}_{x}$ and a random prior distribution $p_{z}$. The reconstruction $\hat{x}$ of an input image $x$ is defined as $G(\hat{z})$, where $\hat{z}$ is a sample from the variational distribution $q_{E}$, i.e., $\hat{z} \sim q_{E}(z|x)$. Furthermore, we define $z$ as a sample from a normal prior distribution $p_{z}$, i.e., $z \sim \mathcal{N}(0, I)$.

The encoder ($E$) maps each real sample $x$ from the sample space $X$ to a point in the latent space $Z$, i.e., $E: X \rightarrow Z$. It consists of four blocks. Each block contains a convolution and batch normalisation followed by a Leaky Rectified Linear Unit (LeakyReLU) activation, down-sampling the resolution of the data by a factor of two in each block. Spectral normalisation (Miyato et al., 2018; Zhang et al., 2019), a weight normalisation method, is used after each convolutional layer. In the last block, after the convolutional layer, an attention gate is introduced (Schlemper et al., 2019; Zhang et al., 2019). The final layer of the encoder is a hyperbolic tangent (tanh) layer. The dimension of the latent space is 128.

The generator synthesises images from the latent space $Z$ back to the sample space $X$, i.e., $G: Z \rightarrow X$. It regenerates the initial image using four consecutive blocks of transposed convolution, batch normalisation and Rectified Linear Unit (ReLU) activation layers. The last layer is a hyperbolic tangent (tanh) activation. As in the encoder, spectral normalisation and attention gate layers are used.
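For illustration, a minimal PyTorch sketch of such encoder and generator building blocks is given below. The concrete attention gate is written here as SAGAN-style self-attention (Zhang et al., 2019), and all layer hyper-parameters (kernel size, stride, LeakyReLU slope) are assumptions, since they are not stated in the text.

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class SelfAttention2d(nn.Module):
    """Self-attention gate in the style of (Zhang et al., 2019); the exact gate
    used in the paper (cf. Schlemper et al., 2019) may differ."""

    def __init__(self, channels):
        super().__init__()
        self.query = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.key = spectral_norm(nn.Conv2d(channels, channels // 8, 1))
        self.value = spectral_norm(nn.Conv2d(channels, channels, 1))
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable blending weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # B x HW x C/8
        k = self.key(x).flatten(2)                     # B x C/8 x HW
        v = self.value(x).flatten(2)                   # B x C x HW
        attn = torch.softmax(q @ k, dim=-1)            # B x HW x HW
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual attention


def encoder_block(in_ch, out_ch, with_attention=False):
    """Conv -> BatchNorm -> LeakyReLU block that halves the spatial resolution,
    with spectral normalisation on the convolution; the attention gate is only
    appended in the last block, as described in the text."""
    layers = [
        spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    ]
    if with_attention:
        layers.append(SelfAttention2d(out_ch))
    return nn.Sequential(*layers)


def generator_block(in_ch, out_ch):
    """Transposed conv -> BatchNorm -> ReLU block that doubles the spatial resolution."""
    return nn.Sequential(
        spectral_norm(nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```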

The discriminator ($D$) takes an image as input and tries to discriminate between real and fake images; its output is the probability that the input image is real. It consists of four blocks, each composed of convolution, batch normalisation and ReLU layers. The last layer is a sigmoid. The discriminator treats $x$ as real, while reconstructions from the encoder-generator and samples generated from $p_{z}$ are considered fake.

A latent code discriminator ($LD$) is introduced in order to discriminate latent representations generated by the encoder network from samples of a standard Gaussian distribution. It consists of four linear layers, each followed by a LeakyReLU activation. We randomly initialise the encoder, generator and latent code discriminator; the weights of the discriminator are initialised from a normal distribution $\mathcal{N}(0, 0.02)$. We train the architecture by first updating the encoder parameters by minimising:

$\mathcal{L}_{E}=\mathbb{E}_{p^{*}_{x}}\left[\lambda\|x-\hat{x}\|_{1}-\log(LD(\hat{z}))\right]$   (1)

We define the generator loss as:

$\mathcal{L}_{G}=\mathbb{E}_{p^{*}_{x}}\left[\lambda\|x-\hat{x}\|_{1}-\log(D(G(\hat{z})))\right]+\mathbb{E}_{p_{z}}\left[-\log(D(G(z)))\right]$   (2)

Since we consider the encoder and generator as one network, the loss for the encoder-generator is:

$\mathcal{L}_{E,G}=\mathcal{L}_{E}+\mathcal{L}_{G}$   (3)

where $\mathcal{L}_{E}$ and $\mathcal{L}_{G}$ are defined in Eqs. 1 and 2, respectively. The generator is updated twice as often as the encoder in order to stabilise the training procedure.

Then we minimise the discriminator loss

$\mathcal{L}_{D}=\mathbb{E}_{p^{*}_{x}}\left[-2\log D(x)-\log(1-D(G(\hat{z})))\right]+\mathbb{E}_{p_{z}}\left[-\log(1-D(G(z)))\right].$   (4)

Finally, we update the weights of the latent code discriminator using

$\mathcal{L}_{LD}=\mathbb{E}_{p^{*}_{x}}\left[-\log(1-LD(\hat{z}))\right]+\mathbb{E}_{p_{z}}\left[-\log(LD(z))\right].$   (5)

For the reconstruction weighting parameter $\lambda$ (Eqs. 1 and 2), we use a value of 25 after a grid search.

The training process of the α-GAN model is described in Algorithm 1. The networks are trained using the Adam optimizer; the encoder and generator share the same learning rate, as do the discriminator and latent code discriminator.
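For concreteness, a minimal PyTorch sketch of one training step following Algorithm 1 is given below. It assumes that $D$ and $LD$ output probabilities (i.e. end with a sigmoid) so that binary cross-entropy can be used directly, and it omits the second generator update mentioned above; the network definitions, optimisers and latent dimensionality are placeholders.

```python
import torch
import torch.nn.functional as F


def _bce(pred, target_is_real):
    """Binary cross-entropy against an all-real or all-fake target of matching shape."""
    target = torch.ones_like(pred) if target_is_real else torch.zeros_like(pred)
    return F.binary_cross_entropy(pred, target)


def train_step(x, E, G, D, LD, opt_eg, opt_d, opt_ld, lam=25.0, latent_dim=128):
    """One alpha-GAN update following Algorithm 1 (Eqs. 1-5)."""
    def sample_z():
        return torch.randn(x.size(0), latent_dim, device=x.device)  # z ~ N(0, I)

    # --- update encoder and generator (Eqs. 1-3); D and LD are fixed ---
    z_hat = E(x)                  # encoded latent code
    x_hat = G(z_hat)              # reconstruction
    x_tilde = G(sample_z())       # sample generated from the prior
    loss_E = lam * F.l1_loss(x_hat, x) + _bce(LD(z_hat), True)
    loss_G = lam * F.l1_loss(x_hat, x) + _bce(D(x_hat), True) + _bce(D(x_tilde), True)
    loss_EG = loss_E + loss_G
    opt_eg.zero_grad()
    loss_EG.backward()
    opt_eg.step()

    # --- update discriminator (Eq. 4); E, G and LD are fixed ---
    x_hat = G(E(x)).detach()
    x_tilde = G(sample_z()).detach()
    loss_D = 2 * _bce(D(x), True) + _bce(D(x_hat), False) + _bce(D(x_tilde), False)
    opt_d.zero_grad()
    loss_D.backward()
    opt_d.step()

    # --- update latent code discriminator (Eq. 5); E, G and D are fixed ---
    z_hat = E(x).detach()
    loss_LD = _bce(LD(z_hat), False) + _bce(LD(sample_z()), True)
    opt_ld.zero_grad()
    loss_LD.backward()
    opt_ld.step()

    return loss_EG.item(), loss_D.item(), loss_LD.item()
```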

We additionally replace the latent discriminator with an approximation of the KL divergence. For a latent vector $\hat{z}$ of dimension $M$ we define the KL divergence as (Ulyanov et al., 2018):

$KL(q(\hat{z}|x)\,\|\,\mathcal{N}(0,I))\approx-\frac{M}{2}+\frac{1}{M}\sum_{i=1}^{M}\left(\frac{s_{i}^{2}+m_{i}^{2}}{2}-\log(s_{i})\right),$

where $m_{i}$ and $s_{i}$ are the mean and standard deviation of the $i$-th component of the $M$-dimensional latent space. Performance in this configuration is subpar, thus we limit the discussion to results with the latent code discriminator.
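A direct translation of this approximation into code might look as follows; estimating $m_{i}$ and $s_{i}$ from the statistics of the current batch is an assumption.

```python
import torch


def kl_to_standard_normal(z_hat, eps=1e-8):
    """Approximate KL(q(z|x) || N(0, I)) for a batch of latent codes z_hat of
    shape (batch, M), using the per-dimension empirical mean m_i and standard
    deviation s_i, as in the expression above."""
    m = z_hat.mean(dim=0)          # m_i
    s = z_hat.std(dim=0) + eps     # s_i (eps avoids log(0))
    M = z_hat.size(1)
    return -M / 2 + ((s ** 2 + m ** 2) / 2 - torch.log(s)).sum() / M
```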

Furthermore, we apply an analytic estimation of the KL divergence using a one-class variational autoencoder GAN (VAE-GAN), similar to (Baur et al., 2018; Dosovitskiy and Brox, 2016). The VAE-GAN is trained using the reconstruction error plus the KL divergence between the latent distribution ($\hat{z}$) and the normal prior $p_{z}$. For training the VAE-GAN, we first update the encoder and decoder networks as follows:

$\mathcal{L}_{E}=\mathbb{E}_{p^{*}_{x}}\left[\beta\|x-\hat{x}\|_{p}\right]+KL(q(\hat{z}|x)\,\|\,p_{z})$
$\mathcal{L}_{G}=\mathbb{E}_{p^{*}_{x}}\left[\gamma\|x-\hat{x}\|_{p}-\log(D(G(\hat{z})))\right]+\mathbb{E}_{p_{z}}\left[-\log(D(G(z)))\right]$

Finally, the discriminator is trained based on:

$\mathcal{L}_{D}=\mathbb{E}_{p^{*}_{x}}\left[-2\log D(x)-\log(1-D(G(\hat{z})))\right]+\mathbb{E}_{p_{z}}\left[-\log(1-D(G(z)))\right],$

where $\beta$ and $\gamma$ are set to 10 and 5, respectively, after a grid search.

A ResNet18-based (He et al., 2016) encoder and decoder/generator are utilised (with random initialisation). In the ResNet18 encoder/decoder architecture, each layer consists of 4 residual blocks and each block is 2 layers deep. We use the same discriminator as in the α-GAN.

The dimension of the latent space is 128, and $p=2$ since we use the $l_2$ norm (i.e., the mean squared error).

All networks are implemented in Python using PyTorch, on a workstation with an NVIDIA Titan X GPU.

3.1 Anomaly detection score

In order to predict an anomaly score $s$, three different strategies are utilised. For an unseen image $x_{unseen}$ and its reconstruction $\hat{x}_{unseen}$, we use as a baseline the reconstruction error (residual), defined as the squared $l_2$ norm between the image and its reconstruction, i.e., $s_{rec}=\|x_{unseen}-\hat{x}_{unseen}\|_{2}^{2}$.

The second candidate for $s$ is the output of the discriminator. $D$ should give high scores for original, normal images, but low scores for abnormal images, so $s_{discr}=1-D(x_{unseen})$. Finally, we compute an anomaly score using a gradient-based method, GradCAM++ (Chattopadhay et al., 2018). Inspired by (Kimura et al., 2020; Venkataramanan et al., 2019; Liu et al., 2020), we apply GradCAM++ to the score of the discriminator with respect to its last rectified convolutional layer. This produces attention maps and is also valuable for localising the pathology. The intuition behind using attention maps for computing anomaly scores is the hypothesis that, after training, the discriminator not only learns to discriminate between normal and abnormal samples but also learns to focus on the relevant features in the image. Thus, specifically for HLHS, where the left ventricle is missing or severely underdeveloped compared to normal samples, the discriminator should identify and locate this difference. The GradCAM++ map is computed as follows:

Let $y$ be the logits of the last layer, as derived from the discriminator network $D(x_{unseen})$. For spatial positions $(i,j)$ and $(a,b)$ of the feature map $A^{k}$, we compute the weights:

$w_{k}=\sum_{i}\sum_{j}\alpha_{ij}^{k}\,\mathrm{ReLU}\!\left(\frac{\partial y}{\partial A_{ij}^{k}}\right),$   (6)

where the gradient weights $\alpha_{ij}^{k}$ are computed as:

$\alpha_{ij}^{k}=\frac{\frac{\partial^{2}y}{(\partial A_{ij}^{k})^{2}}}{2\frac{\partial^{2}y}{(\partial A_{ij}^{k})^{2}}+\sum_{a}\sum_{b}A_{ab}^{k}\frac{\partial^{3}y}{(\partial A_{ij}^{k})^{3}}},$   (7)

and the saliency map (SM) is computed as a linear combination of the forward activation maps followed by a ReLU layer:

$SM_{ij}=\mathrm{ReLU}\!\left(\sum_{k}w_{k}A_{ij}^{k}\right).$   (8)

We then compute the sum of the attention maps of the image $x_{unseen}$ and of its reconstruction $\hat{x}_{unseen}$ produced by the generator network:

$M=SM(D_{x_{unseen}})+SM(D_{\hat{x}_{unseen}}),$   (9)

and finally compute the anomaly score $s_{attn}$ as

$s_{attn}=\frac{\|M\times(x_{unseen}-\hat{x}_{unseen})\|_{2}^{2}}{\|M\|_{2}^{2}}.$   (10)

This anomaly score encapsulates the reconstruction information (Kimura et al., 2020): the reconstruction of a normal image should be crisper than that of an anomalous observation. Finally, we attempted to combine anomaly scores, such as $s_{rec}$ with $s_{discr}$; however, the anomaly detection performance did not improve noticeably.
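The three scores can be summarised in a few lines. The sketch below assumes that the GradCAM++ saliency maps of Eq. 8 have already been computed (e.g., with an off-the-shelf GradCAM++ implementation applied to the discriminator) and upsampled to the image resolution; tensor shapes and names are placeholders.

```python
import torch


def anomaly_scores(x, x_hat, d_out, sm_x, sm_xhat):
    """Anomaly scores of Section 3.1 for a single test image.

    x, x_hat       -- image and its reconstruction, tensors of shape (1, 1, H, W)
    d_out          -- scalar discriminator output D(x) in [0, 1]
    sm_x, sm_xhat  -- GradCAM++ saliency maps (Eq. 8) of x and x_hat w.r.t. the
                      discriminator, resized to (1, 1, H, W)
    """
    s_rec = torch.sum((x - x_hat) ** 2)                              # residual score
    s_discr = 1.0 - float(d_out)                                     # discriminator score
    M = sm_x + sm_xhat                                               # Eq. 9
    s_attn = torch.sum((M * (x - x_hat)) ** 2) / torch.sum(M ** 2)   # Eq. 10
    return float(s_rec), s_discr, float(s_attn)
```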

3.2 Data

The available dataset contains 2D ultrasound images of four-chamber cardiac views. These are standard diagnostic views according to (NHS, 2015). The images comprise labelled examples of normal fetal hearts and of hearts with Hypoplastic Left Heart Syndrome (HLHS) (HLHS, 2019) from the same clinic, acquired exclusively with an Aplio i800 GI system for both groups to avoid systematic domain differences. HLHS is a birth defect that affects normal blood flow through the heart: a number of structures on the left side of the heart do not fully develop.

Our dataset consists of 2317 four-chamber view images, of which 2224 are normal and 93 are abnormal cases. Healthy control view planes have been automatically extracted from examination screen-capture videos using a Sononet network (Baumgartner et al., 2017) and manually cleaned of visually trivial classification errors. A set of HLHS view planes that would resemble a four-chamber view in healthy subjects has been extracted with the same automated Sononet pipeline. Another set has been manually extracted from the examination videos by a fetal cardiologist; 38 cases that are not within 19+0 to 20+6 weeks of gestation or that show a mix of pathologies have been rejected.

For training, 2131 four-chamber view images, which are considered normal cases, are used; only images from normal fetuses are used during training. For testing, two different datasets are derived for three different testing scenarios:

For $\mathbf{dataset_1}$ (Figure 2) we use four-chamber views from all available HLHS cases, extracted by Sononet and cleaned of gross classification errors; in total 93 cases. A further 93 normal cases have been randomly selected from the remaining test split of the healthy controls and added to this dataset. HLHS cases are challenging for Sononet, which has been trained only on healthy views; thus, for HLHS cases, it will only select views that are close to the feature distribution of healthy four-chamber views, which are not necessarily the views a clinician would have chosen. For $\mathbf{dataset_2}$ (Figure 2), we use the 93 normal cases from $dataset_1$ and the expert-curated HLHS images from the remaining, non-excluded 53 cases. For each of these cases, 1 to 4 different view planes have been identified as clinically conclusive. With this dataset we perform two different subject-level experiments: a) selecting one of the four frames randomly, and b) using all of the 177 clinically selected views in these 53 subjects and fusing the individual abnormality scores to obtain a subject-level assessment. We also evaluate per-frame anomaly results.

Figure 2: Graphical description of Dataset 1 and Dataset 2.

The images are rescaled to 64×64 pixels and normalised to a [0,1] value range. No image augmentation is used.
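A minimal preprocessing sketch matching this description is shown below; the use of torchvision, the single-channel conversion and the interpolation mode are assumptions.

```python
from torchvision import transforms

# Rescale to 64x64 and map pixel values to [0, 1]; no augmentation.
preprocess = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # treat ultrasound frames as single-channel
    transforms.Resize((64, 64)),
    transforms.ToTensor(),                        # scales uint8 pixels to [0, 1]
])
```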

4 Evaluation and Results

We evaluate our algorithm both quantitatively as well as qualitatively. The capability of the proposed method to localise the pathology is also examined.

4.1 Quantitative analysis

For evaluation, the anomaly score is computed as described in Section 3.1. For the α-GAN and the VAE-GAN we use $s_{attn}$, $s_{rec}$ and $s_{discr}$ as anomaly scores, as presented in Section 3.1.

For comparison with the state of the art we evaluate four alternative models: a convolutional autoencoder (CAE) (Makhzani and Frey, 2015; Masci et al., 2011), One-class Deep Support Vector Data Description (DeepSVDD) (Ruff et al., 2018), f-AnoGAN (Schlegl et al., 2019) and the VAE-GAN described in Section 3.

The deep convolutional autoencoder (DCAE) (Makhzani and Frey, 2015; Masci et al., 2011) serves as a baseline and is trained with an MSE loss. For the DCAE and One-class DeepSVDD we use the same architectures as those used for the CIFAR10 dataset in the original work (Ruff et al., 2018). The reconstruction error, i.e., $\|x-De(E(x))\|_{2}$, is used as the anomaly score ($s_{DCAE}$).

Deep Support Vector Data Description (DeepSVDD) (Ruff et al., 2018) computes the hypersphere of minimum volume that contains every point in the training set. By minimising the sphere's volume, the chance of including points that do not belong to the target class distribution is minimised. Since in our case all training data belongs to one class (the negative, healthy class), we use the one-class objective of (Ruff et al., 2018). Let $f$ be the network function of a deep neural network with $L$ layers and $\theta^{l}$ the weight parameters of the $l$-th layer. We denote the centre of the hypersphere as $o$. The objective of the network is to minimise the loss:

$\mathcal{L}_{SVDD}=\min_{\theta}\frac{1}{N}\sum_{i=1}^{N}\|f(x_{i})-o\|^{2}+\frac{\lambda}{2}\sum_{l=1}^{L}\|\theta^{l}\|^{2}.$

The centre $o$ is set to the mean of the network outputs obtained in an initial forward pass. At inference, the anomaly score ($s_{svdd}$) is then defined as the distance of a new test sample to the centre of the hypersphere, i.e., $\|f(x)-o\|^{2}$.
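A sketch of this objective and of the corresponding anomaly score, given an embedding network `model` and a fixed centre, could look as follows; writing the weight penalty explicitly mirrors the equation, although in practice it is often delegated to the optimiser's weight decay.

```python
import torch


def deep_svdd_loss(model, x, center, weight_decay=1e-6):
    """One-class Deep SVDD objective: mean squared distance of the embeddings
    to the fixed centre plus an L2 penalty on the network weights."""
    dist = torch.sum((model(x) - center) ** 2, dim=1).mean()
    reg = sum(torch.sum(p ** 2) for p in model.parameters())
    return dist + 0.5 * weight_decay * reg


def deep_svdd_score(model, x, center):
    """Anomaly score at inference: squared distance to the hypersphere centre."""
    return torch.sum((model(x) - center) ** 2, dim=1)
```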

f-AnoGAN (Schlegl et al., 2019) is described in Section 2.2. We were not able to successfully train f-AnoGAN using the same networks as those used for the α-GAN; hence we utilise similar networks and an identical training framework as described in (Schlegl et al., 2019). We follow the $izi_f$ training procedure for the encoder network. As anomaly detection score ($s_{anogan}$), a combination of the $l_2$ residual loss between the image and its reconstruction and the $l_2$ norm of the residual of the discriminator's intermediate feature representation is used, as defined in Table 1.

In all algorithms the latent dimension is set to 128. We run all experiments 5 times using different random seeds (Lucic et al., 2018). We report the average precision, recall and specificity at the Youden index of the receiver operating characteristic (ROC) curves, as well as the average corresponding area under the curve (AUC) over the 5 runs of each experiment. Furthermore, we apply DeLong's test (DeLong et al., 1988) to obtain z-scores and p-values in order to test how statistically different the AUC of the proposed model is from the corresponding AUCs of the state-of-the-art models (CAE, DeepSVDD, f-AnoGAN and VAE-GAN). We perform four different experiments:

Experiment 1 uses $dataset_1$ and aims to evaluate general, frame-level outlier detection performance, including erroneous classifications and fetuses below the expected age range. In Table 2, the best performing model based on AUC is the α-GAN using $s_{attn}$ as anomaly score, which achieves an average AUC of 0.82 ± 0.012. The α-GAN model also achieves the best precision. However, regarding recall and F1 score, the VAE-GAN outperforms the α-GAN with 0.88 and 0.78, respectively. DeepSVDD shows the best specificity at 0.76. Figure 3 shows the ROC for the best performing (AUC, F1) initialisation and the distribution of normal and abnormal scores for the best α-GAN model at the Youden index. We present confusion matrices for the α-GAN and the VAE-GAN models in Figure 3c and Figure 3d. For normal cases both models achieve similar classification performance; however, for identifying abnormal cases the α-GAN seems to have an advantage.

Based on DeLong's test for Exp. 1, for the average scores (over the five runs), the α-GAN compared to f-AnoGAN yields z = -5.22 and p = 1.80e-07. Similarly, the values for the α-GAN compared to the CAE are z = -4.82 and p = 1.37e-06. Finally, comparing the α-GAN and DeepSVDD results in z = -6.49 and p = 8.52e-11. Since p < 0.01 for all comparisons, we can assume that the α-GAN performs significantly better than the state-of-the-art when applied to fetal cardiac ultrasound screening for HLHS. Comparing the α-GAN with the VAE-GAN, the values are z = -1.21 and p = 0.22, which does not indicate a significant difference between the AUC curves. As can be seen from the results, the GAN-based methods achieve better performance for detecting HLHS.
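For reference, a minimal sketch of how the operating point at the Youden index and the metrics reported in the tables below could be derived from per-image anomaly scores is shown here; the use of scikit-learn is an assumption, since the paper does not name its evaluation tooling.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve


def youden_metrics(scores, labels):
    """Precision, recall, specificity, F1 and AUC at the Youden index of the ROC.
    scores: anomaly scores (higher = more abnormal); labels: 1 = HLHS, 0 = normal."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    fpr, tpr, thresholds = roc_curve(labels, scores)
    j = np.argmax(tpr - fpr)                       # Youden index J = max(TPR - FPR)
    pred = (scores >= thresholds[j]).astype(int)
    tp = np.sum((pred == 1) & (labels == 1))
    fp = np.sum((pred == 1) & (labels == 0))
    tn = np.sum((pred == 0) & (labels == 0))
    fn = np.sum((pred == 0) & (labels == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)                        # sensitivity
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return dict(precision=precision, recall=recall, specificity=specificity,
                f1=f1, auc=roc_auc_score(labels, scores))
```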

Quantitative performance scores

Method | Precision | Recall | Specificity | F1 score | AUC
CAE (Ruff et al., 2018) | 0.65 ± 0.027 | 0.64 ± 0.061 | 0.65 ± 0.074 | 0.64 ± 0.061 | 0.65 ± 0.016
DeepSVDD (Ruff et al., 2018) | 0.67 ± 0.106 | 0.37 ± 0.258 | **0.76** ± 0.260 | 0.41 ± 0.150 | 0.53 ± 0.039
f-AnoGAN (Schlegl et al., 2019) | 0.58 ± 0.022 | 0.58 ± 0.130 | 0.59 ± 0.097 | 0.57 ± 0.072 | 0.57 ± 0.039
$s_{rec}$ (VAE-GAN) | 0.69 ± 0.018 | **0.88** ± 0.060 | 0.61 ± 0.057 | **0.78** ± 0.015 | 0.78 ± 0.010
$s_{discr}$ (VAE-GAN) | 0.75 ± 0.220 | 0.29 ± 0.360 | 0.75 ± 0.360 | 0.27 ± 0.230 | 0.42 ± 0.027
$s_{attn}$ (VAE-GAN) | 0.72 ± 0.014 | 0.83 ± 0.043 | 0.68 ± 0.037 | 0.77 ± 0.014 | 0.79 ± 0.008
$s_{rec}$ (α-GAN) | 0.64 ± 0.017 | 0.87 ± 0.054 | 0.50 ± 0.038 | 0.74 ± 0.024 | 0.71 ± 0.029
$s_{discr}$ (α-GAN) | 0.65 ± 0.056 | 0.51 ± 0.240 | 0.70 ± 0.205 | 0.53 ± 0.170 | 0.61 ± 0.067
$s_{attn}$ (α-GAN) | **0.73** ± 0.026 | 0.82 ± 0.068 | 0.70 ± 0.059 | 0.77 ± 0.029 | **0.82** ± 0.012

Table 2: Anomaly detection performance for Exp. 1 using $dataset_1$. Best performance in bold.
Figure 3: (a) ROC-AUC curves in Exp. 1; (b) distribution of normal/abnormal score values for the α-GAN model with $s_{attn}$ as anomaly score; (c) confusion matrix for the best performing run of the proposed α-GAN; (d) confusion matrix for the best performing run of the VAE-GAN. This figure focuses on the results of the best performing initialisation from five experiments with the α-GAN (or VAE-GAN), while Table 2 shows average metrics.

Experiment 2 uses $dataset_2$ to assess specific disease detection capabilities with expert-curated, clinically conclusive four-chamber views for 53 HLHS cases. We choose one of the relevant views per subject at random. Table 3 summarises these results. The VAE-GAN has the highest AUC, F1, precision and specificity scores using $s_{attn}$ as anomaly score. We also note from Figure 4c and Figure 4d that the VAE-GAN misclassifies fewer HLHS cases while achieving better performance in confirming normal cases. The corresponding average AUC is 0.89. Figure 4 shows the ROC, the anomaly score distributions and the confusion matrices at the Youden index for this experiment.

Quantitative performance scores

Method | Precision | Recall | Specificity | F1 score | AUC
CAE (Ruff et al., 2018) | 0.63 ± 0.095 | 0.56 ± 0.120 | 0.78 ± 0.130 | 0.57 ± 0.025 | 0.72 ± 0.015
DeepSVDD (Ruff et al., 2018) | 0.39 ± 0.016 | 0.80 ± 0.160 | 0.28 ± 0.160 | 0.52 ± 0.032 | 0.49 ± 0.038
f-AnoGAN (Schlegl et al., 2019) | 0.56 ± 0.077 | 0.52 ± 0.097 | 0.75 ± 0.140 | 0.53 ± 0.041 | 0.63 ± 0.043
$s_{rec}$ (VAE-GAN) | 0.64 ± 0.067 | 0.80 ± 0.060 | 0.74 ± 0.078 | 0.71 ± 0.020 | 0.84 ± 0.009
$s_{discr}$ (VAE-GAN) | 0.36 ± 0.220 | 0.56 ± 0.450 | 0.46 ± 0.430 | 0.34 ± 0.205 | 0.39 ± 0.037
$s_{attn}$ (VAE-GAN) | **0.71** ± 0.046 | 0.85 ± 0.038 | **0.80** ± 0.058 | **0.77** ± 0.016 | **0.89** ± 0.009
$s_{rec}$ (α-GAN) | 0.59 ± 0.050 | **0.81** ± 0.060 | 0.66 ± 0.010 | 0.68 ± 0.015 | 0.79 ± 0.030
$s_{discr}$ (α-GAN) | 0.48 ± 0.100 | 0.51 ± 0.280 | 0.61 ± 0.280 | 0.43 ± 0.110 | 0.53 ± 0.030
$s_{attn}$ (α-GAN) | 0.59 ± 0.098 | 0.76 ± 0.150 | 0.66 ± 0.180 | 0.64 ± 0.037 | 0.77 ± 0.046

Table 3: Anomaly detection performance using $dataset_2$ for Exp. 2. Best performance in bold.
Figure 4: $dataset_2$, Exp. 2: (a) ROC-AUC curves in Exp. 2; (b) distribution of normal/abnormal score values for the VAE-GAN model with $s_{attn}$ as anomaly score; (c) confusion matrix for the best performing run of the proposed α-GAN using $s_{rec}$; (d) confusion matrix for the best performing run of the VAE-GAN using $s_{attn}$. This figure focuses on the results of the best performing initialisation from five experiments with the α-GAN (or VAE-GAN), while Table 3 shows average metrics.

Experiment 3 uses $dataset_2$ and is similar to Exp. 2, except that we take all clinically identified views for each subject into account. We average the individual frame-level anomaly scores, depending on the number of frames available per subject. The VAE-GAN achieves a better AUC of 0.86 compared to 0.84 for the α-GAN, as can be seen in Table 4. However, as can be seen from the confusion matrices of the best performing initialisation (Figure 5c and 5d), the α-GAN shows a better true positive rate at the cost of a higher number of false positives. This configuration might be preferred in a clinical setting since it reduces the number of missed cases at the cost of a slightly higher number of false referrals.
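Fusing the per-frame scores into a subject-level score, as described above, amounts to a simple grouping and averaging step; a sketch is given below (the data structures are placeholders).

```python
import numpy as np


def subject_level_scores(frame_scores, subject_ids):
    """Average the per-frame anomaly scores of each subject (Exp. 3)."""
    per_subject = {}
    for score, sid in zip(frame_scores, subject_ids):
        per_subject.setdefault(sid, []).append(score)
    return {sid: float(np.mean(v)) for sid, v in per_subject.items()}
```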

Quantitative performance scores

Method | Precision | Recall | Specificity | F1 score | AUC
CAE (Ruff et al., 2018) | 0.51 ± 0.061 | 0.80 ± 0.136 | 0.54 ± 0.150 | 0.61 ± 0.018 | 0.70 ± 0.024
DeepSVDD (Ruff et al., 2018) | 0.42 ± 0.063 | 0.69 ± 0.312 | 0.39 ± 0.311 | 0.47 ± 0.140 | 0.48 ± 0.038
f-AnoGAN (Schlegl et al., 2019) | 0.55 ± 0.029 | 0.79 ± 0.067 | 0.62 ± 0.068 | 0.64 ± 0.016 | 0.74 ± 0.013
$s_{rec}$ (VAE-GAN) | 0.60 ± 0.029 | 0.87 ± 0.049 | 0.67 ± 0.056 | 0.71 ± 0.014 | 0.81 ± 0.076
$s_{discr}$ (VAE-GAN) | 0.37 ± 0.150 | 0.98 ± 0.400 | 0.032 ± 0.39 | 0.53 ± 0.021 | 0.14 ± 0.034
$s_{attn}$ (VAE-GAN) | **0.66** ± 0.036 | 0.88 ± 0.035 | **0.74** ± 0.050 | **0.75** ± 0.014 | **0.86** ± 0.017
$s_{rec}$ (α-GAN) | 0.57 ± 0.041 | 0.86 ± 0.091 | 0.62 ± 0.098 | 0.68 ± 0.022 | 0.78 ± 0.019
$s_{discr}$ (α-GAN) | 0.42 ± 0.035 | 0.89 ± 0.110 | 0.28 ± 0.155 | 0.57 ± 0.067 | 0.48 ± 0.017
$s_{attn}$ (α-GAN) | 0.62 ± 0.040 | **0.92** ± 0.100 | 0.67 ± 0.069 | 0.73 ± 0.024 | 0.84 ± 0.018

Table 4: Anomaly detection performance on subject level for $dataset_2$ and Exp. 3. Best performance in bold.
Figure 5: $dataset_2$, Exp. 3: (a) ROC-AUC curves in Exp. 3; (b) distribution of normal/abnormal score values for the VAE-GAN model with $s_{attn}$ as anomaly score; (c) confusion matrix for the best performing run of the proposed α-GAN; (d) confusion matrix for the best performing run of the VAE-GAN. This figure focuses on the results of the best performing initialisation from five experiments with the α-GAN (or VAE-GAN), while Table 4 shows average metrics.

Experiment 4 is similar to Exp. 3, except that we evaluate frame-level performance (Table 5). The VAE-GAN is again better in terms of precision and AUC. However, similar to Exp. 3, the α-GAN has an advantage when recognising cases with pathology, at the cost of a higher false positive rate.

Quantitative performance scores

Method | Precision | Recall | Specificity | F1 score | AUC
CAE (Ruff et al., 2018) | 0.80 ± 0.026 | 0.57 ± 0.081 | 0.71 ± 0.075 | 0.66 ± 0.051 | 0.67 ± 0.020
DeepSVDD (Ruff et al., 2018) | 0.86 ± 0.100 | 0.09 ± 0.030 | **0.96** ± 0.025 | 0.15 ± 0.053 | 0.44 ± 0.025
f-AnoGAN (Schlegl et al., 2019) | 0.82 ± 0.041 | 0.56 ± 0.070 | 0.75 ± 0.095 | 0.66 ± 0.040 | 0.66 ± 0.013
$s_{rec}$ (VAE-GAN) | 0.82 ± 0.023 | 0.74 ± 0.062 | 0.69 ± 0.073 | 0.77 ± 0.024 | 0.77 ± 0.009
$s_{discr}$ (VAE-GAN) | 0.80 ± 0.130 | 0.03 ± 0.007 | 0.99 ± 0.008 | 0.05 ± 0.012 | 0.37 ± 0.047
$s_{attn}$ (VAE-GAN) | **0.86** ± 0.016 | 0.78 ± 0.051 | 0.76 ± 0.046 | 0.82 ± 0.023 | **0.82** ± 0.023
$s_{rec}$ (α-GAN) | 0.80 ± 0.016 | 0.80 ± 0.032 | 0.62 ± 0.051 | 0.80 ± 0.012 | 0.75 ± 0.017
$s_{discr}$ (α-GAN) | 0.71 ± 0.060 | 0.72 ± 0.300 | 0.38 ± 0.320 | 0.66 ± 0.180 | 0.48 ± 0.055
$s_{attn}$ (α-GAN) | 0.82 ± 0.030 | **0.85** ± 0.110 | 0.64 ± 0.094 | **0.83** ± 0.047 | 0.81 ± 0.018

Table 5: Anomaly detection performance using $dataset_2$ in Exp. 4 for evaluation per frame. Best performance in bold.
Figure 6: dataset_2, Exp. 4: (a) ROC-AUC curves in Exp. 4; (b) distribution of normal/abnormal score values for the α-GAN model with s_attn as anomaly score; (c) confusion matrix for the best performing run of the proposed α-GAN; (d) confusion matrix for the best performing run of the VAE-GAN. This figure focuses on the results of the best performing initialisation from five runs of the α-GAN (or VAE-GAN), while Table 5 shows average metrics.

4.2 Qualitative analysis

In order to evaluate the ability of the algorithm to localise anomalies, we plot the class activation maps derived from the proposed model. Figure 7 presents results from abnormal cases in dataset_1 (Exp. 1). In the abnormal cases, attention focuses precisely on the area of the heart. As a consequence, anomaly scores in such cases are higher than in normal cases and correctly flag the anomalous subjects. All anomaly scores are normalised to the range [0, 1].

There are cases that our algorithm fails to classify correctly: either they are abnormal and classified as normal (false negatives, FN), or they are healthy and identified as anomalous (false positives, FP). Figure 8 presents examples of false positive cases alongside false negative cases. Poor image reconstruction quality is a limiting factor. For instance, in some reconstructions a part of the heart is missing (left or right ventricle/atrium) or the shape of the heart differs markedly from a normal heart (e.g., a very "long" ventricle). As a consequence, not only is the reconstruction error high, but the attention mechanism also focuses on this area, since the network recognises it as anomalous; the total anomaly score is therefore high. In a few examples the signal-to-noise ratio (SNR) is low, i.e., the images are blurry, and the network fails to reconstruct them at all. Furthermore, in the false positive examples in Figure 8a the imaging angle is, from a clinical perspective, not quite right, which makes the ventricles look shorter than they are. This confuses the model, drawing the discriminator's attention to this area and marking it as anomalous. It is also worth highlighting that some frames are very difficult even for experts. Such an example is given in Figure 8b: although the second image from the left belongs to an abnormal subject, this specific frame appears normal at first glance. Such cases also highlight the limitations of single-view approaches. In practice, all relevant frames showing the four-chamber view could be processed with our method, and a majority vote regarding referral could be calibrated on a ROC curve.
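As a minimal sketch of the two mechanics mentioned above, score normalisation to [0, 1] and a per-subject majority vote over four-chamber frames, the following code uses assumed thresholds and synthetic scores; it is illustrative only and not the pipeline used for the reported results.

```python
import numpy as np

def minmax_normalise(scores):
    """Scale raw anomaly scores to [0, 1] (as done for the reported score values)."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def subject_vote(frame_scores, frame_threshold=0.5, vote_fraction=0.5):
    """Flag a subject as abnormal if more than `vote_fraction` of its
    four-chamber frames exceed the frame-level threshold (assumed rule)."""
    flags = np.asarray(frame_scores) >= frame_threshold
    return flags.mean() > vote_fraction

# Hypothetical subject with five four-chamber frames of varying difficulty
frames = minmax_normalise([2.1, 3.4, 0.9, 4.2, 3.8])
print("normalised scores:", frames)
print("refer subject:", subject_vote(frames, frame_threshold=0.5))
```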

All of the above plots and comparisons use the top-1 performing run among all α-GAN experiments.

Figure 7: Top row: pathological subjects. Bottom row: Grad-CAM++ visualisation of attention maps using the α-GAN (Exp. 1). * = dominant RV with no visible LV cavity; solid white arrow = deceptively normal-looking LV; dashed white arrow = globular, hypoplastic LV.
Figure 8: (a) Examples of false positive cases along with the anomaly scores s_attn; (b) false negative cases along with the anomaly scores s_attn (Exp. 1). * = dominant RV with no visible LV cavity; solid white arrow = deceptively normal-looking LV; dashed white arrow = globular, hypoplastic LV; low signal-to-noise ratio (SNR).

5 Discussion

Our results are promising and confirm that automated anomaly detection can work in fetal 2D ultrasound, as shown on the example of HLHS. For this pathology we achieve an average AUC of 0.81, a significant improvement over the detection rate achieved by front-line-of-care sonographers during screening, which is often below 60% (Chew et al., 2007). However, several open issues remain.

False negative rates are critical for clinical diagnosis and downstream treatment. In a clinical setting, a method with zero false negative predictions would be preferred, i.e., a method that never misses an anomaly but potentially produces a few false positives. Assuming that the false positive rate of such an algorithm is significantly below the status quo, the benefits for antenatal detection, and potentially better postnatal outcomes, would outweigh the costs. Of course, an algorithm with a 100% false positive rate is also not desirable; hence, an operating point must be calibrated on the ROC curve.
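Such a calibration step can be sketched as follows: given validation labels and anomaly scores, pick the strictest threshold that reaches a target sensitivity and report the resulting false positive rate. The target of 0.99 and the synthetic data are assumptions for illustration, not values from our study.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_sensitivity(labels, scores, target_sensitivity=0.99):
    """Pick the highest threshold whose true-positive rate (sensitivity)
    reaches the requested level; return it with the resulting FPR."""
    fpr, tpr, thresholds = roc_curve(labels, scores)
    ok = np.where(tpr >= target_sensitivity)[0]
    if len(ok) == 0:
        raise ValueError("target sensitivity not reachable")
    idx = ok[0]                     # first (strictest) threshold meeting the target
    return thresholds[idx], fpr[idx]

# Hypothetical validation scores/labels (1 = HLHS, 0 = normal)
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(0.3, 0.15, 200), rng.normal(0.7, 0.15, 50)])
labels = np.concatenate([np.zeros(200, dtype=int), np.ones(50, dtype=int)])

thr, fp_rate = threshold_for_sensitivity(labels, scores, 0.99)
print(f"operating threshold = {thr:.3f}, false-positive rate = {fp_rate:.2f}")
```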

A key aspect of the proposed algorithm is the ability of the discriminator to highlight decisive areas in images. In order to achieve this, it is necessary to produce good reconstructions of normal images. However, reconstruction quality can be limited, depending on the given sample; a larger dataset could mitigate this. Furthermore, alternative ways of visualising attention, such as implicit attention mechanisms like attention gates (Schlemper et al., 2019), could be explored for disease-specific applications.
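For orientation, the attention maps in Figure 7 are obtained with Grad-CAM++ (Chattopadhay et al., 2018) applied to the discriminator. The sketch below implements the simpler, closely related vanilla Grad-CAM on a toy PyTorch discriminator, purely to illustrate the mechanism; the network, layer choice and input are placeholders and not our architecture.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image):
    """Vanilla Grad-CAM for a single-logit discriminator (illustrative only;
    the reported maps use Grad-CAM++, which refines the channel weighting)."""
    store = {}

    def fwd_hook(module, inputs, output):
        store["acts"] = output                       # feature maps of target layer

    def bwd_hook(module, grad_input, grad_output):
        store["grads"] = grad_output[0]              # gradient w.r.t. those maps

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    score = model(image)                             # scalar anomaly/realness logit
    model.zero_grad()
    score.sum().backward()
    h1.remove()
    h2.remove()

    weights = store["grads"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
    cam = F.relu((weights * store["acts"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam - cam.amin()) / (cam.amax() - cam.amin() + 1e-8)  # (N, 1, H, W)

# Toy discriminator and a random "frame", purely for demonstration.
disc = torch.nn.Sequential(
    torch.nn.Conv2d(1, 8, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(8, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 1))
frame = torch.randn(1, 1, 64, 64)
heatmap = grad_cam(disc, disc[2], frame)             # hook the last conv layer
print(heatmap.shape)                                 # torch.Size([1, 1, 64, 64])
```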

Although we have experimented with different types of noise (e.g., uniform) and various augmentation techniques (e.g., horizontal flips, intensity changes), we did not observe an improvement in anomaly detection performance. However, other augmentation techniques deserve further investigation.
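A possible composition of the augmentations mentioned above (horizontal flips, intensity changes, additive uniform noise) is sketched below with torchvision-style transforms; the parameter ranges are assumptions for illustration and do not reproduce our training configuration.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

class AddUniformNoise:
    """Add pixel-wise uniform noise in [-amplitude, amplitude] to a tensor image."""
    def __init__(self, amplitude=0.05):
        self.amplitude = amplitude
    def __call__(self, x):
        noise = (torch.rand_like(x) * 2 - 1) * self.amplitude
        return (x + noise).clamp(0.0, 1.0)

# Assumed augmentation pipeline for normal-only training frames
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # intensity changes
    transforms.ToTensor(),                                   # PIL -> tensor in [0, 1]
    AddUniformNoise(amplitude=0.05),
])

# Demonstration on a synthetic grayscale "ultrasound" frame
frame = Image.fromarray((np.random.rand(64, 64) * 255).astype("uint8"), mode="L")
augmented = augment(frame)
print(augmented.shape, float(augmented.min()), float(augmented.max()))
```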

Moreover, it would be interesting to explore the sensitivity of our method to other sub-types of congenital heart disease. Intuitively, the accuracy of a general anomaly detection method should be similarly high for other syndromes that affect the morphological appearance of the fetal four-chamber view. HLHS has a particularly grossly abnormal appearance; many other CHD examples with a subtly abnormal four-chamber view would probably be much harder to detect, even for human experts. Additionally, in practice, confounding factors may bias anomaly detection methods towards more obvious outliers, while subtle signs of disease, or indicators encoded in other dimensions such as the spatio-temporal domain, may still be missed.

Finally, robust time-series analysis is still a challenging fundamental research question and we are looking forward to extending our method to full video sequences in future work.

6 Conclusion

In this paper we cast the detection of congenital heart disease as a one-class anomaly detection problem, learning only from normal samples. The proposed unsupervised architecture shows promising results and achieves better performance than existing state-of-the-art image anomaly detection methods. However, since clinical practice requires highly reliable anomaly detection, more work is needed to reduce false positives, which cause patient stress and strain on healthcare systems, and false negatives, which lead to missed diagnoses.

Acknowledgements

EC was supported by an EPSRC DTP award. TD was supported by an NIHR Doctoral Fellowship. We thank the volunteers and sonographers from routine fetal screening at St. Thomas’ Hospital London. This work was supported by the Wellcome Trust IEH Award [102431] for the Intelligent Fetal Imaging and Diagnosis project (www.ifindproject.com) and EPSRC EP/S013687/1. The study has been granted NHS R&D and ethics approval, NRES ref. no. 14/LO/1086. The research was funded/supported by the National Institute for Health Research (NIHR) Biomedical Research Centre based at Guy’s and St Thomas’ NHS Foundation Trust, King’s College London and the NIHR Clinical Research Facility (CRF) at Guy’s and St Thomas’. Data access is granted only in line with the informed consent of the participants, subject to approval by the project ethics board and under a formal Data Sharing Agreement. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

References

  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
  • Arnaout et al. (2020) R. Arnaout, L. Curran, Y. Zhao, J. Levine, E. Chinn, and A. Moon-Grady. Expert-level prenatal detection of complex congenital heart disease from screening ultrasound using deep learning. medRxiv, 2020. doi: 10.1101/2020.06.22.20137786.
  • Baumgartner et al. (2017) C F. Baumgartner, K. Kamnitsas, J. Matthew, T. P. Fletcher, S. Smith, L M. Koch, B. Kainz, and D. Rueckert. Sononet: Real-time detection and localisation of fetal standard scan planes in freehand ultrasound. IEEE Trans. Medical Imaging, 36(11):2204–2215, 2017.
  • Baur et al. (2018) C. Baur, B. Wiestler, S. Albarqouni, and N. Navab. Deep autoencoding models for unsupervised anomaly segmentation in brain mr images. International MICCAI Brainlesion Workshop, pages 161–169, 2018.
  • Bennasar et al. (2010) M. Bennasar, J.M Martínez, O. Gómez, J. Bartrons, A. Olivella, B. Puerto, and E. Gratacós. Accuracy of four-dimensional spatiotemporal image correlation echocardiography in the prenatal diagnosis of congenital heart defects. Ultrasound in Obstetrics and Gynecology, 36(4), pages 458–464, 2010.
  • Chattopadhay et al. (2018) A. Chattopadhay, A. Sarkar, P. Howlader, and V. N. Balasubramanian. Grad-CAM++: Generalized Gradient-Based Visual Explanations for Deep Convolutional Networks. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 839–847, 2018.
  • Chew et al. (2007) C Chew, JL Halliday, MM Riley, and DJ Penny. Population-based study of antenatal detection of congenital heart disease by ultrasound examination. Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, 29(6):619–624, 2007.
  • DeLong et al. (1988) E.R DeLong, D.M DeLong, and D.L Clarke-Pearson. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3):837–845, 1988. doi: 10.2307/2531595.
  • Dosovitskiy and Brox (2016) A. Dosovitskiy and T. Brox. Generating images with perceptual similarity metrics based on deep networks. Advances in Neural Information Processing Systems 29 (NIPS), 29, 2016.
  • Gong et al. (2020) Y. Gong, Y. Zhang, H. Zhu, J. Lv, H. Cheng, Q. Zhang, Y. He, and S. Wang. Fetal congenital heart disease echocardiogram screening based on dgacnn: Adversarial one-class classification combined with video transfer learning. IEEE Transactions on Medical Imaging, 39(4):1206–1222, 2020.
  • Gulrajani et al. (2017) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and AC. Courville. Improved training of Wasserstein GANs. arXiv preprint arXiv:1704.00028, 2017.
  • He et al. (2016) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • HLHS (2019) HLHS. Facts about Hypoplastic Left Heart Syndrome, National Center on Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention. https://www.cdc.gov/ncbddd/heartdefects/hlhs.html, 2019.
  • Holland et al. (2015) B. J. Holland, J. A. Myers, and C. R. Woods Jr. Prenatal diagnosis of critical congenital heart disease reduces risk of death from cardiovascular compromise prior to planned neonatal cardiac surgery: a meta-analysis. Ultrasound in Obstetrics and Gynecology, 45:631–638, 2015.
  • Isola et al. (2017) P. Isola, J-H. Zhu, T. Zhou, and AA. Efros. Image-to-image translation with conditional adversarial networks. CVPR, 2017.
  • Kimura et al. (2020) D. Kimura, S. Chaudhury, M. Narita, A. Munawar, and R. Tachibana. Adversarial Discriminative Attention for Robust Anomaly Detection. IEEE Winter Conference on Applications of Computer Vision, WACV, pages 2161–2170, 2020.
  • Kwon et al. (2019) G. Kwon, C. Han, and D. Kim. Generation of 3D Brain MRI Using Auto-Encoding Generative Adversarial Networks. Medical Image Computing and Computer Assisted Intervention - MICCAI 2019 - 22nd International Conference, Shenzhen, China, October 13-17, 2019, Proceedings, Part III, pages 118–126, 2019. doi: 10.1007/978-3-030-32248-9_14.
  • Liu et al. (2020) W. Liu, R. Li, M. Zheng, S. Karanam, Z. Wu, B. Bhanu, R.J Radke, and O.I Camps. Towards Visually Explaining Variational Autoencoders. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 8639–8648, 2020. doi: 10.1109/CVPR42600.2020.00867.
  • Makhzani and Frey (2015) A. Makhzani and B.J Frey. Winner-take-all autoencoders. Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 2791–2799, 2015.
  • Lucic et al. (2018) M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are GANs Created Equal? A Large-Scale Study. Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems, pages 698–707, 2018.
  • Masci et al. (2011) J. Masci, U. Meier, D. C. Ciresan, and J. Schmidhuber. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning - ICANN 2011 - 21st International Conference on Artificial Neural Networks, Espoo, Finland, June 14-17, 2011, Proceedings, Part I, pages 52–59, 2011.
  • Miyato et al. (2018) T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral Normalization for Generative Adversarial Networks. 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018.
  • Ngo et al. (2019) CP. Ngo, AA. Winarto, C. Khor Li Kou, S. Park, F. Akram, and HK. Lee. Fence GAN: Towards Better Anomaly Detection. arXiv preprint arXiv:1904.01209, 2019.
  • NHS (2015) NHS. Fetal anomaly screening programme: programme handbook June 2015. Public Health England, 2015.
  • Oza and Patel (2019) P. Oza and V. M. Patel. One-class convolutional neural network. IEEE Signal Processing Letters, 26:277–281, 2019.
  • Perera et al. (2019) P. Perera, R. Nallapati, and B. Xiang. Ocgan: One-class novelty detection using gans with constrained latent representations. IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pages 2898–2906, 2019.
  • Perera et al. (2021) P. Perera, P. Oza, and V. M. Patel. One-class classification: A survey. CoRR, arXiv:2101.03064, 2021.
  • Petersen et al. (2013) Steffen E Petersen, Paul M Matthews, Fabian Bamberg, David A Bluemke, Jane M Francis, Matthias G Friedrich, Paul Leeson, Eike Nagel, Sven Plein, Frank E Rademakers, et al. Imaging in population science: cardiovascular magnetic resonance in 100,000 participants of uk biobank-rationale, challenges and approaches. Journal of Cardiovascular Magnetic Resonance, 15(1):46, 2013.
  • Pidhorskyi et al. (2018) S. Pidhorskyi, R. Almohsen, and G. Doretto. Generative probabilistic novelty detection with adversarial autoencoders. Advances in neural information processing systems, pages 6822–6833, 2018.
  • Pinto et al. (2012) NM. Pinto, HT. Keenan, LL. Minich, MD. Puchalski, M. Heywood, and LD. Botto. Barriers to prenatal detection of congenital heart disease: a population-based study. Ultrasound in Obstetrics and Gynecology, 40(4):418–425, 2012.
  • Radford et al. (2016) A. Radford, L. Metz, and S Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. 4th International Conference on Learning Representations, ICLR 2016,San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016.
  • Rosca et al. (2017) M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational Approaches for Auto-Encoding Generative Adversarial Networks. arXiv preprint arXiv:1706.04987, 2017.
  • Ruff et al. (2018) L. Ruff, R. Vandermeulen, N. Goernitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, and M. Kloft. Deep One-Class Classification. Proceedings of Machine Learning Research, pages 4393–4402, 2018.
  • Sabokrou et al. (2018) M. Sabokrou, M. Khalooei, M. Fathy, and E. Adeli. Adversarially Learned One-Class Classifier for Novelty Detection. 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3379–3388, 2018.
  • Schlegl et al. (2017) T. Schlegl, P. Seeböck, SM. Waldstein, U. Schmidt-Erfurth, and G. Langs. Unsupervised Anomaly Detection with Generative Adversarial Network to Guide Marker Discovery. Information Processing in Medical Imaging - 25th International Conference, IPMI 2017, Boone, NC, USA, June 25-30, 2017, Proceedings, pages 146–157, 2017. doi: 10.1007/978-3-319-59050-9\_12.
  • Schlegl et al. (2019) T. Schlegl, P. Seeböck, SM. Waldstein, G. Langs, and U. Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, pages 30–44, 2019. doi: 10.1016/j.media.2019.01.010.
  • Schlemper et al. (2019) J. Schlemper, O. Oktay, M. Schaap, M. Heinrich, B. Kainz, B. Glocker, and D. Rueckert. Attention gated networks: Learning to leverage salient regions in medical images. Medical Image Analysis, pages 197 – 207, 2019. doi: https://doi.org/10.1016/j.media.2019.01.012.
  • Schölkopf et al. (2001) B. Schölkopf, J. C. Platt, J. C. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13:1443–1471, 2001.
  • Shen et al. (2020) H. Shen, J. Chen, R. Wang, and J. Zhang. Counterfeit Anomaly Using Generative Adversarial Network for Anomaly Detection. IEEE Access, pages 133051–133062, 2020.
  • Tax and Duin (2004) D.M.J Tax and R.P.W Duin. Support vector data description. Machine Learning, 54:45–66, 2004.
  • Ulyanov et al. (2018) D. Ulyanov, A. Vedaldi, and V. S. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. AAAI, 2018.
  • van Velzen et al. (2016) CL. van Velzen, SA. Clur, MEB. Rijlaarsdam, CJ. Bax, E. Pajkrt, MW. Heymans, MN. Bekker, J. Hruda, CJM. de Groot, NA. Blom, and MC. Haak. Prenatal detection of congenital heart disease–results of a national screening programme. BJOG: An International Journal of Obstetrics and Gynaecology, 123(3):400–407, 2016.
  • Venkataramanan et al. (2019) S. Venkataramanan, K-C. Peng, R.V. Singh, and A. Mahalanobis. Attention Guided Anomaly Localization in Images. arXiv preprint arXiv:1911.08616, 2019.
  • Yeo et al. (2018) L. Yeo, S. Luewan, and R. Romero. Fetal Intelligent Navigation Echocardiography (FINE) detects 98% of Congenital Heart Disease. Journal of Ultrasound in Medicine, 37(11):2577–2593, 2018.
  • Zaheer et al. (2020) M. Z. Zaheer, J.-h Lee, M. Astrid, and S-I Lee. Old is gold: Redefining the adversarially learned one-class classifier training paradigm. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14183–14193, 2020.
  • Zhang et al. (2019) H. Zhang, I. J Goodfellow, D.N Metaxas, and A. Odena. Self-attention generative adversarial networks. Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, 97:7354–7363, 2019.
  • Zhou et al. (2020) K. Zhou, S. Gao, J. Cheng, Z. Gu, H. Fu, Z. Tu, J. Yang, Y. Zhao, and J. Liu. Sparse-Gan: Sparsity-Constrained Generative Adversarial Network for Anomaly Detection in Retinal OCT Image. 17th IEEE International Symposium on Biomedical Imaging, ISBI 2020, Iowa City, IA, USA, April 3-7, 2020, pages 1227–1231, 2020.