A Differentially Private Probabilistic Framework for Modeling the Variability Across Federated Datasets of Heterogeneous Multi-View Observations

Irene Balelli1, Santiago Silva1, Marco Lorenzi1
1: Centre INRIA d’Université Côte d'Azur
Publication date: 2022/04/22
https://doi.org/10.59275/j.melba.2022-7175

Abstract

We propose a novel federated learning paradigm to model data variability among heterogeneous clients in multi-centric studies. Our method is expressed through a hierarchical Bayesian latent variable model, where client-specific parameters are assumed to be realizations of a global distribution at the master level, which is in turn estimated to account for data bias and variability across clients. We show that our framework can be effectively optimized through expectation maximization (EM) over the latent master distribution and the clients' parameters. We also introduce formal differential privacy (DP) guarantees compatible with our EM optimization scheme. We tested our method on the analysis of multi-modal medical imaging data and clinical scores from distributed clinical datasets of patients affected by Alzheimer's disease. We demonstrate that our method is robust whether data are distributed in iid or non-iid manners, even when local parameter perturbation is included to provide DP guarantees. Our approach allows us to quantify the variability of data, views and centers, while guaranteeing high-quality data reconstruction compared to state-of-the-art autoencoding models and federated learning schemes.
The code is available at https://gitlab.inria.fr/epione/federated-multi-views-ppca

Keywords

federated learning · hierarchical generative model · heterogeneity · differential privacy




1 Introduction

The analysis of medical imaging datasets requires the joint modeling of multiple views (or modalities), such as clinical scores and multi-modal medical imaging data. For example, in datasets from neurological studies, views are generated through different medical imaging data acquisition processes, such as Magnetic Resonance Imaging (MRI) or Positron Emission Tomography (PET). Each view provides specific information about the pathology, and the joint analysis of all views is necessary to improve diagnosis, to discover pathological relationships, and to predict disease evolution. Nevertheless, the integration of multi-view data, accounting for their mutual interactions and their joint variability, presents a number of challenges.

When dealing with high-dimensional and noisy data, it is crucial to extract an informative lower-dimensional representation able to disentangle the relationships among observations, while accounting for the intrinsic heterogeneity of the original complex data structure. From a statistical perspective, this implies the estimation of a model of the joint variability across views, or equivalently the development of a joint generative model, assuming the existence of a common latent representation generating all views.

Several data assimilation methods based on dimensionality reduction have been developed (Cunningham and Ghahramani, 2015), and successfully applied to a variety of domains. The main goal of these methods is to identify a suitable lower-dimensional latent space, where meaningful statistical properties of the original dataset are preserved after projection. The most basic among such methods is Principal Component Analysis (PCA) (Jolliffe, 1986), where data are projected onto the axes of maximal variability. More flexible approaches based on non-linear representations of the data variability are Auto-Encoders (Kramer, 1991; Goodfellow et al., 2016), which learn a low-dimensional representation by minimizing the reconstruction error. In the medical imaging community, non-linear counterparts of PCA have also been proposed by extending the notion of principal components and variability to the Riemannian setting (Sommer et al., 2010; Banerjee et al., 2017).

In some cases, Bayesian counterparts of the original dimensionality reduction methods have been developed, such as Probabilistic Principal Component Analysis (PPCA) (Tipping and Bishop, 1999), based on factor analysis, or, more recently, Variational Auto-Encoders (VAEs) (Kingma and Welling, 2019), and Bayesian principal geodesic analysis (Zhang and Fletcher, 2013; Hromatka et al., 2015; Fletcher and Zhang, 2016). In particular, VAEs are machine learning algorithms based on a generative function which allows probabilistic data reconstruction from the latent space. Encoder and decoder can be flexibly parametrized by neural networks (NNs), and efficiently optimized through Stochastic Gradient Descent (SGD). The added value of Bayesian methods is to provide a tool for sampling new observations from the estimated data distribution, and for quantifying the uncertainty of data and parameters. In addition, Bayesian model selection criteria, such as the Watanabe-Akaike Information Criterion (WAIC) (Gelman et al., 2014), enable automatic model selection.

Multi-centric biomedical studies offer a great opportunity to significantly increase the quantity and quality of available data, hence to improve the statistical reliability of their analysis. Nevertheless, in this context, three main data-related challenges should be considered. 1) Statistical heterogeneity of local datasets (i.e. center-specific datasets): observations may be non-identically distributed across centers with respect to some characteristics affecting the output (e.g. diagnosis). Additional variability in local datasets can also come from data collection and acquisition bias (Kalter et al., 2019). 2) Missing views: not all views are usually available for each center, due for example to heterogeneous data acquisition and processing pipelines. 3) Privacy concerns: privacy-preserving laws are currently enforced to ensure protection of personal data (e.g. the European General Data Protection Regulation - GDPR, https://gdpr-info.eu/), often preventing the centralized analysis of data collected in multiple centers (Iyengar et al., 2018; Chassang, 2017). These limitations impose the need for extending currently available data assimilation methods to handle decentralized heterogeneous data and missing views in local datasets.

Federated learning (FL) is an emerging analysis paradigm specifically developed for the decentralized training of machine learning models. The standard aggregation method in FL is Federated Averaging (FedAvg) (McMahan et al., 2017a), which combines locally trained models via weighted averaging. This aggregation scheme is generally sensitive to statistical heterogeneity, which naturally arises in federated datasets (Li et al., 2020), for example when dealing with multi-view data, or when data are not uniformly represented across data centers (e.g. non-iid distributed). In this case a faithful representation of the variability across centers is not guaranteed.

In order to guarantee data governance, FL methods are conceived to avoid sensitive data transfer among centers: raw data are processed within each center, and only local parameters are shared with the master. Nevertheless, no formal privacy guarantees are provided on the shared statistics, which may still reveal sensitive information about the individual data points used to train the model. Differential privacy (DP) is an established framework to provide theoretical guarantees about the anonymity of the shared statistics with respect to the training data points. Recent works (Abadi et al., 2016; Geyer et al., 2017; Triastcyn and Faltings, 2019) show the importance of combining FL and DP to prevent potential information leakage from the shared parameters, while providing theoretical privacy guarantees for both clients and server.

Figure 1: Hierarchical structure of Fed-mv-PPCA. Global parameters $\widetilde{\boldsymbol{\theta}}$ characterize the distribution of the local parameters $\boldsymbol{\theta}_{c}$, which parametrize the local data distribution in each center.

We present here Federated multi-view PPCA (Fed-mv-PPCA), a novel FL framework for data assimilation of heterogeneous multi-view datasets. Our framework is designed to account for the heterogeneity of federated datasets through a fully Bayesian formulation. Fed-mv-PPCA is based on a hierarchical dependency of the model's parameters to handle different sources of variability in the federated dataset (Figure 1). The method is based on a linear generative model, assuming Gaussian latent variables and noise, and accounts for missing views and observations across datasets. In practice, we assume that there exists an ideal global distribution of each parameter, from which local parameters are generated to account for the local data distribution of each center. We show, in addition, that the privacy of the shared parameters of Fed-mv-PPCA can be explicitly quantified and guaranteed by means of DP. The code, developed in Python, is publicly available at https://gitlab.inria.fr/epione/federated-multi-views-ppca.

The paper is organized as follows: in Section 2 we provide a brief overview of the state-of-the-art and highlight the advancements introduced with Fed-mv-PPCA. In Section 3 we describe Fed-mv-PPCA, while its extension to improve privacy preservation through DP is provided in Section 3.2. In Section 4 we show results with applications to synthetic data and to data from the Alzheimer’s Disease Neuroimaging Initiative dataset (ADNI). Section 5 concludes the paper with a brief discussion.

2 Related Works

Several methods for dimensionality reduction based on generative models have been developed in the past years, starting from the seminal work on PPCA by Tipping and Bishop (1999), to Bayesian Canonical Correlation Analysis (CCA) (Klami et al., 2013), which Matsuura et al. (2018) extended to include multiple views and missing modalities, up to more complex methods based on multivariate association models (Shen and Thompson, 2019), developed, for example, to integrate multi-modal brain imaging data and high-throughput genomics data. Other works with interesting applications to medical imaging data are based on Riemannian approaches to better deal with non-linearity (Sommer et al., 2010; Banerjee et al., 2017), and have been extended to a latent variable formulation (Probabilistic Principal Geodesic Analysis - PPGA - by Zhang and Fletcher (2013)).

More recent methods for the probabilistic analysis of multi-view datasets include the multi-channel Variational Autoencoder (mc-VAE) by Antelmi et al. (2019) and Multi-Omics Factor Analysis (MOFA) by Argelaguet et al. (2018). MOFA generalizes PPCA for the analysis of multi-omics data types, supporting different noise models to adapt to continuous, binary and count data, while mc-VAE extends the classic VAE (Kingma and Welling, 2014) to jointly account for multi-view data. Additionally, mc-VAE can handle sparse datasets: data reconstruction in testing can be inferred from the available views, if some are missing.

Despite the possibility offered by the above methods for performing data assimilation and integrating multiple views, these approaches have not been conceived to handle federated datasets.

Statistical heterogeneity is a key challenge in FL and, more generally, in multi-centric studies (Li et al., 2020). To tackle this problem, Li et al. (2018) proposed the FedProx algorithm, which improves FedAvg by allowing for partial local work (i.e. adapting the number of local epochs) and by introducing a proximal term in the local objective function to avoid divergence due to data heterogeneity. Other methods have been developed under the Bayesian non-parametric formalism, such as probabilistic neural matching (Yurochkin et al., 2019), where the local parameters of NNs are federated based on neuron similarities.

Since the development of FedAvg, researchers have been focusing on developing FL frameworks robust to statistical heterogeneity across clients (Sattler et al., 2019; Liang et al., 2020). Most of these frameworks are however formulated for training schemes based on stochastic gradient descent, with principal applications to NN models. Nevertheless, beyond applications tailored around NNs, we still lack a consistent and privacy-compliant Bayesian framework for the estimation of local and global data variability, as part of a global optimization model, while accounting for data heterogeneity. In particular, FL alone does not provide clear theoretical guarantees for privacy preservation, leaving the door open to potential data leakage caused by malicious clients or the central server, for instance through model inversion (Fredrikson et al., 2015), and researchers are currently focusing on adapting FL schemes to account for DP mechanisms (McMahan et al., 2017b).

All these considerations ultimately motivate for the development of Fed-mv-PPCA and its differential private extension. The main contributions of the work presented in this paper are the following:

  • we theoretically develop a novel Bayesian hierarchical framework, Fed-mv-PPCA, for data assimilation from heterogeneous multi-view private federated datasets;

  • we investigate the improvement of our framework's security against data leakage by coupling it with differential privacy, and propose DP-Fed-mv-PPCA;

  • we apply both models to synthetic data and to real multi-modal imaging data and clinical scores from the Alzheimer's Disease Neuroimaging Initiative, demonstrating the robustness of our framework against non-iid data distributions across centers and missing modalities.

3 Methods

3.1 Federated multi-view PPCA

3.1.1 Problem setup

We consider $C$ independent centers. Each center $c\in\{1,\dots,C\}$ owns a private local dataset $T_{c}=\left\{\mathbf{t}_{c,n}\right\}_{n=1,\dots,N_{c}}$, where we denote by $\mathbf{t}_{c,n}$ the data row for subject $n$ in center $c$, with $n=1,\dots,N_{c}$. We assume that a total of $K$ distinct views have been measured across all centers, and we allow missing views in some local datasets (i.e. some local datasets could be incomplete, including only measurements for $K_{c}<K$ views). For every $k\in\{1,\dots,K\}$, the dimension of the $k^{\textrm{th}}$ view (i.e. the number of features defining the $k^{\textrm{th}}$ view) is $d_{k}$, and we define $d:=\sum_{k=1}^{K}d_{k}$. We denote by $\mathbf{t}_{c,n}^{(k)}$ the raw data of subject $n$ in center $c$ corresponding to the $k^{\textrm{th}}$ view, hence $\mathbf{t}_{c,n}=\left(\mathbf{t}_{c,n}^{(1)},\dots,\mathbf{t}_{c,n}^{(K)}\right)$.

3.1.2 Modeling assumptions

The main assumption at the basis of Fed-mv-PPCA is the existence of a hierarchical structure underlying the data distribution. In particular, we assume that there exist global parameters $\widetilde{\boldsymbol{\theta}}$, following a distribution $P(\widetilde{\boldsymbol{\theta}})$, able to describe the global data variability, i.e. the ensemble of local datasets. For each center, local parameters $\boldsymbol{\theta}_{c}$ are generated from $P(\boldsymbol{\theta}_{c}|\widetilde{\boldsymbol{\theta}})$, to account for the specific variability of the local dataset. Finally, local data $\mathbf{t}_{c}$ are obtained from their local distribution $P(\mathbf{t}_{c}|\boldsymbol{\theta}_{c})$. Given the federated datasets, Fed-mv-PPCA provides a consistent Bayesian framework to solve the inverse problem and estimate the model's parameters across the entire hierarchy.

We assume that in each center $c$, the local data of subject $n$ corresponding to the $k^{\textrm{th}}$ view, $\mathbf{t}_{c,n}^{(k)}$, follows the generative model:

$$\mathbf{t}_{c,n}^{(k)}=W_{c}^{(k)}\mathbf{x}_{c,n}+\boldsymbol{\mu}_{c}^{(k)}+\boldsymbol{\varepsilon}_{c}^{(k)},\qquad(1)$$

where $\mathbf{x}_{c,n}\sim\mathcal{N}(0,\mathbb{I}_{q})$ is a $q$-dimensional latent variable, and $q<\min_{k}(d_{k})$ is the dimension of the latent space. $W_{c}^{(k)}\in\mathbb{R}^{d_{k}\times q}$ provides the linear mapping between latent space and observations for the $k^{\textrm{th}}$ view, $\boldsymbol{\mu}_{c}^{(k)}\in\mathbb{R}^{d_{k}}$ is the offset of the data corresponding to view $k$, and $\boldsymbol{\varepsilon}_{c}^{(k)}\sim\mathcal{N}\left(0,{\sigma_{c}^{(k)}}^{2}\mathbb{I}_{d_{k}}\right)$ is the Gaussian noise for the $k^{\textrm{th}}$ view. This formulation induces a Gaussian distribution over $\mathbf{t}_{c,n}^{(k)}$, implying:

$$\mathbf{t}_{c,n}^{(k)}\sim\mathcal{N}(\boldsymbol{\mu}_{c}^{(k)},C_{c}^{(k)}),\qquad(2)$$

where $C_{c}^{(k)}=W_{c}^{(k)}{W_{c}^{(k)}}^{T}+{\sigma_{c}^{(k)}}^{2}\mathbb{I}_{d_{k}}\in\mathbb{R}^{d_{k}\times d_{k}}$. Finally, a compact formulation for $\mathbf{t}_{c,n}$ (i.e. considering all views concatenated) can be derived from Equation (1):

$$\mathbf{t}_{c,n}=W_{c}\mathbf{x}_{c,n}+\boldsymbol{\mu}_{c}+\Psi_{c},\qquad(3)$$

where $W_{c},\boldsymbol{\mu}_{c}$ are obtained by concatenating all $W_{c}^{(k)},\boldsymbol{\mu}_{c}^{(k)}$, and $\Psi_{c}$ is a block-diagonal matrix whose $k^{\textrm{th}}$ block is given by $\boldsymbol{\varepsilon}_{c}^{(k)}$. The local parameters describing the center-specific dataset are thus $\boldsymbol{\theta}_{c}:=\left\{\boldsymbol{\mu}_{c}^{(k)},W_{c}^{(k)},{\sigma_{c}^{(k)}}^{2}\right\}_{k}$. According to our hierarchical formulation, we assume that each local parameter in $\boldsymbol{\theta}_{c}$ is a realization of a common global prior distribution described by $\widetilde{\boldsymbol{\theta}}:=\left\{\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}},\widetilde{W}^{(k)},\sigma_{\widetilde{W}^{(k)}},\widetilde{\alpha}^{(k)},\widetilde{\beta}^{(k)}\right\}_{k}$. In particular, we assume that $\boldsymbol{\mu}_{c}^{(k)}$ and $W_{c}^{(k)}$ are normally distributed, while the variance of the Gaussian error, ${\sigma_{c}^{(k)}}^{2}$, follows an inverse-gamma distribution. Formally:

$$\boldsymbol{\mu}_{c}^{(k)}|\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}\sim\mathcal{N}\left(\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\mathbb{I}_{d_{k}}\right),\qquad(4)$$
$$W_{c}^{(k)}|\widetilde{W}^{(k)},\sigma_{\widetilde{W}^{(k)}}\sim\mathcal{MN}_{k,q}\left(\widetilde{W}^{(k)},\mathbb{I}_{d_{k}},\sigma_{\widetilde{W}^{(k)}}^{2}\mathbb{I}_{q}\right),\qquad(5)$$
$${\sigma_{c}^{(k)}}^{2}|\widetilde{\alpha}^{(k)},\widetilde{\beta}^{(k)}\sim\textrm{Inverse-Gamma}(\widetilde{\alpha}^{(k)},\widetilde{\beta}^{(k)}),\qquad(6)$$

where $\mathcal{MN}_{k,q}$ denotes the matrix normal distribution of dimension $d_{k}\times q$.
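To make the hierarchy concrete, the following NumPy sketch samples center-specific parameters from the global priors of Equations (4)-(6) and generates local multi-view data according to Equation (1). It is an illustration only (all dimensions and prior values below are arbitrary), not the released implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    q, d_views = 5, [15, 8, 10]   # latent dimension and per-view dimensions (illustrative)

    # Global (master-level) parameters: one prior per view.
    global_prior = [dict(mu=rng.normal(size=d), sigma_mu=0.1,
                         W=rng.normal(size=(d, q)), sigma_W=0.1,
                         alpha=3.0, beta=0.5) for d in d_views]

    def sample_center(prior, n_subjects):
        """Draw center-specific parameters from the global prior, then generate local data."""
        x = rng.normal(size=(n_subjects, q))                      # shared latent variables x_{c,n}
        views = []
        for p in prior:
            mu_c = rng.normal(p["mu"], p["sigma_mu"])             # Eq. (4)
            W_c = rng.normal(p["W"], p["sigma_W"])                # Eq. (5): entry-wise, since both covariances are isotropic
            var_c = 1.0 / rng.gamma(p["alpha"], 1.0 / p["beta"])  # Eq. (6): inverse-gamma noise variance
            noise = rng.normal(scale=np.sqrt(var_c), size=(n_subjects, len(p["mu"])))
            views.append(x @ W_c.T + mu_c + noise)                # Eq. (1)
        return views

    local_views = sample_center(global_prior, n_subjects=100)
    print([v.shape for v in local_views])   # [(100, 15), (100, 8), (100, 10)]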

3.1.3 Proposed framework

The assumptions made in Section 3.1.2 naturally lead to an optimization scheme based on Expectation Maximization (EM) locally, and on Maximum Likelihood (ML) estimation at the master level (Algorithm 1). Figure 2 shows the graphical model of Fed-mv-PPCA.

Figure 2: Graphical model of Fed-mv-PPCA. Thick double-sided red arrows relate nodes which are shared between center and master, while plain black arrows define the relations between the local dataset and the generative model parameters. Grey filled circles correspond to raw data: the dashed double-sided arrow highlights the complexity of the dataset, composed of multiple views.
Input: Rounds $R$; Iterations $I$; Latent space dimension $q$
Output: Global parameters $\widetilde{\boldsymbol{\theta}}$
for $r=1,\dots,R$ do
    for $c=1,\dots,C$ in parallel do
        Each center $c$ initializes $\boldsymbol{\theta}_{c}$ using $P(\boldsymbol{\theta}_{c}|\widetilde{\boldsymbol{\theta}})$;
        $I$ iterations of MAP estimation of $\boldsymbol{\theta}_{c}$ using $\widetilde{\boldsymbol{\theta}}$ as prior;
    end for
    Each center $c$ returns $\boldsymbol{\theta}_{c}$ to the master;
    The master collects $\boldsymbol{\theta}_{c}$, $c=1,\dots,C$, and estimates $\widetilde{\boldsymbol{\theta}}$ through ML;
    The master sends $\widetilde{\boldsymbol{\theta}}$ to all centers
end for
Algorithm 1: Fed-mv-PPCA algorithm

With reference to Algorithm 1, the optimization of Fed-mv-PPCA is as follows:

Optimization.

The master collects the local parameters $\boldsymbol{\theta}_{c}$ for $c\in\{1,\dots,C\}$ and estimates the ML updated global parameters characterizing the prior distributions of Equations (4) to (6). Updated global parameters $\widetilde{\boldsymbol{\theta}}$ are returned to each center, and serve as priors to update the MAP estimation of the local parameters $\boldsymbol{\theta}_{c}$, through the M step on the functional $\mathbf{E}_{p(\mathbf{x}_{c,n}|\mathbf{t}_{c,n})}\ln\left(p(\mathbf{t}_{c,n},\mathbf{x}_{c,n}|\boldsymbol{\theta}_{c})\,p(\boldsymbol{\theta}_{c}|\widetilde{\boldsymbol{\theta}})\right)$, where:

$$\mathbf{x}_{c,n}|\mathbf{t}_{c,n}\sim\mathcal{N}\left(\Sigma_{c}^{-1}W_{c}^{T}\Psi_{c}^{-1}(\mathbf{t}_{c,n}-\boldsymbol{\mu}_{c}),\,\Sigma_{c}^{-1}\right),\qquad\Sigma_{c}:=(\mathbb{I}_{q}+W_{c}^{T}\Psi_{c}^{-1}W_{c})$$

and

$$\begin{aligned}\langle\ln\left(p(\mathbf{t}_{c,n},\mathbf{x}_{c,n}|\boldsymbol{\theta}_{c})\right)\rangle=-\sum_{n=1}^{N_{c}}\Bigg\{\sum_{k=1}^{K}\bigg[&\frac{d_{k}}{2}\ln\left({\sigma_{c}^{(k)}}^{2}\right)+\frac{1}{2{\sigma_{c}^{(k)}}^{2}}\|\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\|^{2}+\frac{1}{2{\sigma_{c}^{(k)}}^{2}}\,tr\left({W_{c}^{(k)}}^{T}W_{c}^{(k)}\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle\right)\\&-\frac{1}{{\sigma_{c}^{(k)}}^{2}}\langle\mathbf{x}_{c,n}\rangle^{T}{W_{c}^{(k)}}^{T}\left(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_{c}^{(k)}\right)\bigg]+\frac{1}{2}tr\left(\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle\right)\Bigg\},\end{aligned}$$
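The E step above admits a direct implementation. The following is a minimal sketch (with hypothetical variable names, not the released code) computing the posterior moments $\langle\mathbf{x}_{c,n}\rangle$ and $\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle$ entering this functional:

    import numpy as np

    def e_step(T, W, mu, psi_diag):
        """Posterior moments of x_{c,n} | t_{c,n} for one center.

        T: (N, d) concatenated views, W: (d, q), mu: (d,),
        psi_diag: (d,) diagonal of the noise covariance Psi_c.
        """
        q = W.shape[1]
        W_psi = W.T / psi_diag                       # W_c^T Psi_c^{-1}
        Sigma = np.eye(q) + W_psi @ W                # Sigma_c = I_q + W_c^T Psi_c^{-1} W_c
        Sigma_inv = np.linalg.inv(Sigma)             # posterior covariance of x_{c,n}
        Ex = (T - mu) @ W_psi.T @ Sigma_inv          # posterior means <x_{c,n}>, shape (N, q)
        # Second moments <x x^T> = Sigma_c^{-1} + <x><x>^T, one (q, q) matrix per subject
        Exx = Sigma_inv[None, :, :] + Ex[:, :, None] * Ex[:, None, :]
        return Ex, Exx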
Initialization at round $r=1$.

The latent-space dimension $q$, the number of local iterations $I$ and the number of communication rounds $R$ (i.e. the number of complete centers-master cycles) are user-defined parameters. For the sake of simplicity, we set here the same number of local iterations for every center. Note that this constraint can be easily adapted to take into account systems heterogeneity among centers, as well as the size of each local dataset. At the first round, local parameters initialization, hence optimization, can be performed in two distinct ways: 1) each center can initialize every local parameter randomly, then perform EM through $I$ iterations, maximizing the functional $\langle\ln\left(p(\mathbf{t}_{c,n},\mathbf{x}_{c,n}|\boldsymbol{\theta}_{c})\right)\rangle$; 2) the master can provide priors for at least some parameters, which will be optimized using MAP estimation as described above. In case of a random initialization of local parameters, the number of EM iterations for the first round can be increased: this can be seen as an exploratory phase.

The reader can refer to Appendix A for further details on the theoretical formulation of Fed-mv-PPCA and the corresponding optimization scheme.

3.2 Fed-mv-PPCA with Differential Privacy

Although the deployed Bayesian federated learning scheme prevents data transfer, it does not provide theoretical privacy guarantees on the shared statistics. Differential privacy (DP) (Dwork et al., 2014; Abadi et al., 2016) is a standard framework for privacy-preserving computations, which allows quantifying the privacy protection budget attached to a given operation, and sanitizing model parameters through output perturbation mechanisms based on the addition of random noise. The noise strength has to be tuned to ensure a good balance between privacy and utility of the outputs.

In Section 3.2.1 we recall the standard definition of differential privacy and established results on classical random perturbation mechanisms, as well as the composition theorem (Dwork et al., 2014). A differentially private version of Fed-mv-PPCA (DP-Fed-mv-PPCA) is subsequently derived in Section 3.2.2.

3.2.1 Differential privacy: background

We denote by $D,D'$ two datasets: $D$ and $D'$ are said to be neighboring or adjacent datasets if they only differ by a datapoint $\mathbf{t}'$, i.e. $D=D'\cup\{\mathbf{t}'\}$. In this case we write $\|D-D'\|=1$, where $\|\cdot\|$ denotes the cardinality of a given set.

Definition 1

A randomized algorithm $\mathcal{M}:\mathcal{D}\to\mathcal{R}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ is $(\varepsilon,\delta)$-differentially private if for any $D,D'\in\mathcal{D}$ s.t. $\|D-D'\|=1$ and for any $\mathcal{S}\subseteq\mathcal{R}$:

$$\mathbb{P}\left[\mathcal{M}(D)\in\mathcal{S}\right]\leq e^{\varepsilon}\,\mathbb{P}\left[\mathcal{M}(D')\in\mathcal{S}\right]+\delta$$

When $\delta=0$, we simply say that the algorithm $\mathcal{M}$ is $\varepsilon$-differentially private.

A common mechanism to approximate a deterministic function or a query $f:\mathcal{D}\to\mathbb{R}^{d}$ with differential privacy is the addition of a random noise calibrated on the sensitivity of $f$.

Definition 2

The $l_{p}$-sensitivity of a function $f:\mathcal{D}\to\mathbb{R}^{d}$ is defined as:

$$\Delta_{p}f=\max_{\|D-D'\|=1}\|f(D)-f(D')\|_{p}$$

Classical mechanisms used for perturbation are the Laplace mechanism and the Gaussian mechanism. A Laplace (resp. Gaussian) mechanism is simply obtained by computing $f$, then perturbing it with noise drawn from a Laplace (resp. Gaussian) distribution centered at the origin and with variance depending on the sensitivity of $f$:

$$\mathcal{M}(D):=f(D)+Noise,$$

where $Noise\sim\textrm{Laplace}\left(0,\textrm{std}_{L}(\Delta_{p}f)\right)$ (resp. $Noise\sim\mathcal{N}\left(0,\textrm{var}_{G}(\Delta_{p}f)\right)$).

Hereafter we recall the conditions under which a Laplace (resp. Gaussian) mechanism preserves $\varepsilon$-DP (resp. $(\varepsilon,\delta)$-DP).

Theorem 3

Given any function $f:\mathcal{D}\to\mathbb{R}^{d}$ and $\varepsilon>0$, the Laplace mechanism defined as

$$\mathcal{M}(D):=f(D)+(L_{1},\dots,L_{d}),$$

where the $L_{i}$ are iid drawn from $\textrm{Laplace}(0,\Delta_{1}f/\varepsilon)$, preserves $\varepsilon$-DP.

Theorem 4

Given any function $f:\mathcal{D}\to\mathbb{R}^{d}$ and $(\varepsilon,\delta)\in(0,1)^{2}$, the Gaussian mechanism defined as

$$\mathcal{M}(D):=f(D)+\mathcal{N}\left(\mathbf{0},\left(\frac{\sqrt{2\ln(1.25/\delta)}\,\Delta_{2}f}{\varepsilon}\right)^{2}\mathbb{I}_{d}\right),$$

preserves $(\varepsilon,\delta)$-DP.

The formal proofs of Theorems 3-4 are provided e.g. by Dwork et al. (2014).

An improved Gaussian mechanism is further described by Zhao et al. (2019), with the advantages of 1) remaining valid for $\varepsilon>1$ given $\delta\leq 0.5$, and 2) adding a smaller noise as compared to the result of Theorem 4 in the case $0<\varepsilon\leq 1$.

Theorem 5

Given any function $f:\mathcal{D}\to\mathbb{R}^{d}$, $\varepsilon>0$, and $\delta\in(0,0.5)$, the Gaussian mechanism defined as

$$\mathcal{M}(D):=f(D)+\mathcal{N}\left(\mathbf{0},\left(\frac{(c+\sqrt{c^{2}+\varepsilon})\,\Delta_{2}f}{\varepsilon\sqrt{2}}\right)^{2}\mathbb{I}_{d}\right),$$

where $c=\sqrt{\ln\left(2/(\sqrt{16\delta+1}-1)\right)}$, preserves $(\varepsilon,\delta)$-DP.
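As an illustration, the noise scale prescribed by Theorem 5 can be computed directly from $\varepsilon$, $\delta$ and the $l_{2}$ sensitivity. The sketch below uses arbitrary example values and is not part of the released code:

    import numpy as np

    def gaussian_mechanism_std(sensitivity_l2, eps, delta):
        """Noise standard deviation of the Gaussian mechanism of Theorem 5 (Zhao et al., 2019)."""
        assert eps > 0 and 0 < delta < 0.5
        c = np.sqrt(np.log(2.0 / (np.sqrt(16.0 * delta + 1.0) - 1.0)))
        return (c + np.sqrt(c**2 + eps)) * sensitivity_l2 / (eps * np.sqrt(2.0))

    rng = np.random.default_rng(0)
    f_D = rng.normal(size=10)                                     # some query output f(D), illustrative
    std = gaussian_mechanism_std(sensitivity_l2=1.0, eps=0.5, delta=1e-3)
    private_output = f_D + rng.normal(scale=std, size=f_D.shape)  # M(D)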

It is worth noting that Theorems 4 and 5 can be naturally extended to queries mapping to $\mathbb{R}^{d\times q}$ and to matrix normal mechanisms:

Corollary 6

Given any function $f:\mathcal{D}\to\mathbb{R}^{d\times q}$, $\varepsilon>0$, and $\delta\in(0,0.5)$, the matrix normal mechanism defined as

$$\mathcal{M}(D):=f(D)+\mathcal{MN}_{d,q}\left(\mathbf{0}_{d,q},\mathbb{I}_{d},\left(\frac{(c+\sqrt{c^{2}+\varepsilon})\,\Delta_{2}f}{\varepsilon\sqrt{2}}\right)^{2}\mathbb{I}_{q}\right),$$

where $c=\sqrt{\ln\left(2/(\sqrt{16\delta+1}-1)\right)}$, preserves $(\varepsilon,\delta)$-DP.

We conclude this section by recalling the well known composition theorem (Dwork et al., 2014), which will be useful to quantify the global privacy budget for each center in the next sections.

Theorem 7

For $i=1,\dots,k$, let $\mathcal{M}_{i}:\mathcal{D}\to\mathcal{R}_{i}$ be an $(\varepsilon_{i},\delta_{i})$-differentially private algorithm, and let $\mathcal{M}:\mathcal{D}\to\prod_{i=1}^{k}\mathcal{R}_{i}$ be defined as $\mathcal{M}(\mathcal{D}):=(\mathcal{M}_{1}(\mathcal{D}),\dots,\mathcal{M}_{k}(\mathcal{D}))$. Then $\mathcal{M}$ is $\left(\sum_{i=1}^{k}\varepsilon_{i},\sum_{i=1}^{k}\delta_{i}\right)$-differentially private.

3.2.2 Differential privacy for local parameters

In this section we propose a novel federated learning scheme for Fed-mv-PPCA with DP to protect client-level privacy and avoid potential private information leakage from the shared local parameters.

We are interested in preserving the privacy of the shared local parameters $\boldsymbol{\theta}_{c}=\{\boldsymbol{\mu}_{c}^{(k)},W_{c}^{(k)},{\sigma_{c}^{(k)}}^{2}\}_{k}$, which can be done by the addition of some properly tuned random noise, as detailed in Section 3.2.1. Nevertheless, the client-level optimization scheme in Fed-mv-PPCA is based on an iterative algorithm: therefore we do not have a closed formula to evaluate the sensitivity of each local parameter (i.e. of the queries), nor an upper bound. To overcome this problem, we propose to perform difference clipping (Geyer et al., 2017; Zhang et al., 2021), one of the clipping strategies proposed for differentially private SGD models. Algorithm 2 outlines the optimization scheme for the DP-Fed-mv-PPCA framework.

Input: Rounds $R$; Iterations $I$; Latent space dimension $q$; Privacy parameters $\varepsilon$, $\delta$
Output: Global parameters $\widetilde{\boldsymbol{\theta}}$
for $r=1,\dots,R$ do
    for $c=1,\dots,C$ in parallel do
        Initialize $\boldsymbol{\theta}_{c}$ using $P(\boldsymbol{\theta}_{c}|\widetilde{\boldsymbol{\theta}}[r-1])$;
        Update local parameters: $I$ iterations of MAP estimation (EM + prior) to optimize $\boldsymbol{\theta}_{c}[r]$ using $\widetilde{\boldsymbol{\theta}}[r-1]$ as prior;
        Compute difference: $\Delta\boldsymbol{\theta}_{c}[r]:=(\boldsymbol{\theta}_{c}[r]-\widetilde{\boldsymbol{\theta}}[r-1])$;
        Clip: $\overline{\Delta\boldsymbol{\theta}}_{c}[r]:=\Delta\boldsymbol{\theta}_{c}[r]/\max\left(1,\|\Delta\boldsymbol{\theta}_{c}[r]\|_{p}/g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1])\right)$;
        Perturb: $\mathcal{M}_{\boldsymbol{\theta}_{c}}[r]:=\overline{\Delta\boldsymbol{\theta}}_{c}[r]+Noise(2g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1]),\varepsilon,\delta)$;
        Return $\overline{\boldsymbol{\theta}}_{c}[r]:=\mathcal{M}_{\boldsymbol{\theta}_{c}}[r]+\widetilde{\boldsymbol{\theta}}[r-1]$ to the master;
    end for
    The master collects all $\overline{\boldsymbol{\theta}}_{c}[r]$ and estimates $\widetilde{\boldsymbol{\theta}}[r]$ through ML;
    The master sends $\widetilde{\boldsymbol{\theta}}[r]$ to all centers
end for
Algorithm 2: DP-Fed-mv-PPCA algorithm
Difference clipping and perturbation.

With respect to Algorithm 1, difference clipping and perturbation are performed at the client level, in a manner compatible with the probabilistic formulation of the model:

  1. The client computes the difference between the current local update and the initial prior (i.e. the corresponding global parameter obtained at the previous communication round, $r-1$):
     $$\Delta\boldsymbol{\theta}_{c}[r]:=(\boldsymbol{\theta}_{c}[r]-\widetilde{\boldsymbol{\theta}}[r-1])$$

  2. The updated difference is clipped according to the standard deviation of the prior:
     $$\overline{\Delta\boldsymbol{\theta}}_{c}[r]:=\Delta\boldsymbol{\theta}_{c}[r]\cdot\left(\max\left(1,\frac{\|\Delta\boldsymbol{\theta}_{c}[r]\|_{p}}{g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1])}\right)\right)^{-1},$$
     where $g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1]):=\textrm{const}\cdot\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1]$, and the multiplicative constant is fixed by the user. This clipping mechanism enforces the $l_{p}$ norm of $\Delta\boldsymbol{\theta}_{c}[r]$ to be at most $g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1])$. Consequently, the $l_{p}$ sensitivity of $\overline{\Delta\boldsymbol{\theta}}_{c}[r]$ is bounded by $2\,g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1])$.

  3. The clipped difference is perturbed:
     $$\mathcal{M}_{\boldsymbol{\theta}_{c}}[r]:=\overline{\Delta\boldsymbol{\theta}}_{c}[r]+Noise(2g(\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1]),\varepsilon,\delta)$$
     In particular, for $\overline{\Delta\boldsymbol{\mu}^{(k)}}_{c}$ and $\overline{\Delta W^{(k)}}_{c}$, we propose to use a Gaussian (resp. matrix normal) mechanism (Theorem 5, resp. Corollary 6), in accordance with the Gaussian prior distributions of these parameters, while a Laplace mechanism (Theorem 3) is used to perturb $\overline{\Delta{\sigma^{(k)}}^{2}}_{c}$.

  4. The client adds back the prior and finally sends $\overline{\boldsymbol{\theta}}_{c}[r]:=\mathcal{M}_{\boldsymbol{\theta}_{c}}[r]+\widetilde{\boldsymbol{\theta}}[r-1]$ to the master.

Conversely to model clipping (Abadi et al., 2016; Wei et al., 2020), where the parameter update is directly clipped and perturbed, difference clipping has the advantage of reducing the magnitude of the perturbation: indeed, we expect the $l_{p}$ norm of the difference $\Delta\boldsymbol{\theta}_{c}$ to be small compared to the $l_{p}$ norm of $\boldsymbol{\theta}_{c}$. Moreover, our framework provides a natural way to define the clipping parameter according to the prior. Indeed, the clipping parameter is defined here as the standard deviation of the global parameters. Hence, from a conceptual viewpoint, we are enforcing local parameter updates to remain close to the global ones within some ratio of their standard deviation. This obfuscates the participation of the individual centers, at the expense of reducing the ability of the framework to capture the between-centers variability.
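A minimal sketch of the client-side steps 1-4 is given below, for a single vector-valued parameter protected with the Gaussian mechanism of Theorem 5 (the actual framework applies a Gaussian, matrix normal or Laplace mechanism depending on the parameter type; variable names and constants are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def dp_local_update(theta_local, theta_global, sigma_global, eps, delta, const=1.0, p=2):
        """Difference clipping and perturbation of one local parameter (steps 1-4)."""
        g = const * sigma_global                                  # clipping threshold g(sigma_theta~)
        diff = theta_local - theta_global                         # step 1: difference w.r.t. the prior
        diff = diff / max(1.0, np.linalg.norm(diff, ord=p) / g)   # step 2: clip, so l_p sensitivity <= 2g
        c = np.sqrt(np.log(2.0 / (np.sqrt(16.0 * delta + 1.0) - 1.0)))
        std = (c + np.sqrt(c**2 + eps)) * (2.0 * g) / (eps * np.sqrt(2.0))   # Theorem 5 with sensitivity 2g
        diff = diff + rng.normal(scale=std, size=diff.shape)      # step 3: perturb
        return theta_global + diff                                # step 4: add back the prior and share

    theta_shared = dp_local_update(theta_local=np.ones(5), theta_global=np.zeros(5),
                                   sigma_global=0.1, eps=0.5, delta=1e-3)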

Privacy budget
Theorem 8

For the sake of simplicity, let us choose the same $\varepsilon,\delta$ for all mechanisms considered above (a generalization to a parameter-specific choice of $\varepsilon_{i},\delta_{i}$ is straightforward). The total privacy budget for the outputs of Algorithm 2 is $(3K\varepsilon,2K\delta)$, where $K$ is the total number of views.

Proof. The proof of Theorem 8 follows from Theorems 3-5 and Corollary 6, and by noting that data in each center are disjoint. In all centers, we are dealing with the mechanism $\mathcal{M}:=(\mathcal{M}_{\boldsymbol{\mu}_{c}^{(k)}},\mathcal{M}_{W_{c}^{(k)}},\mathcal{M}_{{\sigma_{c}^{(k)}}^{2}})_{k}$, where for all $k$, $\mathcal{M}_{\boldsymbol{\mu}_{c}^{(k)}}$ and $\mathcal{M}_{W_{c}^{(k)}}$ are $(\varepsilon,\delta)$-differentially private, while for all $k$, $\mathcal{M}_{{\sigma_{c}^{(k)}}^{2}}$ is $\varepsilon$-differentially private. The result follows thanks to the composition Theorem 7 and the invariance of differential privacy under post-processing.
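As a purely illustrative application of Theorem 8 (the privacy parameters below are arbitrary choices, not values used in the paper): with the $K=4$ views of the ADNI dataset and a per-mechanism budget of $\varepsilon=0.1$, $\delta=10^{-5}$, the resulting budget is $(3\cdot 4\cdot 0.1,\,2\cdot 4\cdot 10^{-5})=(1.2,\,8\times 10^{-5})$.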

Corollary 9

If for each local parameter $\theta_{c}\in\boldsymbol{\theta}_{c}$ the client-specific differential privacy parameters are $(\varepsilon_{c},\delta_{c})$, then the total privacy budget for the corresponding global parameter $\widetilde{\theta}$ is bounded by $(\max(\varepsilon_{c}),\max(\delta_{c}))$.

Proof  The result directly follows from Theorem 8 and by considering Definition 1 and the monotonicity of the exponential function.  

3.3 Computational complexity and communication cost

The computational complexity of local parameters optimization in Fed-mv-PPCA (with or without the introduction of the DP mechanism described in Section 3.2) can be derived from the complexity of standard PPCA (Chen et al., 2009). We recall that performing simple PPCA locally in center $c\in\{1,\dots,C\}$ implies a computational complexity of $\mathcal{O}(N_{c}d_{c}q)$, where $N_{c}$ and $d_{c}$ are respectively the number of samples and dimensions in center $c$, while $q$ is the chosen latent dimension. In the multi-view extension considered here, the total dimension is decomposed across views, meaning that $d_{c}:=\sum_{k\in K_{c}}d_{k}$, where $K_{c}$ is the set of observed views in center $c$, and $d_{k}$ the dimension of view $k$. The complete-data log-likelihood to be maximized in the M step of the expectation-maximization algorithm can consequently be written as a sum over the number of samples and the number of observed views in center $c$ (see Section 3.1.3 and Appendix A). For each $k\in K_{c}$, the computational complexity to optimize all $k$-specific local parameters is $\mathcal{O}(N_{c}d_{k}q)$. This finally implies a computational complexity of $\mathcal{O}(N_{c}q\prod_{k\in K_{c}}d_{k})\approx\mathcal{O}(N_{c}\prod_{k\in K_{c}}d_{k})$ when $q\ll\min_{k}(d_{k})$.

The communication cost of both (DP-)Fed-mv-PPCA schemes can likewise be derived from the communication cost of distributed PPCA (Elgamal et al., 2015), by considering that each center $c$ communicates to the central server the parameter set $\boldsymbol{\theta}_{c}:=\{\boldsymbol{\mu}_{c}^{(k)},W_{c}^{(k)},{\sigma_{c}^{(k)}}^{2}\}_{k\in K_{c}}$. For every $k$, the communication cost of $\boldsymbol{\theta}_{c}^{(k)}$ is $\mathcal{O}(d_{k}q)$. Consequently, the global communication cost of $\boldsymbol{\theta}_{c}$ is $\mathcal{O}(q\sum_{k\in K_{c}}d_{k}):=\mathcal{O}(d_{c}q)$, which is the same communication cost as standard PPCA for a $d_{c}$-dimensional dataset.
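As an illustrative order of magnitude, in the ADNI setting of Section 4 ($d_{c}=7+41+41+41=130$ and $q=6$), each center shares per round roughly $130\times 6=780$ entries for $W_{c}$, $130$ entries for $\boldsymbol{\mu}_{c}$ and one noise variance per view, i.e. fewer than a thousand scalars, in line with the $\mathcal{O}(d_{c}q)$ cost above.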

4 Applications

4.1 Materials

In the preparation of this article we used two datasets.

Synthetic dataset (SD): using the generative model described in Section 3.1.2, we generated 400 observations consisting of $K=3$ views of dimension $d_{1}=15$, $d_{2}=8$ and $d_{3}=10$, respectively. Each view was generated from a common 5-dimensional latent space. We randomly chose the parameters $W^{(k)},\boldsymbol{\mu}^{(k)},\sigma^{(k)}$. Finally, to simulate heterogeneity, a randomly chosen sub-sample composed of 250 observations was shifted in the latent space by a randomly generated vector: this allowed us to simulate the existence of two distinct groups in the population.

Alzheimer's Disease Neuroimaging Initiative dataset (ADNI): we consider 311 participants extracted from the ADNI dataset, among cognitively normal subjects (NL, 104 subjects) and patients diagnosed with AD (207 subjects). All participants are associated with multiple data views: cognitive scores including MMSE, CDR-SB, ADAS-Cog-11 and RAVLT (CLINIC), Magnetic Resonance Imaging (MRI), Fluorodeoxyglucose-PET (FDG) and AV45-Amyloid PET (AV45) images. MRI morphometrical biomarkers were obtained as regional volumes using the cross-sectional pipeline of FreeSurfer v6.0 and the Desikan-Killiany parcellation (Fischl, 2012). Measurements from AV45-PET and FDG-PET were estimated by co-registering each modality to their respective MRI space, normalizing by the cerebellum uptake and by computing regional amyloid load and glucose hypometabolism using the PetSurfer pipeline (Greve et al., 2014) and the same parcellation. Features were corrected beforehand with respect to intra-cranial volume, sex and age using a multivariate linear model. Data dimensions for each view are: $d_{\textrm{CLINIC}}=7$, $d_{\textrm{MRI}}=41$, $d_{\textrm{FDG}}=41$ and $d_{\textrm{AV45}}=41$. Further details on the demographics of the ADNI sample are provided in Appendix B, Table 3. (The ADNI project was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI was to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessments can be combined to measure the progression of early Alzheimer's disease (AD); see www.adni-info.org for up-to-date information.)

4.2 Benchmark

We compare our method to two state-of-the-art data assimilation methods: the Variational Autoencoder (VAE) (Kingma and Welling, 2014) and the multi-channel VAE (mc-VAE) (Antelmi et al., 2019). To keep the modeling setup consistent across methods, both autoencoders were tested by considering linear encoding and decoding mappings. In order to obtain the federated version of VAE and mc-VAE we use FedAvg (McMahan et al., 2017a), which is specifically conceived for stochastic gradient descent optimization. Additional tests were performed by considering non-linear VAEs (2 layers for both encoding and decoding architectures), and FedProx as an additional regularized FL aggregation method (results in Supp. Table 7 and Supp. Figure 11). For all optimization methods and federation schemes we set the total number of communication rounds to 100, of 15 epochs each, with the default learning rate ($10^{-3}$).

4.3 Results

We apply Fed-mv-PPCA to both the SD and ADNI datasets, and quantify the quality of reconstruction and the identification of the latent space with respect to an increasing number of centers, $C$, and increasing data heterogeneity. We also investigate the ability of Fed-mv-PPCA to estimate the data variability and to predict the distribution of missing views. To this end, we consider 4 different scenarios of data distribution across multiple centers, detailed in Table 1.

Table 1: Distribution of Datasets Across Centers.

Scenario | Description
IID | Data are iid distributed across $C$ centers with respect to groups, and a complete data row is provided for all subjects.
G | Data are non-iid distributed with respect to groups across $C$ centers: $C/3$ centers include subjects from both groups; $C/3$ centers only subjects from group 1 (AD in the ADNI case); $C/3$ centers only subjects from group 2 (NL for ADNI). All views have been measured in each center.
K | $C/3$ centers contribute with observations for all views; in $C/3$ centers the second view (MRI for ADNI) is missing; in $C/3$ centers the third view (FDG for ADNI) is missing. Data are iid distributed across $C$ centers with respect to groups.
G/K | Data are non-iid distributed (scenario G) and there are missing views (scenario K).

For each experiment considered hereafter with Fed-mv-PPCA, we perform 3-fold Cross Validation (3-CV) tests. For every test, local parameters are initialized randomly (i.e. no prior is provided by the master at the beginning), and the number of rounds is set to 100. Each round consists of 15 iterations of local MAP optimization, except the initialization round, which consists of 30 EM iterations. Finally, when a centralized setting is tested, the number of rounds is set to 1 and the number of EM iterations to 800.

4.3.1 Model selection

The latent space dimension $q$ is a user-defined parameter, with the only constraint $q<\min_{k}\{d_{k}\}$. To assess the optimal $q$, we consider the IID scenario and let $q$ vary. We perform 10 times a 3-fold Cross Validation (3-CV), and split the train dataset across 3 centers. The resulting models are compared using the WAIC criterion (Gelman et al., 2014). In addition, we consider the Mean Absolute reconstruction Error (MAE) in a hold-out test dataset: the MAE is obtained by evaluating the mean absolute distance between real data and data reconstructed using the global distribution. Figure 3 shows the evolution of the WAIC and MAE with respect to the latent space dimension.

Figure 3: WAIC score and MAE for (a) the SD dataset and (b) the ADNI dataset. In both figures, the left y-axis scaling describes the MAE while the right y-axis scaling corresponds to the WAIC score.
Figure 4: Standardized mean differences between WAIC scores for ADNI data, computed as $(\textrm{mean}(\textrm{WAIC}_{q})-\textrm{mean}(\textrm{WAIC}_{q-1}))/\sqrt{\textrm{var}(\textrm{WAIC}_{q})/N_{q}-\textrm{var}(\textrm{WAIC}_{q-1})/N_{q-1}}$, for $2<q\leq 15$.

Concerning the SD dataset, the WAIC suggests $q=5$ latent dimensions (Figure 3 (a)), hence demonstrating the ability of Fed-mv-PPCA to correctly recover the ground-truth latent space dimension used to generate the data. Analogously, the MAE improves drastically up to dimension $q=5$, and subsequently stabilizes. For ADNI, the MAE improves for increasing latent space dimensions, and we obtain the best WAIC score for $q=6$. In this case, one can notice that both the WAIC and MAE keep decreasing when $q$ varies from 1 to 6. In Figure 4 we display the standardized mean differences of WAIC scores for $2<q\leq 15$: increasing the latent dimension $q$ above 6 implies a mild relative improvement of the WAIC, while requiring a computationally more complex model and higher communication costs (see Section 3.3). This ultimately indicates that the choice of $q=6$ is a reasonable compromise for the ADNI database, allowing to efficiently capture most data variability, while remaining coherent with the model hypotheses (cf. $q<\min_{k}d_{k}$). Additionally, we should stress that when the $k$-th view dimension is smaller than or equal to the latent dimension ($d_{k}\leq q$ for some $k$), we assumed that only the first $d_{k}-1$ columns of $W_{c}^{(k)}$ effectively contribute to the latent projection of view $k$, and we forced the remaining columns of $W_{c}^{(k)}$ to be filled with zeros. For completeness, Supplementary Figure 9 provides the evolution of the WAIC for $q>6$ and shows that a latent dimension choice above $q=6$ is associated with a generally higher variance, suggesting less stable models and results.

It is worth noting that, despite the agreement of MAE and WAIC for both datasets, the WAIC has the competitive advantage of providing a natural and automatic model selection measure in Bayesian models, which, unlike the MAE, does not require testing data.

In the following experiments, we set the latent space dimension to $q=5$ for the SD dataset and to $q=6$ for the ADNI dataset.

4.3.2 Increasing heterogeneity across datasets

Table 2: Results on the ADNI dataset for all scenarios, and comparison with VAE and mc-VAE.

Scenario | Centers | Method | MAE Train | MAE Test | Accuracy in LS
IID | 1 (centralized case) | Fed-mv-PPCA | 0.0805 ± 0.0003 | 0.1110 ± 0.0011 | 0.8680 ± 0.0379
IID | 1 (centralized case) | VAE | 0.1055 ± 0.0017 | 0.1344 ± 0.0019 | 0.8003 ± 0.0409
IID | 1 (centralized case) | mc-VAE | 0.1382 ± 0.0009 | 0.1669 ± 0.0020 | 0.8727 ± 0.0319
IID | 3 | Fed-mv-PPCA | 0.1027 ± 0.0015 | 0.1073 ± 0.0004 | 0.8652 ± 0.0270
IID | 3 | DP-Fed-mv-PPCA | 0.1304 ± 0.0047 | 0.1304 ± 0.0041 | 0.8321 ± 0.0388
IID | 3 | VAE | 0.1172 ± 0.0022 | 0.1192 ± 0.0015 | 0.8289 ± 0.0383
IID | 3 | mc-VAE | 0.1602 ± 0.0035 | 0.1567 ± 0.0017 | 0.8850 ± 0.0262
IID | 6 | Fed-mv-PPCA | 0.1203 ± 0.0042 | 0.1074 ± 0.0007 | 0.8742 ± 0.0267
IID | 6 | DP-Fed-mv-PPCA | 0.1489 ± 0.0051 | 0.1295 ± 0.0029 | 0.8502 ± 0.0347
IID | 6 | VAE | 0.1357 ± 0.0042 | 0.1191 ± 0.0014 | 0.8224 ± 0.0377
IID | 6 | mc-VAE | 0.1840 ± 0.0054 | 0.1563 ± 0.0017 | 0.8894 ± 0.0230
G | 3 | Fed-mv-PPCA | 0.1077 ± 0.0090 | 0.1096 ± 0.0011 | 0.8409 ± 0.0293
G | 3 | DP-Fed-mv-PPCA | 0.1362 ± 0.0117 | 0.1340 ± 0.0067 | 0.7977 ± 0.0480
G | 3 | VAE | 0.1212 ± 0.0077 | 0.1219 ± 0.0015 | 0.7962 ± 0.0440
G | 3 | mc-VAE | 0.1677 ± 0.0156 | 0.1611 ± 0.0025 | 0.8210 ± 0.0464
G | 6 | Fed-mv-PPCA | 0.1264 ± 0.0126 | 0.10912 ± 0.0011 | 0.8168 ± 0.0324
G | 6 | DP-Fed-mv-PPCA | 0.1585 ± 0.0158 | 0.1340 ± 0.0065 | 0.7898 ± 0.0407
G | 6 | VAE | 0.1401 ± 0.0114 | 0.1202 ± 0.0016 | 0.7882 ± 0.0534
G | 6 | mc-VAE | 0.1924 ± 0.0219 | 0.1589 ± 0.0018 | 0.8085 ± 0.0464
K | 3 | Fed-mv-PPCA | 0.0951 ± 0.0086 | 0.1212 ± 0.0109 | 0.8624 ± 0.0303
K | 3 | DP-Fed-mv-PPCA | 0.1208 ± 0.0081 | 0.1462 ± 0.0092 | 0.8357 ± 0.0329
K | 6 | Fed-mv-PPCA | 0.1107 ± 0.0106 | 0.1293 ± 0.0162 | 0.8720 ± 0.0308
K | 6 | DP-Fed-mv-PPCA | 0.1434 ± 0.0099 | 0.1604 ± 0.0164 | 0.8515 ± 0.0375
G/K | 3 | Fed-mv-PPCA | 0.0995 ± 0.0029 | 0.1271 ± 0.0087 | 0.7338 ± 0.0308
G/K | 3 | DP-Fed-mv-PPCA | 0.1287 ± 0.0081 | 0.1547 ± 0.0125 | 0.7164 ± 0.0474
G/K | 6 | Fed-mv-PPCA | 0.1173 ± 0.0061 | 0.1268 ± 0.0088 | 0.7469 ± 0.0202
G/K | 6 | DP-Fed-mv-PPCA | 0.1463 ± 0.0088 | 0.1523 ± 0.0104 | 0.7174 ± 0.0387

To test the robustness of Fed-mv-PPCA's results, for each scenario of Table 1 we perform 10 repetitions of 3-fold cross-validation (3-CV) to obtain train and test datasets, and then split the train dataset across $C$ centers. We compare our method to VAE and mc-VAE, using the same train/test partitions for CV. For all methods we report the MAE on both the train and test datasets, as well as the accuracy score in the Latent Space (LS) when discriminating the groups (synthetically defined in SD, or corresponding to the clinical diagnosis in ADNI). The classification was performed via Linear Discriminant Analysis (LDA) on the individual projections of the test data in the latent space.
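As an illustration of this evaluation protocol, the following sketch (using scikit-learn; the encoding function `project_to_latent` and the reconstruction function `reconstruct` are hypothetical placeholders standing in for the model-specific operations) computes the MAE and the latent-space accuracy for one train/test split:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def evaluate_split(model, t_train, t_test, y_train, y_test):
    """MAE on reconstructed views and LDA accuracy on latent projections.
    `model` is assumed to expose `reconstruct` (data -> reconstructed views)
    and `project_to_latent` (data -> latent means); both names are placeholders."""
    # Reconstruction error (mean absolute error) on train and test data
    mae_train = np.mean(np.abs(t_train - model.reconstruct(t_train)))
    mae_test = np.mean(np.abs(t_test - model.reconstruct(t_test)))
    # Discrimination in the latent space: LDA fitted on latent projections
    z_train = model.project_to_latent(t_train)
    z_test = model.project_to_latent(t_test)
    lda = LinearDiscriminantAnalysis().fit(z_train, y_train)
    acc_ls = lda.score(z_test, y_test)
    return mae_train, mae_test, acc_ls
```

The reported numbers are the mean and standard deviation of these three metrics over the 10 repetitions of 3-fold cross-validation.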

In what follows we present a detailed description of the results for the ADNI dataset. Results for the SD dataset are in line with what we observe for ADNI (see Supplementary Table 6 in Appendix B), and confirm that our method outperforms both VAE and mc-VAE in reconstruction in all scenarios. In addition, Fed-mv-PPCA outperforms both methods in discrimination in the non-iid setting, while mc-VAE shows a slightly better discriminating ability in the IID scenario.

Moreover, for the sake of completeness, Supplementary Table 7 and Supplementary Figure 11 provide results for both VAE and mc-VAE with two layers, as well as for both methods with one layer using FedProx as a robust aggregation scheme, with the proximal term $\lambda$ varying from 0.01 to 0.5: this scheme aims at improving convergence in case of heterogeneous data distributions. No significant improvement has been observed compared to the FedAvg scheme for the considered datasets and settings, while the nonlinear models are associated with only a negligible improvement in testing compared to the linear variational autoencoders.
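For context, FedProx (Li et al., 2018) modifies each client's local objective by adding a proximal term penalizing the distance to the current global weights. A minimal PyTorch-style sketch of such a local loss is given below; the `task_loss` tensor and the iterables of (local, global) parameter pairs are assumptions made purely for illustration.

```python
import torch

def fedprox_local_loss(task_loss, local_params, global_params, lam):
    """Local FedProx objective: task loss plus (lam/2) * ||w - w_global||^2.
    `local_params` and `global_params` are matching iterables of tensors;
    with lam = 0 this reduces to the plain FedAvg local objective."""
    prox = sum(torch.sum((w - w_g.detach()) ** 2)
               for w, w_g in zip(local_params, global_params))
    return task_loss + 0.5 * lam * prox
```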

IID distribution.

We consider the IID scenario and split the train dataset across 1 to 6 centers. Table 2 shows that the results of Fed-mv-PPCA are stable when moving from a centralized to a federated setting, and when considering an increasing number of centers $C$. We only observe a degradation of the MAE in the train dataset, but this does not affect the performance of Fed-mv-PPCA in reconstructing the test data. Moreover, irrespective of the number of training centers, Fed-mv-PPCA outperforms VAE and mc-VAE in reconstruction.

Heterogeneous distribution.

(a) Original space. (b) Latent space. (c) Missing views imputation.
Figure 5: G/K scenario. First two dimensions for (a) sampling from the posterior distribution of the latent variables $\mathbf{x}_{c,n}$, and (b) the predicted distribution $\mathbf{t}_{c,n}^{(k)}$ against real data. (c) Predicted testing distribution (blue curve) of sample features of the missing MRI view against real data (histogram).

We simulate an increasing degree of heterogeneity in 3 to 6 local datasets, to further challenge the models in properly recovering the global data. In particular, we consider both a non-iid distribution of subjects across centers and missing views in some local datasets. It is worth noting that scenarios implying datasets with missing views cannot be handled by either VAE or mc-VAE, hence in these cases we report only the results obtained with our method.

In Table 2 we report the average MAEs and the accuracy in the latent space for each scenario, obtained over 10 tests for the ADNI dataset. Fed-mv-PPCA is robust despite an increasing degree of heterogeneity in the local datasets. We observe a slight deterioration of the MAE in the test dataset in the more challenging non-iid cases (scenarios K and G/K), while we note a drop of the classification accuracy in the most heterogeneous setup (G/K). Nevertheless, Fed-mv-PPCA demonstrates to be more stable and to perform better than VAE and mc-VAE when statistical heterogeneity is introduced.

Figure 5 (a) shows the sampling posterior distribution of the latent variables, while in Figure 5 (b) we plot the predicted global distribution against observations, for the G/K scenario and considering 3 training centers. We notice that the variability of centers is well captured, in spite of the heterogeneity of the distribution in the latent space. In particular, centers 2 and 3 have two clearly distinct means: this is due to the fact that subjects in these centers belong to two distinct groups (AD in center 2 and NL in center 3). Despite this, Fed-mv-PPCA is able to correctly reconstruct all views, even though 2 views are completely missing in some local datasets (MRI is missing in center 2 and FDG in center 3).

After convergence of Fed-mv-PPCA, each center is supplied with the global distributions of all parameters: data corresponding to each view can therefore be simulated, even if some views are missing in the local dataset. Considering the same simulation in the challenging G/K scenario, in Figure 5 (c) we plot the global distribution of some randomly selected features of a missing imaging view in the test center, against the ground-truth density histogram from the original data. The global distribution provides an accurate description of the missing MRI view. Supplementary Figure 8 shows imputation for all features of the missing MRI and FDG views.
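Concretely, once the global parameters are available, a missing view can be simulated through the generative model of Equation (7) using the global parameter means. The sketch below is a minimal illustration (variable names are illustrative and do not correspond to the released API); in practice the latent code of a test subject would typically be inferred from the observed views rather than sampled from the prior.

```python
import numpy as np

def impute_missing_view(W_tilde_k, mu_tilde_k, sigma2_tilde_k, x=None, rng=None):
    """Simulate a missing view t^(k) = W~^(k) x + mu~^(k) + eps from the global
    parameters of view k. If no latent code x is given (e.g. the posterior mean
    inferred from the observed views), x is drawn from the prior N(0, I_q)."""
    rng = np.random.default_rng() if rng is None else rng
    q = W_tilde_k.shape[1]
    if x is None:
        x = rng.standard_normal(q)
    eps = rng.normal(scale=np.sqrt(sigma2_tilde_k), size=W_tilde_k.shape[0])
    return W_tilde_k @ x + mu_tilde_k + eps
```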

4.3.3 Differentially private Fed-mv-PPCA

We repeated all experiments described in Section 4.3.2, using Fed-mv-PPCA with differential privacy. For each parameter $\theta_c\in\boldsymbol{\theta}_c$ we set $\varepsilon=10$, $\delta=0.01$, except when stated otherwise. Finally, to perform difference clipping (Algorithm 2), we set the maximal $l_p$ norm of the difference between the updated parameter at round $r$ and the prior, $\overline{\Delta\boldsymbol{\theta}}_c[r]$, to $\sigma_{\widetilde{\boldsymbol{\theta}}}[r-1]$.
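As a rough illustration of this client-level perturbation, the sketch below clips the parameter difference to a maximal $l_2$ norm and adds Gaussian noise calibrated with the classical $(\varepsilon,\delta)$ Gaussian-mechanism bound of Dwork et al. (2014). This is a simplified stand-in for Algorithm 2, not its exact noise calibration: in particular, the classical bound only holds for $\varepsilon<1$, whereas settings such as $\varepsilon=10$ require a refined Gaussian mechanism (e.g. Zhao et al., 2019).

```python
import numpy as np

def clip_and_perturb(delta_theta, clip_norm, eps, delta, rng=None):
    """Clip the update difference to l2 norm <= clip_norm, then add Gaussian
    noise with std sigma = clip_norm * sqrt(2 * ln(1.25/delta)) / eps
    (classical Gaussian mechanism; valid only for eps < 1)."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(delta_theta)
    clipped = delta_theta * min(1.0, clip_norm / (norm + 1e-12))
    sigma = clip_norm * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped + rng.normal(scale=sigma, size=delta_theta.shape)

# In the experiments of this section, clip_norm would play the role of the
# clipping constant times sigma_theta_tilde[r-1].
```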

DP parameters utility.

We tested the utility of the global parameters obtained with the differentially private Algorithm 2, to assess whether data reconstruction and accuracy in the latent space are well preserved when the perturbation is performed at the client level (see Table 2, DP-Fed-mv-PPCA rows). As expected, we observe a deterioration of the previous results, which increases with the number of training centers, due to the communication of a larger number of perturbed parameters. Nevertheless, the results remain coherent, and illustrate the utility of the differentially private global parameters. In particular, Figure 6 shows how $\varepsilon$ and the multiplicative constant used for difference clipping affect the ability of the optimized DP global parameters to preserve a meaningful separation of subjects by diagnosis in the test set. For instance, we note that when $(\varepsilon,\delta)$ are fixed to $(1,0.01)$, the clipping constant should be at most 0.2 to preserve a reasonable utility of the model outputs, in comparison to the one obtained with Fed-mv-PPCA (reported in the "not DP" column of Figure 6). This further stresses the need for carefully tuning these DP parameters to ensure a good balance between privacy and utility.

(a) $\delta=0.01$, clipping constant $=1$. (b) $\varepsilon=1$, $\delta=0.01$.
Figure 6: DP-Fed-mv-PPCA performance in preserving the separation of subjects in the Latent Space (LS) by diagnosis, varying (a) $\varepsilon$ and (b) the multiplicative constant for difference clipping. Results for Fed-mv-PPCA (without DP) are reported for comparison purposes. $\ast\ast$: $p\leq 10^{-2}$; $\ast\ast\ast$: $p\leq 10^{-3}$; $\ast\ast\ast\ast$: $p\leq 10^{-4}$.
Evolution of the standard deviation of global parameters and convergence.

To better understand the effect of performing difference clipping with respect to the priors, in Figure 7 (a-b) we plot the median evolution of the estimated standard deviation of each global parameter during training in the G/K scenario, comparing Fed-mv-PPCA and DP-Fed-mv-PPCA. When DP is not introduced, all global parameters' standard deviations converge, as expected, indicating harmonization of the local parameters during training. In particular, it is worth noticing that the clinical view displays higher variability for both the intercept and the noise parameters. Indeed, the clinical view is the most discriminant one between healthy subjects and Alzheimer's patients, and the results plotted in Figure 7 are obtained under a non-iid scenario. Furthermore, one can notice the low magnitude of the standard deviations for $\widetilde{\mu}^{(FDG)}$ and $\widetilde{\sigma}^{(MRI)}$: this may be explained by the fact that fewer centers contribute to the estimation of the FDG- and MRI-specific parameters, since in the G/K scenario the FDG and MRI views are missing in some centers. On the other hand, when differential privacy is introduced we tend to lose information concerning the variability of the global parameters: in this case all standard deviations drop towards 0 after approximately 20 communication rounds, meaning that the final global parameter distributions are strongly concentrated around their means.

(a) Fed-mv-PPCA. (b) DP-Fed-mv-PPCA. (c) Comparison of the accuracy metric.
Figure 7: Evolution of the standard deviation (in $\log_{10}$ scale) of all global parameters for the G/K scenario using 3 centers: comparison between (a) Fed-mv-PPCA and (b) DP-Fed-mv-PPCA. (c) Accuracy in the latent space across rounds (mean and std over 10 3-CV tests performed in the IID case), comparing Fed-mv-PPCA and DP-Fed-mv-PPCA.

Finally, we empirically investigate the convergence of DP-Fed-mv-PPCA. The convergence of the EM algorithm for PPCA has already been discussed by Tipping and Bishop (1999). Nevertheless, in the case of DP-Fed-mv-PPCA, local parameter updates are performed using priors estimated at the master level from perturbed previous local updates. In addition, as noted above, the standard deviations of the global parameters used as priors tend to decrease rapidly due to the clipping mechanism. Consequently, the priors provided to the centers become increasingly informative, affecting the algorithm's convergence. Figure 7 (c) shows the mean evolution of the accuracy in the latent space (for the test dataset) during successive rounds of both Fed-mv-PPCA and DP-Fed-mv-PPCA: mean and standard deviation are obtained by repeating a 3-CV test 10 times. Although convergence seems to be reached in both cases, the parameters optimized by DP-Fed-mv-PPCA are clearly sub-optimal. Moreover, we notice a higher variability of the accuracy metric, as a consequence of the random perturbation performed over the local parameters, which in turn affects the priors. Further insights are provided in Supplementary Figure 10, showing the estimated global variance of the Gaussian noise, which is greater when using DP-Fed-mv-PPCA compared to Fed-mv-PPCA, indicating a higher estimated variability in the global dataset (i.e. the ensemble of the local datasets). This is an expected consequence of the perturbation mechanism, which necessarily affects the global model's performance.

5 Conclusions

In spite of the large amount of currently available multi-site biomedical data, we still lack reliable analysis methods that can be applied in multi-centric settings in compliance with privacy regulations. To tackle this challenge, Fed-mv-PPCA proposes a hierarchical generative model to perform data assimilation of federated heterogeneous multi-view data. The Bayesian approach allows us to naturally handle statistical heterogeneity across centers and missing views in local datasets, and to provide an interpretable model of data variability as well as a valuable tool for missing data imputation. We show that Fed-mv-PPCA can be further coupled with differential privacy. Compatibly with our Bayesian formulation, we provide formal privacy guarantees for the proposed federated learning scheme against potential private information leakage from the shared statistics.

Our applications demonstrate that Fed-mv-PPCA is robust with respect to an increasing degree of heterogeneity across training centers, and provides high-quality data reconstruction, outperforming competitive methods in all scenarios. Moreover, when differential privacy is introduced, we provide an investigation of the method's performance under different privacy budgets. It is worth noting that three DP hyperparameters play a key role and can affect the performance of DP-Fed-mv-PPCA: the privacy budget parameters $(\epsilon,\delta)$ and the clipping constant multiplying $\sigma_{\widetilde{\boldsymbol{\theta}}}$. These parameters are tightly related and all contribute to determining the magnitude of the noise used for perturbing the updated difference $\overline{\Delta\boldsymbol{\theta}}$. Indeed, increasing either $\varepsilon$ or $\delta$, or reducing the multiplicative constant in the clipping mechanism, implies the addition of smaller noise, hence an improvement of the overall utility of the global model. Nevertheless, smaller $\varepsilon$ and $\delta$ correspond to higher privacy guarantees.

Further extensions of this work are possible in several directions. The computational efficiency of Fed-mv-PPCA and its scalability to large datasets can be improved by leveraging data sparsity and by optimizing matrix multiplications and norm calculations, as shown by Elgamal et al. (2015). In addition, introducing sparsity on the reconstruction weights is also expected to improve the robustness of the approach to non-informative dimensions and modalities. Another interesting research direction concerns the handling of missing data. Indeed, in this paper we considered Missing At Random (MAR) views in local datasets due to heterogeneous pipelines (Rubin, 1976). Fed-mv-PPCA could be extended to also take into account and impute Missing Not At Random (MNAR) data, covering for instance the case of missing data due to self-censoring, of interest in the biomedical context.

In this work we adopted DP to increase our framework's security, motivated by the need to derive explicit theoretical privacy guarantees for our model. Alternatively, some recent works propose to improve data privacy (and eventually model utility) in a federated setting by generating fake data through generative adversarial networks (Rajotte et al., 2021; Rasouli et al., 2020). Although formal privacy guarantees cannot be provided by such data augmentation methods, their comparison to DP is a problem of great interest and should be further investigated.

In addition, we provided an experimental analysis of the convergence properties of DP in the proposed setting. In the future, formal convergence guarantees could be investigated, for example for the general optimization setting combining DP with EM. Furthermore, adaptive clipping strategies (Andrew et al., 2019) could be employed to improve the convergence of DP-Fed-mv-PPCA and the final utility of the global parameters. Finally, in order to improve the robustness of DP-Fed-mv-PPCA, non-Gaussian data likelihoods and priors could be introduced in the future, to better account for the heavy-tailed distributions induced by outlier datasets and centers.


Acknowledgments

This work received financial support from the French government, through the 3IA Côte d'Azur Investments in the Future project managed by the National Research Agency (ANR) with the reference number ANR-19-P3IA-0002, and from the ANR JCJC project Fed-BioMed, ref. num. 19-CE45-0006-01. The authors are grateful to the OPAL infrastructure from Université Côte d'Azur for providing resources and support.

Data collection and sharing for this project was funded by the Alzheimer's Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer's Association; Alzheimer's Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for NeuroImaging at the University of Southern California.


Ethical Standards

The work follows appropriate ethical standards in conducting research and writing the manuscript, following all applicable laws and regulations regarding treatment of animals or human subjects.


Conflicts of Interest

The authors declare that they have no conflict of interests.

References

  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • Andrew et al. (2019) Galen Andrew, Om Thakkar, H Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. arXiv preprint arXiv:1905.03871, 2019.
  • Antelmi et al. (2019) Luigi Antelmi, Nicholas Ayache, Philippe Robert, and Marco Lorenzi. Sparse multi-channel variational autoencoder for the joint analysis of heterogeneous data. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pages 302–311. PMLR, 2019. URL http://proceedings.mlr.press/v97/antelmi19a.html.
  • Argelaguet et al. (2018) R. Argelaguet, B. Velten, D. Arnol, S. Dietrich, T. Zenz, J. C. Marioni, F. Buettner, W. Huber, and O. Stegle. Multi-omics factor analysis-a framework for unsupervised integration of multi-omics data sets. Mol Syst Biol, 14(6):e8124, 2018.
  • Banerjee et al. (2017) Monami Banerjee, Bing Jian, and Baba C Vemuri. Robust fréchet mean and pga on riemannian manifolds with applications to neuroimaging. In International Conference on Information Processing in Medical Imaging, pages 3–15. Springer, 2017.
  • Chassang (2017) Gauthier Chassang. The impact of the eu general data protection regulation on scientific research. ecancermedicalscience, 11, 2017.
  • Chen et al. (2009) Tao Chen, Elaine Martin, and Gary Montague. Robust probabilistic pca with missing data and contribution analysis for outlier detection. Computational Statistics & Data Analysis, 53(10):3706–3716, 2009.
  • Cunningham and Ghahramani (2015) John P Cunningham and Zoubin Ghahramani. Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research, 16(1):2859–2900, 2015.
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • Elgamal et al. (2015) Tarek Elgamal, Maysam Yabandeh, Ashraf Aboulnaga, Waleed Mustafa, and Mohamed Hefeeda. spca: Scalable principal component analysis for big data on distributed platforms. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 79–91, 2015.
  • Fischl (2012) Bruce Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.
  • Fletcher and Zhang (2016) P Thomas Fletcher and Miaomiao Zhang. Probabilistic geodesic models for regression and dimensionality reduction on riemannian manifolds. In Riemannian Computing in Computer Vision, pages 101–121. Springer, 2016.
  • Fredrikson et al. (2015) Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1322–1333, 2015.
  • Gelman et al. (2014) Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information criteria for bayesian models. Statistics and computing, 24(6):997–1016, 2014.
  • Geyer et al. (2017) Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • Greve et al. (2014) Douglas N Greve, Claus Svarer, Patrick M Fisher, Ling Feng, Adam E Hansen, William Baare, Bruce Rosen, Bruce Fischl, and Gitte M Knudsen. Cortical surface-based analysis reduces bias and variance in kinetic modeling of brain pet data. Neuroimage, 92:225–236, 2014.
  • Hromatka et al. (2015) Michelle Hromatka, Miaomiao Zhang, Greg M Fleishman, Boris Gutman, Neda Jahanshad, Paul Thompson, and P Thomas Fletcher. A hierarchical bayesian model for multi-site diffeomorphic image atlases. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 372–379. Springer, 2015.
  • Iyengar et al. (2018) Arun Iyengar, Ashish Kundu, and George Pallis. Healthcare informatics and privacy. IEEE Internet Computing, 22(2):29–31, 2018.
  • Jolliffe (1986) Ian T Jolliffe. Principal components in regression analysis. In Principal component analysis, pages 129–155. Springer, 1986.
  • Kalter et al. (2019) Joeri Kalter, Maike G Sweegers, Irma M Verdonck-de Leeuw, Johannes Brug, and Laurien M Buffart. Development and use of a flexible data harmonization platform to facilitate the harmonization of individual patient data for meta-analyses. BMC research notes, 12(1):164, 2019.
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. Stochastic gradient vb and the variational auto-encoder. In Second International Conference on Learning Representations, ICLR, volume 19, 2014.
  • Kingma and Welling (2019) Diederik P Kingma and Max Welling. An introduction to variational autoencoders. Foundations and Trends in Machine Learning, 12(4):307–392, 2019.
  • Klami et al. (2013) Arto Klami, Seppo Virtanen, and Samuel Kaski. Bayesian canonical correlation analysis. Journal of Machine Learning Research, 14(Apr):965–1003, 2013.
  • Kramer (1991) Mark A Kramer. Nonlinear principal component analysis using autoassociative neural networks. AIChE journal, 37(2):233–243, 1991.
  • Li et al. (2018) Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
  • Li et al. (2020) Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • Liang et al. (2020) Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. Think locally, act globally: Federated learning with local and global representations. arXiv preprint arXiv:2001.01523, 2020.
  • Llera and Beckmann (2016) A Llera and CF Beckmann. Estimating an inverse gamma distribution. arXiv preprint arXiv:1605.01019, 2016.
  • Matsuura et al. (2018) Toshihiko Matsuura, Kuniaki Saito, Yoshitaka Ushiku, and Tatsuya Harada. Generalized bayesian canonical correlation analysis with missing modalities. In Proceedings of the European Conference on Computer Vision (ECCV), pages 0–0, 2018.
  • McMahan et al. (2017a) Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017a.
  • McMahan et al. (2017b) H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017b.
  • Rajotte et al. (2021) Jean-Francois Rajotte, Sumit Mukherjee, Caleb Robinson, Anthony Ortiz, Christopher West, Juan Lavista Ferres, and Raymond T Ng. Reducing bias and increasing utility by federated generative modeling of medical images using a centralized adversary. arXiv preprint arXiv:2101.07235, 2021.
  • Rasouli et al. (2020) Mohammad Rasouli, Tao Sun, and Ram Rajagopal. Fedgan: Federated generative adversarial networks for distributed data. arXiv preprint arXiv:2006.07228, 2020.
  • Rubin (1976) Donald B Rubin. Inference and missing data. Biometrika, 63(3):581–592, 1976.
  • Sattler et al. (2019) Felix Sattler, Simon Wiedemann, Klaus-Robert Müller, and Wojciech Samek. Robust and communication-efficient federated learning from non-iid data. IEEE transactions on neural networks and learning systems, 31(9):3400–3413, 2019.
  • Shen and Thompson (2019) Li Shen and Paul M Thompson. Brain imaging genomics: Integrated analysis and machine learning. Proceedings of the IEEE, 108(1):125–162, 2019.
  • Sommer et al. (2010) Stefan Sommer, François Lauze, Søren Hauberg, and Mads Nielsen. Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations. In European conference on computer vision, pages 43–56. Springer, 2010.
  • Tipping and Bishop (1999) Michael E Tipping and Christopher M Bishop. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 61(3):611–622, 1999.
  • Triastcyn and Faltings (2019) Aleksei Triastcyn and Boi Faltings. Federated learning with bayesian differential privacy. In 2019 IEEE International Conference on Big Data (Big Data), pages 2587–2596. IEEE, 2019.
  • Wei et al. (2020) Kang Wei, Jun Li, Ming Ding, Chuan Ma, Howard H Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H Vincent Poor. Federated learning with differential privacy: Algorithms and performance analysis. IEEE Transactions on Information Forensics and Security, 15:3454–3469, 2020.
  • Yurochkin et al. (2019) Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang, and Yasaman Khazaeni. Bayesian nonparametric federated learning of neural networks. In International Conference on Machine Learning, pages 7252–7261. PMLR, 2019.
  • Zhang and Fletcher (2013) Miaomiao Zhang and Tom Fletcher. Probabilistic principal geodesic analysis. Advances in Neural Information Processing Systems, 26:1178–1186, 2013.
  • Zhang et al. (2021) Xinwei Zhang, Xiangyi Chen, Mingyi Hong, Zhiwei Steven Wu, and Jinfeng Yi. Understanding clipping for federated learning: Convergence and client-level differential privacy. arXiv preprint arXiv:2106.13673, 2021.
  • Zhao et al. (2019) Jun Zhao, Teng Wang, Tao Bai, Kwok-Yan Lam, Zhiying Xu, Shuyu Shi, Xuebin Ren, Xinyu Yang, Yang Liu, and Han Yu. Reviewing and improving the gaussian mechanism for differential privacy. arXiv preprint arXiv:1911.12060, 2019.

Appendix A. Theoretical derivation of the Fed-mv-PPCA method

Problem setting

We consider $C$ centers, each center $c\in\{1,\dots,C\}$ providing data from $N_c$ subjects, each consisting of $K_c\leq K$ views. Let $d_k$ be the dimension of data corresponding to the $k^{\textrm{th}}$ view, and $d:=\sum_{k=1}^{K}d_k$.

For each $k$, $c$ and each $n\in\{1,\dots,N_c\}$, the generative model is:

$$\mathbf{t}_{c,n}^{(k)}=W_{c}^{(k)}\mathbf{x}_{c,n}+\boldsymbol{\mu}_{c}^{(k)}+\boldsymbol{\varepsilon}_{c}^{(k)},\qquad(7)$$

where:

  • $\mathbf{t}_{c,n}^{(k)}\in\mathbb{R}^{d_k}$ denotes the raw data of the $k^{\textrm{th}}$ view of the sample indexed by $n$ in center $c$, which belongs to group $g$.

  • $\mathbf{x}_{c,n}\sim\mathcal{N}(0,\mathbb{I}_q)$ is a $q$-dimensional latent variable, $q\leq\min_k(d_k)$ being a suitable user-defined latent-space dimension.

  • $W_c^{(k)}\in\mathbb{R}^{d_k\times q}$ provides the linear mapping between the two sets of variables for the $k^{\textrm{th}}$ view.

  • $\boldsymbol{\mu}_c^{(k)}\in\mathbb{R}^{d_k}$ allows data corresponding to view $k$ to have a non-zero mean.

  • $\boldsymbol{\varepsilon}_c^{(k)}\sim\mathcal{N}\left(0,{\sigma_c^{(k)}}^2\mathbb{I}_{d_k}\right)$ is the Gaussian noise for the $k^{\textrm{th}}$ view.

A compact formulation for $\mathbf{t}_{c,n}$ (i.e. considering all views concatenated) can be easily derived from Equation (7):

$$\mathbf{t}_{c,n}=W_c\mathbf{x}_{c,n}+\boldsymbol{\mu}_c+\boldsymbol{\varepsilon}_c,\qquad(8)$$

where:

  • $\mathbf{t}_{c,n}=\left[{\mathbf{t}_{c,n}^{(1)}}^{T},\dots,{\mathbf{t}_{c,n}^{(K)}}^{T}\right]^{T}\in\mathbb{R}^{d}$

  • $W_c=\left[{W_c^{(1)}}^{T},\dots,{W_c^{(K)}}^{T}\right]^{T}\in\mathbb{R}^{d\times q}$

  • $\boldsymbol{\mu}_c=\left[{\boldsymbol{\mu}_c^{(1)}}^{T},\dots,{\boldsymbol{\mu}_c^{(K)}}^{T}\right]^{T}\in\mathbb{R}^{d}$

  • $\boldsymbol{\varepsilon}_c=\left[{\boldsymbol{\varepsilon}_c^{(1)}}^{T},\dots,{\boldsymbol{\varepsilon}_c^{(K)}}^{T}\right]^{T}\sim\mathcal{N}(0,\Psi_c)$,
    where $\Psi_c$ is a block-diagonal matrix, $\Psi_c=\mathrm{diag}\left({\sigma_c^{(1)}}^{2}\mathbb{I}_{d_1},\dots,{\sigma_c^{(K)}}^{2}\mathbb{I}_{d_K}\right)$

Note that for the sake of simplicity we represented all $K$ views. If the $k^{\textrm{th}}$ view is missing in center $c$, then it is simply removed, e.g. one would have
$\mathbf{t}_{c,n}=\left[{\mathbf{t}_{c,n}^{(1)}}^{T},\dots,{\mathbf{t}_{c,n}^{(k-1)}}^{T},{\mathbf{t}_{c,n}^{(k+1)}}^{T},\dots,{\mathbf{t}_{c,n}^{(K)}}^{T}\right]^{T}$, $\mathbf{t}_{c,n}\in\mathbb{R}^{d-d_k}$.

For each center $c$ and each $k$ we want to estimate $\boldsymbol{\theta}_c:=\left\{\boldsymbol{\mu}_c^{(k)},W_c^{(k)},{\sigma_c^{(k)}}^{2}\right\}_{k=1,\dots,K_c}$, assuming that all local parameters are realizations of a common global distribution, which has to be estimated as well. The latter provides a global model, which should be able to describe the data across all centers.
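As a concrete illustration of this generative process, the following toy simulation (with assumed view dimensions and noise levels, unrelated to the released code) draws one multi-view observation for a center according to Equation (7):

```python
import numpy as np

rng = np.random.default_rng(0)
q, dims = 5, [7, 41, 41]                               # latent dimension and view dimensions d_k
W = [rng.standard_normal((d_k, q)) for d_k in dims]    # loading matrices W_c^(k), one per view
mu = [rng.standard_normal(d_k) for d_k in dims]        # view offsets mu_c^(k)
sigma2 = [0.1, 0.2, 0.2]                               # noise variances sigma_c^(k)^2

x = rng.standard_normal(q)                             # latent variable x_{c,n} ~ N(0, I_q)
t = [W_k @ x + mu_k + rng.normal(scale=np.sqrt(s2), size=len(mu_k))
     for W_k, mu_k, s2 in zip(W, mu, sigma2)]          # one observation t_{c,n}^(k) per view
```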

Parameter $\boldsymbol{\mu}$

We assume that $\forall c,k$:

$$\boldsymbol{\mu}_c^{(k)}\,|\,\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\sim\mathcal{N}\left(\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\mathbb{I}_{d_k}\right)\qquad(9)$$

Step 1. (In each center): Estimate $\boldsymbol{\mu}_c^{(k)}[s+1]$ given $\left(\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\right)[s]$ (iteration $s$ is denoted by $[s]$).

From Equation (7), the marginal distribution of $\mathbf{t}_{c,n}^{(k)}$ is:

$$\mathbf{t}_{c,n}^{(k)}\sim\mathcal{N}(\boldsymbol{\mu}_c^{(k)},C_c^{(k)}),$$

where $C_c^{(k)}=W_c^{(k)}{W_c^{(k)}}^{T}+{\sigma_c^{(k)}}^{2}\mathbb{I}_{d_k}$, $C_c^{(k)}\in\mathbb{R}^{d_k\times d_k}$.

The corresponding log-likelihood is:

$$\mathcal{L}_c^{(k)}=-\frac{1}{2}\left\{N_c d_k\ln(2\pi)+N_c\ln|C_c^{(k)}|+\sum_{n=1}^{N_c}\left(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\right)^{T}\left(C_c^{(k)}\right)^{-1}\left(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\right)\right\}$$

Therefore, for each center $c$ and for all $k\in\{1,\dots,K\}$, the following optimization problem should be considered:

$$\max_{\boldsymbol{\mu}_c^{(k)}}\;\mathcal{L}_c^{(k)}+\ln p\left(\boldsymbol{\mu}_c^{(k)}\right),$$

where:

$$\ln p\left(\boldsymbol{\mu}_c^{(k)}\right)=-\frac{1}{2\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}}\left(\boldsymbol{\mu}_c^{(k)}-\widetilde{\boldsymbol{\mu}}^{(k)}\right)^{T}\left(\boldsymbol{\mu}_c^{(k)}-\widetilde{\boldsymbol{\mu}}^{(k)}\right)+const,$$

and $const$ collects the terms which are independent of $\boldsymbol{\mu}_c^{(k)}$. We obtain:

$$\boldsymbol{\mu}_c^{(k)}[s+1]=\left[N_c\mathbb{I}_{d_k}+\frac{1}{\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}[s]}C_c^{(k)}\right]^{-1}\left[\sum_{n=1}^{N_c}\mathbf{t}_{c,n}^{(k)}+\frac{1}{\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}[s]}C_c^{(k)}\widetilde{\boldsymbol{\mu}}^{(k)}[s]\right]$$

Step 2. (In the master): Estimate $\left(\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\right)[s+1]$ given $\boldsymbol{\mu}_c^{(k)}[s+1]$ for all $c$.

Using (9), we obtain the following log-likelihood:

$$\mathcal{L}=\sum_{c=1}^{C}\ln p(\boldsymbol{\mu}_c^{(k)})=\sum_{c=1}^{C}\left\{const-\frac{1}{\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}}\|\boldsymbol{\mu}_c^{(k)}-\widetilde{\boldsymbol{\mu}}^{(k)}\|^{2}\right\}\qquad(12)$$

By imposing $\partial_{\left(\widetilde{\boldsymbol{\mu}}^{(k)},\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}\right)}\,(12)=0$ we obtain:

$$\widetilde{\boldsymbol{\mu}}^{(k)}[s+1]=\frac{1}{C}\sum_{c=1}^{C}\boldsymbol{\mu}_c^{(k)}[s+1]\qquad(13)$$

and

$$\sigma_{\widetilde{\boldsymbol{\mu}}^{(k)}}^{2}[s+1]=\frac{1}{Cd_k}\sum_{c=1}^{C}\left\|\boldsymbol{\mu}_c^{(k)}[s+1]-\widetilde{\boldsymbol{\mu}}^{(k)}[s+1]\right\|^{2}\qquad(14)$$
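A minimal numpy sketch of this update for one view, covering the local MAP step followed by the master aggregation of Equations (13)-(14), could read as follows (array names are illustrative, not the released implementation):

```python
import numpy as np

def local_mu_update(T_k, C_k, mu_tilde_k, sigma2_mu_tilde_k):
    """Local MAP update of mu_c^(k): T_k is the (N_c, d_k) data matrix of view k,
    C_k = W_c^(k) W_c^(k)^T + sigma_c^(k)^2 I is the marginal covariance."""
    N_c, d_k = T_k.shape
    A = N_c * np.eye(d_k) + C_k / sigma2_mu_tilde_k
    b = T_k.sum(axis=0) + C_k @ mu_tilde_k / sigma2_mu_tilde_k
    return np.linalg.solve(A, b)

def master_mu_update(mu_locals):
    """Master aggregation: empirical mean and variance of the local updates
    (Equations (13) and (14)); mu_locals is a list of C arrays of size d_k."""
    mu_stack = np.stack(mu_locals)                          # shape (C, d_k)
    mu_tilde = mu_stack.mean(axis=0)
    sigma2_mu_tilde = np.mean(np.sum((mu_stack - mu_tilde) ** 2, axis=1)) / mu_stack.shape[1]
    return mu_tilde, sigma2_mu_tilde
```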

Complete-data log-likelihood

From Equations (7)-(8) one can derive the following conditional distributions:

$$\mathbf{t}_{c,n}^{(k)}\,|\,\mathbf{x}_{c,n}\sim\mathcal{N}\left(W_c^{(k)}\mathbf{x}_{c,n}+\boldsymbol{\mu}_c^{(k)},\,{\sigma_c^{(k)}}^{2}\mathbb{I}_{d_k}\right)$$

and

$$\mathbf{x}_{c,n}\,|\,\mathbf{t}_{c,n}\sim\mathcal{N}\left(\Sigma_c^{-1}B_c(\mathbf{t}_{c,n}-\boldsymbol{\mu}_c),\,\Sigma_c^{-1}\right),$$

where:

  • $\Sigma_c:=\left(\mathbb{I}_q+W_c^{T}\Psi_c^{-1}W_c\right)=\left(\mathbb{I}_q+\sum_{k=1}^{K}\frac{1}{{\sigma_c^{(k)}}^{2}}{W_c^{(k)}}^{T}W_c^{(k)}\right)\in\mathbb{R}^{q\times q}$

  • $B_c:=W_c^{T}\Psi_c^{-1}=\left[\frac{{W_c^{(1)}}^{T}}{{\sigma_c^{(1)}}^{2}},\dots,\frac{{W_c^{(K)}}^{T}}{{\sigma_c^{(K)}}^{2}}\right]\in\mathbb{R}^{q\times d}$

Hence:

  • $\langle\mathbf{x}_{c,n}\rangle=\Sigma_c^{-1}B_c(\mathbf{t}_{c,n}-\boldsymbol{\mu}_c)$

  • $\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle=\Sigma_c^{-1}+\langle\mathbf{x}_{c,n}\rangle\langle\mathbf{x}_{c,n}\rangle^{T}$

The joint distribution of $\mathbf{t}_{c,n}$ and $\mathbf{x}_{c,n}$ follows ($p(\mathbf{t}_{c,n},\mathbf{x}_{c,n})=p(\mathbf{t}_{c,n}|\mathbf{x}_{c,n})\,p(\mathbf{x}_{c,n})$), hence the expectation of the complete-data log-likelihood for each center $c$ with respect to $p(\mathbf{x}_{c,n}|\mathbf{t}_{c,n})$ is:

$$\langle\mathcal{L}_{C_c}\rangle=-\sum_{n=1}^{N_c}\left\{\sum_{k=1}^{K}\left[\frac{d_k}{2}\ln\left({\sigma_c^{(k)}}^{2}\right)+\frac{1}{2{\sigma_c^{(k)}}^{2}}\|\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\|^{2}+\frac{1}{2{\sigma_c^{(k)}}^{2}}\mathrm{tr}\left({W_c^{(k)}}^{T}W_c^{(k)}\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle\right)-\frac{1}{{\sigma_c^{(k)}}^{2}}\langle\mathbf{x}_{c,n}\rangle^{T}{W_c^{(k)}}^{T}\left(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\right)\right]+\frac{1}{2}\mathrm{tr}\left(\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle\right)\right\}\qquad(15)$$
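In code, the corresponding local E-step (computing $\Sigma_c$, $B_c$ and the posterior moments above for all subjects of a center) could look like the following sketch, written under assumed array shapes and not reproducing the released implementation:

```python
import numpy as np

def local_e_step(T, W_blocks, mu_blocks, sigma2_blocks):
    """Posterior moments of the latent variables for one center.
    T: (N_c, d) concatenated views; W_blocks / mu_blocks / sigma2_blocks: per-view
    loading matrices W_c^(k), offsets mu_c^(k) and noise variances sigma_c^(k)^2."""
    q = W_blocks[0].shape[1]
    # Sigma_c = I_q + sum_k W^(k)^T W^(k) / sigma^(k)^2   (posterior precision)
    Sigma = np.eye(q) + sum(W_k.T @ W_k / s2
                            for W_k, s2 in zip(W_blocks, sigma2_blocks))
    # B_c = [W^(1)^T / sigma^(1)^2, ..., W^(K)^T / sigma^(K)^2]
    B = np.concatenate([W_k.T / s2 for W_k, s2 in zip(W_blocks, sigma2_blocks)], axis=1)
    mu = np.concatenate(mu_blocks)
    x_mean = np.linalg.solve(Sigma, B @ (T - mu).T).T       # <x_{c,n}>, shape (N_c, q)
    Sigma_inv = np.linalg.inv(Sigma)
    # <x x^T> = Sigma_c^{-1} + <x><x>^T for each subject
    x_xT = Sigma_inv[None, :, :] + np.einsum('ni,nj->nij', x_mean, x_mean)
    return x_mean, x_xT
```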

Parameter $W$

We assume that $\forall c,k$:

$$W_c^{(k)}\,|\,\widetilde{W}^{(k)},\sigma_{\widetilde{W}^{(k)}}^{2}\sim\mathcal{MN}_{d_k,q}\left(\widetilde{W}^{(k)},\mathbb{I}_{d_k},\sigma_{\widetilde{W}^{(k)}}^{2}\mathbb{I}_q\right)\qquad(16)$$

Step 1. (In each center): Estimate $W_c^{(k)}[s+1]$ given $\left(\widetilde{W}^{(k)},\sigma_{\widetilde{W}^{(k)}}^{2}\right)[s]$.

For each center $c$, we consider the following optimization problem:

$$\max_{W_c^{(k)}}\;\langle\mathcal{L}_{C_c}\rangle+\ln p\left(W_c^{(k)}\right),$$

where $\ln p\left(W_c^{(k)}\right)=-\frac{1}{2\sigma_{\widetilde{W}^{(k)}}^{2}}\mathrm{tr}\left(\|W_c^{(k)}-\widetilde{W}^{(k)}\|_2^{2}\right)+const.$
It follows:

$$W_c^{(k)}[s+1]=\left[\sum_{n=1}^{N_c}(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)})\langle\mathbf{x}_{c,n}\rangle^{T}+\frac{{\sigma_c^{(k)}}^{2}}{\sigma_{\widetilde{W}^{(k)}}^{2}[s]}\widetilde{W}^{(k)}[s]\right]\left[\sum_{n=1}^{N_c}\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle+\frac{{\sigma_c^{(k)}}^{2}}{\sigma_{\widetilde{W}^{(k)}}^{2}[s]}\mathbb{I}_q\right]^{-1}$$

Step 2. (In the master): Estimate $\left(\widetilde{W}^{(k)},\sigma_{\widetilde{W}^{(k)}}^{2}\right)[s+1]$ given $W_c^{(k)}[s+1]$ for all $c$. Proceeding as for parameter $\boldsymbol{\mu}$ and using (16):

$$\widetilde{W}^{(k)}[s+1]=\frac{1}{C}\sum_{c=1}^{C}W_c^{(k)}[s+1]\qquad(17)$$

and

$$\sigma_{\widetilde{W}^{(k)}}^{2}[s+1]=\frac{1}{Cd_kq}\sum_{c=1}^{C}\mathrm{tr}\left[\left(W_c^{(k)}[s+1]-\widetilde{W}^{(k)}[s+1]\right)^{T}\left(W_c^{(k)}[s+1]-\widetilde{W}^{(k)}[s+1]\right)\right]$$
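A corresponding numpy sketch of the local M-step update of $W_c^{(k)}$, reusing the posterior moments from the E-step, is given below (array names are illustrative; the master aggregation mirrors the one shown for $\boldsymbol{\mu}$):

```python
import numpy as np

def local_W_update(T_k, mu_k, x_mean, x_xT, sigma2_k, W_tilde_k, sigma2_W_tilde_k):
    """Local MAP update of W_c^(k).
    T_k: (N_c, d_k) data of view k; mu_k: (d_k,) view offset;
    x_mean: (N_c, q) posterior means <x>; x_xT: (N_c, q, q) posterior moments <x x^T>."""
    ratio = sigma2_k / sigma2_W_tilde_k
    lhs = (T_k - mu_k).T @ x_mean + ratio * W_tilde_k          # sum_n (t - mu) <x>^T + prior term
    rhs = x_xT.sum(axis=0) + ratio * np.eye(x_mean.shape[1])   # sum_n <x x^T> + prior term
    return lhs @ np.linalg.inv(rhs)                            # (d_k, q)
```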

Parameter $\sigma^{2}$

We assume that $\forall c,k$:

$${\sigma_c^{(k)}}^{2}\,|\,\widetilde{\sigma}^{(k)^{2}}\sim\textrm{Inverse-Gamma}(\alpha^{(k)},\beta^{(k)}),\qquad(19)$$

so that:

$$Var\left({\sigma_c^{(k)}}^{2}\right)=\frac{{\beta^{(k)}}^{2}}{(\alpha^{(k)}-1)^{2}(\alpha^{(k)}-2)}:=\widetilde{\sigma}^{(k)^{2}}\qquad(20)$$

Step 1. (In each center): Estimate ${\sigma_c^{(k)}}^{2}[s+1]$ given $\left(\alpha^{(k)},\beta^{(k)}\right)[s]$.

For each center $c$, we consider the following optimization problem:

$$\max_{{\sigma_c^{(k)}}^{2}}\;\langle\mathcal{L}_{C_c}\rangle+\ln p\left({\sigma_c^{(k)}}^{2}\right),$$

where $\ln p\left({\sigma_c^{(k)}}^{2}\right)=-(\alpha^{(k)}+1)\ln\left({\sigma_c^{(k)}}^{2}\right)-\frac{\beta^{(k)}}{{\sigma_c^{(k)}}^{2}}+const$.

It follows:

$$\begin{aligned}{\sigma_c^{(k)}}^{2}[s+1]&=\frac{1}{N_c d_k+2(\alpha^{(k)}[s]+1)}\left\{\sum_{n=1}^{N_c}\left[\|\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\|^{2}+\mathrm{tr}\left({W_c^{(k)}}^{T}W_c^{(k)}\langle\mathbf{x}_{c,n}\mathbf{x}_{c,n}^{T}\rangle\right)-2\langle\mathbf{x}_{c,n}\rangle^{T}{W_c^{(k)}}^{T}\left(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)}\right)\right]+2\beta^{(k)}[s]\right\}\\&=\frac{1}{N_c d_k+2(\alpha^{(k)}[s]+1)}\left\{\sum_{n=1}^{N_c}\left[\|(\mathbf{t}_{c,n}^{(k)}-\boldsymbol{\mu}_c^{(k)})-W_c^{(k)}\langle\mathbf{x}_{c,n}\rangle\|^{2}+\mathrm{tr}\left(W_c^{(k)}\Sigma_c^{-1}{W_c^{(k)}}^{T}\right)\right]+2\beta^{(k)}[s]\right\}\end{aligned}\qquad(21)$$

Step 2. (In the master): Estimate $\left(\alpha^{(k)},\beta^{(k)}\right)[s+1]$ given ${\sigma_c^{(k)}}^{2}[s+1]$ for all $c$.

In order to estimate the parameters of the inverse-gamma distribution, we use the (ML1) method described by Llera and Beckmann (2016).
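The ML1 estimator of Llera and Beckmann (2016) is not reproduced here; as a rough, hedged alternative, a simple moment-matching estimate of $(\alpha^{(k)},\beta^{(k)})$ from the local noise variances could be sketched as follows (this is an illustrative substitute, not the estimator used in the paper):

```python
import numpy as np

def inverse_gamma_moment_match(sigma2_locals):
    """Moment-matching fit of an Inverse-Gamma(alpha, beta) to the local noise
    variances sigma_c^(k)^2: matches the empirical mean m and variance v using
    mean = beta / (alpha - 1) and var = mean^2 / (alpha - 2) (requires alpha > 2)."""
    s = np.asarray(sigma2_locals, dtype=float)
    m, v = s.mean(), s.var(ddof=1)
    alpha = m ** 2 / v + 2.0
    beta = m * (alpha - 1.0)
    return alpha, beta
```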

Appendix B. Supplementary Tables and Figures

Table 3: Demographics of the clinical sample from the Alzheimer’s Disease Neuroimaging Initiative (ADNI).
Group | Sex | Count | Age | Range
AD | Female | 94 | 71.58 (7.59) | 55.10 - 90.30
AD | Male | 113 | 74.37 (7.19) | 55.90 - 89.30
NL | Female | 58 | 73.76 (4.61) | 65.10 - 84.70
NL | Male | 46 | 75.39 (6.58) | 59.90 - 85.60
Table 4: Data Types.
View | Dim. | Description
CLINIC | 7 | Cognitive assessments
MRI | 41 | Magnetic resonance imaging
FDG | 41 | Fluorodeoxyglucose-Positron Emission Tomography (PET)
AV45 | 41 | AV45-Amyloid PET
Table 5: Latent Space Dimension Assessment.
q | WAIC (SD) | MAE Train (SD) | MAE Test (SD) | WAIC (ADNI) | MAE Train (ADNI) | MAE Test (ADNI)
2 | -2911 | 0.0886 ± 0.0031 | 0.0879 ± 0.0024 | -4916 | 0.1240 ± 0.0017 | 0.1249 ± 0.0011
3 | -3954 | 0.0640 ± 0.0029 | 0.0662 ± 0.0038 | -5275 | 0.1170 ± 0.0028 | 0.1187 ± 0.0023
4 | -4725 | 0.0450 ± 0.0036 | 0.0485 ± 0.0042 | -6088 | 0.1113 ± 0.0016 | 0.1142 ± 0.0009
5 | -5114 | 0.0327 ± 0.0038 | 0.0375 ± 0.0054 | -6915 | 0.1064 ± 0.0017 | 0.1102 ± 0.0007
6 | -3688 | 0.0313 ± 0.0032 | 0.0366 ± 0.0052 | -7546 | 0.1028 ± 0.0015 | 0.1073 ± 0.0005
7 | 5722 | 0.0320 ± 0.0036 | 0.0373 ± 0.0053 | - | - | -
Table 6: Results on SD dataset for all scenarios, and comparison with VAE and mc-VAE.
Scenario | Centers | Method | MAE Train | MAE Test | Accuracy in LS
IID | 1 (centralized) | Fed-mv-PPCA | 0.0124 ± 3.e-5 | 0.0405 ± 0.0037 | 1 ± 0
IID | 1 (centralized) | VAE | 0.0851 ± 0.0039 | 0.1011 ± 0.0048 | 1 ± 0
IID | 1 (centralized) | mc-VAE | 0.1236 ± 0.0099 | 0.1382 ± 0.0087 | 1 ± 0
IID | 3 | Fed-mv-PPCA | 0.0320 ± 0.0024 | 0.0373 ± 0.0035 | 1 ± 0
IID | 3 | DP-Fed-mv-PPCA | 0.0858 ± 0.0111 | 0.0848 ± 0.0099 | 1 ± 0
IID | 3 | VAE | 0.0683 ± 0.0073 | 0.0702 ± 0.0073 | 1 ± 0
IID | 3 | mc-VAE | 0.1172 ± 0.0030 | 0.1146 ± 0.0046 | 1 ± 0
IID | 6 | Fed-mv-PPCA | 0.0422 ± 0.0052 | 0.0371 ± 0.0039 | 1 ± 0
IID | 6 | DP-Fed-mv-PPCA | 0.0843 ± 0.0093 | 0.0738 ± 0.0076 | 1 ± 0
IID | 6 | VAE | 0.0769 ± 0.0093 | 0.0680 ± 0.0080 | 1 ± 0
IID | 6 | mc-VAE | 0.1295 ± 0.0055 | 0.1134 ± 0.0030 | 1 ± 0
G | 3 | Fed-mv-PPCA | 0.0432 ± 0.0074 | 0.0433 ± 0.0026 | 0.9930 ± 0.0093
G | 3 | DP-Fed-mv-PPCA | 0.0960 ± 0.0151 | 0.0951 ± 0.0144 | 0.9873 ± 0.0176
G | 3 | VAE | 0.0787 ± 0.0135 | 0.0698 ± 0.0082 | 0.9835 ± 0.0272
G | 3 | mc-VAE | 0.1562 ± 0.0086 | 0.1497 ± 0.0076 | 0.9732 ± 0.0512
G | 6 | Fed-mv-PPCA | 0.0538 ± 0.0101 | 0.0420 ± 0.0048 | 0.9995 ± 0.0019
G | 6 | DP-Fed-mv-PPCA | 0.0945 ± 0.0129 | 0.0813 ± 0.0114 | 1 ± 0
G | 6 | VAE | 0.0891 ± 0.0148 | 0.0685 ± 0.0063 | 0.9918 ± 0.0428
G | 6 | mc-VAE | 0.1758 ± 0.0154 | 0.1495 ± 0.0112 | 0.9607 ± 0.0398
K | 3 | Fed-mv-PPCA | 0.0320 ± 0.0052 | 0.0455 ± 0.0069 | 1 ± 0
K | 3 | DP-Fed-mv-PPCA | 0.0922 ± 0.0137 | 0.1048 ± 0.0151 | 1 ± 0
K | 6 | Fed-mv-PPCA | 0.0402 ± 0.0065 | 0.0448 ± 0.0088 | 1 ± 0
K | 6 | DP-Fed-mv-PPCA | 0.0959 ± 0.0105 | 0.1014 ± 0.0119 | 1 ± 0
G/K | 3 | Fed-mv-PPCA | 0.0395 ± 0.0068 | 0.0567 ± 0.0108 | 0.7812 ± 0.02179
G/K | 3 | DP-Fed-mv-PPCA | 0.1144 ± 0.0215 | 0.1343 ± 0.0235 | 0.7852 ± 0.0526
G/K | 6 | Fed-mv-PPCA | 0.0499 ± 0.0104 | 0.0575 ± 0.0128 | 0.7785 ± 0.0222
G/K | 6 | DP-Fed-mv-PPCA | 0.1070 ± 0.0139 | 0.1119 ± 0.0144 | 0.7887 ± 0.0449

(a) MRI. (b) FDG.
Figure 8: Global distribution of all features of the missing views in the test dataset, for the G/K scenario. In this scenario, 1/3 of all subjects in the test dataset do not provide MRI data and 1/3 do not provide FDG data. In both figures, the blue curve denotes the predicted global distribution of all features of (a) the MRI view and (b) the FDG view. Gray histograms correspond to real data in the test dataset.
Table 7: Results on the ADNI dataset for scenario G using VAE (resp. mc-VAE) and FedProx as aggregation scheme, with the proximal term $\lambda$ varying from 0.01 to 0.5.
Centers | Method | $\lambda$ | MAE Train | MAE Test | Accuracy in LS
3 | VAE | 0 (FedAvg) | 0.1172 ± 0.0022 | 0.1192 ± 0.0015 | 0.8289 ± 0.0383
3 | VAE | 0.01 | 0.1209 ± 0.0074 | 0.1215 ± 0.0013 | 0.7962 ± 0.0438
3 | VAE | 0.05 | 0.1215 ± 0.0076 | 0.1218 ± 0.0015 | 0.8009 ± 0.0425
3 | VAE | 0.1 | 0.1214 ± 0.0075 | 0.1220 ± 0.0018 | 0.8067 ± 0.0399
3 | VAE | 0.2 | 0.1218 ± 0.0077 | 0.1221 ± 0.0016 | 0.7977 ± 0.0469
3 | VAE | 0.3 | 0.1212 ± 0.0075 | 0.1216 ± 0.0017 | 0.7865 ± 0.0443
3 | VAE | 0.4 | 0.1212 ± 0.0074 | 0.1217 ± 0.0014 | 0.8033 ± 0.0355
3 | VAE | 0.5 | 0.1214 ± 0.0077 | 0.1218 ± 0.0020 | 0.7878 ± 0.0420
3 | mc-VAE | 0 (FedAvg) | 0.1602 ± 0.0035 | 0.1567 ± 0.0017 | 0.8850 ± 0.0262
3 | mc-VAE | 0.01 | 0.1674 ± 0.0155 | 0.1605 ± 0.0028 | 0.8185 ± 0.0494
3 | mc-VAE | 0.05 | 0.1667 ± 0.0153 | 0.1604 ± 0.0028 | 0.8156 ± 0.0444
3 | mc-VAE | 0.1 | 0.1674 ± 0.0154 | 0.1609 ± 0.0022 | 0.8249 ± 0.0399
3 | mc-VAE | 0.2 | 0.1676 ± 0.0156 | 0.1610 ± 0.0025 | 0.8217 ± 0.0431
3 | mc-VAE | 0.3 | 0.1676 ± 0.0157 | 0.1610 ± 0.0029 | 0.8184 ± 0.0511
3 | mc-VAE | 0.4 | 0.1673 ± 0.0155 | 0.1607 ± 0.0021 | 0.8275 ± 0.0426
3 | mc-VAE | 0.5 | 0.1679 ± 0.0157 | 0.1613 ± 0.0025 | 0.8229 ± 0.0408
6 | VAE | 0 (FedAvg) | 0.1357 ± 0.0042 | 0.1191 ± 0.0014 | 0.8224 ± 0.0377
6 | VAE | 0.01 | 0.1400 ± 0.0114 | 0.1198 ± 0.0022 | 0.7804 ± 0.0470
6 | VAE | 0.05 | 0.1403 ± 0.0115 | 0.1203 ± 0.0021 | 0.7827 ± 0.0411
6 | VAE | 0.1 | 0.1406 ± 0.0116 | 0.1205 ± 0.0019 | 0.7847 ± 0.0531
6 | VAE | 0.2 | 0.1407 ± 0.0117 | 0.1207 ± 0.0018 | 0.7837 ± 0.0433
6 | VAE | 0.3 | 0.1404 ± 0.0115 | 0.1207 ± 0.0018 | 0.7837 ± 0.0569
6 | VAE | 0.4 | 0.1405 ± 0.0116 | 0.1203 ± 0.0020 | 0.7753 ± 0.0546
6 | VAE | 0.5 | 0.1406 ± 0.0113 | 0.1205 ± 0.0023 | 0.7776 ± 0.0501
6 | mc-VAE | 0 (FedAvg) | 0.1840 ± 0.0054 | 0.1563 ± 0.0017 | 0.8894 ± 0.0230
6 | mc-VAE | 0.01 | 0.1932 ± 0.0220 | 0.1596 ± 0.0019 | 0.8140 ± 0.0420
6 | mc-VAE | 0.05 | 0.1927 ± 0.0219 | 0.1592 ± 0.0016 | 0.8101 ± 0.0484
6 | mc-VAE | 0.1 | 0.1932 ± 0.0221 | 0.1595 ± 0.0022 | 0.8043 ± 0.0399
6 | mc-VAE | 0.2 | 0.1930 ± 0.0219 | 0.1596 ± 0.0020 | 0.8066 ± 0.0441
6 | mc-VAE | 0.3 | 0.1931 ± 0.0221 | 0.1595 ± 0.0019 | 0.8217 ± 0.0453
6 | mc-VAE | 0.4 | 0.1931 ± 0.0220 | 0.1594 ± 0.0018 | 0.8111 ± 0.0419
6 | mc-VAE | 0.5 | 0.1934 ± 0.0221 | 0.1596 ± 0.0022 | 0.8021 ± 0.0581

(a) $2\leq q\leq 15$, boxplot. (b) $2\leq q\leq 6$, boxplot.
Figure 9: Boxplots showing the evolution of the WAIC with the latent space dimension varying (a) from 2 to 15, and (b) a zoom with the latent dimension $q$ up to 6. We notice that increasing the complexity of the model provides only a relative improvement of the median WAIC score, while being associated with a higher variation, suggesting less stable results.
Figure 10: Mean and standard deviation of $\widetilde{\sigma}^{(k)^{2}}$ at each round over 10 3-CV tests performed in the IID case, using (red) Fed-mv-PPCA and (blue) DP-Fed-mv-PPCA. By model definition, $\widetilde{\sigma}^{(k)^{2}}$ represents the global variance of the Gaussian noise for the $k^{\textrm{th}}$ view. As one can see, when DP is introduced the estimated global data variance is greater for every view. This fact can affect the performance of the final global model, both for the reconstruction and the separation tasks.

(a) Mean absolute error for VAE (left) and mc-VAE (right). (b) Classification accuracy in the latent space for VAE (left) and mc-VAE (right).
Figure 11: ADNI data, scenario G with 3 centers, comparing the performance of VAE (left column) and mc-VAE (right column) either increasing the number of considered layers (2 layers), or adopting a robust aggregation scheme alternative to FedAvg (FedProx, with varying parameter $\lambda$). For the sake of comparison, in each plot the first element corresponds to the results obtained with our method, Fed-mv-PPCA, for the considered scenario. Upper row: MAE results for the (blue) train and (red) test datasets. Bottom row: accuracy in the latent space for the test dataset. For each metric we provide results obtained using a 2-layer VAE (resp. a 2-layer mc-VAE; second element of each plot) with federated averaging as aggregation scheme. Finally, results for both the MAE and the accuracy in the latent space using FedProx as aggregation scheme are provided, with the FedProx proximal parameter $\lambda$ varying between 0 and 0.5. Note that setting $\lambda=0$ recovers the FedAvg aggregation scheme.