Efficient Compression and Interpretation of Multimodal Data: Scalable Bayesian Factor Analysis

Poster No:

1366

Submission Type:

Abstract Submission

Authors:

George Hutchings¹, Thomas Nichols¹, Chris Holmes¹, Habib Ganjgahi¹

Institutions:

¹University of Oxford, Oxford, United Kingdom

First Author:

George Hutchings
University of Oxford
Oxford, United Kingdom

Co-Author(s):

Thomas Nichols
University of Oxford
Oxford, United Kingdom

Chris Holmes
University of Oxford
Oxford, United Kingdom

Habib Ganjgahi
University of Oxford
Oxford, United Kingdom

Introduction:

There is growing interest in large-scale epidemiological studies with both many individuals and many variables, and dimensionality reduction is an important tool for summarising the structure in these datasets. For example, the UK Biobank and NO.MS (the Novartis-Oxford multiple sclerosis dataset, a longitudinal study involving 8000 individuals) include a diverse assortment of variables, from binary variables to brain images. Such data, which are both high dimensional and include discrete variables, pose challenges for existing methods like ICA and PCA, which are optimal for continuous (and even Gaussian) variables and offer only heuristics for choosing the number of latent features. We propose an efficient data compression method: a Bayesian factor analysis that handles high-dimensional, mixed-modality data and infers the number of latent variables in a principled manner.

Methods:

Our method utilizes a Bayesian model optimized through the variational EM algorithm (Fig 1).

A sparsity-inducing spike-and-slab prior on the loading matrix promotes interpretable latent variables. Each element of the loading matrix follows a zero-mean normal distribution whose variance depends on its assignment: non-zero elements are assigned to the slab (with probability 1-w), which has a large variance, while zero elements are assigned to the low-variance spike (with probability w). Sparsity is induced through an Indian buffet process (IBP) prior on w, which can also shrink unimportant latent variables to 0, thereby inferring the number of latent features.
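As a rough illustration of the prior described above, the following sketch draws a loading matrix from a spike-and-slab mixture. It assumes a stick-breaking construction of the IBP (slab probabilities formed as cumulative products of Beta(alpha, 1) draws), which the abstract does not specify; the function name, `alpha`, `spike_sd` and `slab_sd` are all hypothetical choices.

```python
import numpy as np

def sample_spike_slab_loadings(p, k, alpha=2.0, spike_sd=0.01, slab_sd=1.0, seed=0):
    """Draw a (p, k) loading matrix from a spike-and-slab prior.

    Assumption (not stated in the abstract): the IBP prior is realised by
    stick-breaking, so the slab probability 1 - w decreases with column index.
    """
    rng = np.random.default_rng(seed)
    sticks = rng.beta(alpha, 1.0, size=k)
    slab_prob = np.cumprod(sticks)             # 1 - w for each column
    w = 1.0 - slab_prob                        # spike probability per column
    in_slab = rng.random((p, k)) < slab_prob   # True where the element is non-zero
    sd = np.where(in_slab, slab_sd, spike_sd)  # large variance in slab, small in spike
    loadings = rng.normal(0.0, sd)             # zero-mean normal, elementwise sd
    return loadings, in_slab, w
```

Because the slab probability shrinks with the column index, later columns are increasingly dominated by the spike, which is one way a prior of this kind can switch off surplus latent variables.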

To ensure scalability, our method uses a variational EM approach rather than more traditional MCMC, which is computationally prohibitive in high dimensions and can suffer from the label switching problem, a common issue arising from model unidentifiability. The approach maintains dependencies between variables: only the factor scores and the continuous analogues of the discrete variables are assumed to factorise. The other latent variables in the model are treated without factorisation, preserving their dependencies, unlike a purely mean-field approximation.
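To make the alternation between expectation and maximisation steps concrete, here is a minimal sketch of classical EM for an ordinary Gaussian factor analysis model. It is not the variational EM algorithm of this abstract: it has no spike-and-slab prior, no IBP and no copula, and it assumes continuous data with diagonal noise covariance. The function name and defaults are hypothetical.

```python
import numpy as np

def em_factor_analysis(X, k, n_iter=200, seed=0):
    """Classical EM for x = Lam z + eps, z ~ N(0, I), eps ~ N(0, diag(psi))."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = X - X.mean(axis=0)                       # centre each variable
    Lam = rng.normal(scale=0.1, size=(p, k))     # initial loadings
    psi = X.var(axis=0) + 1e-6                   # initial noise variances
    for _ in range(n_iter):
        # E-step: posterior moments of the factor scores
        LtPinv = Lam.T / psi                     # Lam^T Psi^{-1}, shape (k, p)
        G = np.linalg.inv(np.eye(k) + LtPinv @ Lam)   # posterior covariance
        Ez = X @ LtPinv.T @ G                    # posterior means, shape (n, k)
        Ezz = n * G + Ez.T @ Ez                  # sum over samples of E[z z^T]
        # M-step: update loadings and noise variances
        XtEz = X.T @ Ez                          # shape (p, k)
        Lam = np.linalg.solve(Ezz, XtEz.T).T     # XtEz @ inv(Ezz); Ezz is symmetric
        psi = np.mean(X**2, axis=0) - np.sum(Lam * XtEz, axis=1) / n
        psi = np.maximum(psi, 1e-6)              # guard against degenerate variances
    return Lam, psi
```

Each iteration costs a k-by-k inversion plus matrix products that are linear in the number of samples and variables, which gives a sense of why EM-style updates can scale where sampling-based inference struggles.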

For handling discrete variables, we incorporate a semiparametric Gaussian copula, providing a principled approach to address discrete data within the model.
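The idea behind a semiparametric Gaussian copula can be illustrated with a rank-based normal-scores transform, which maps each variable onto a Gaussian scale using only the ordering of its values. This is a simplified plug-in sketch: the model described here treats the continuous analogue of each discrete variable as latent rather than fixing it in advance, and the function name is hypothetical.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(x):
    """Map a 1-D sample to Gaussian scores via tie-aware mid-ranks."""
    x = np.asarray(x)
    n = x.size
    s = np.sort(x)
    left = np.searchsorted(s, x, side="left")    # number of values strictly below
    right = np.searchsorted(s, x, side="right")  # number of values at or below
    midrank = (left + right + 1) / 2.0           # average rank within ties
    u = midrank / (n + 1)                        # strictly inside (0, 1)
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(u)
```

For a binary variable this transform yields only two distinct scores, which shows why a latent-variable treatment of the continuous analogue carries more information than a fixed transformation.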

Results:

The efficacy of our model is assessed with simulation studies and real data; the most noteworthy are outlined below:

Sim i) Data are simulated from the loading matrix shown in Fig 2a; notably, the dimensionality of the data exceeds the number of samples. The recovered loading matrix is shown in Fig 2b.

Sim ii) Binary data are simulated from the loading matrix shown in Fig 2c (binarized by thresholding the continuous data at 0). Two latent dimensions share multiple covariates, making this a challenging simulation. The recovered loading matrix is shown in Fig 2d.
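A simulation of this kind might be set up as in the sketch below, which generates continuous data from a factor model and binarizes it by thresholding at 0. The loading matrix, sample size and noise level are hypothetical stand-ins, not the values used for Fig 2c.

```python
import numpy as np

def simulate_binary(loadings, n, noise_sd=1.0, seed=0):
    """Simulate continuous factor-model data and threshold it at 0."""
    rng = np.random.default_rng(seed)
    p, k = loadings.shape
    scores = rng.normal(size=(n, k))                        # latent factor scores
    noise = rng.normal(scale=noise_sd, size=(n, p))
    latent = scores @ loadings.T + noise                    # continuous data
    return (latent > 0).astype(int), latent                 # binary data, continuous data

# Hypothetical sparse loadings: both columns load on covariates 2 and 3.
loadings = np.zeros((6, 2))
loadings[0:4, 0] = 1.0
loadings[2:6, 1] = 1.0
binary, latent = simulate_binary(loadings, n=200)
```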

In both cases the true loading matrix is recovered (the loading matrix is identifiable only up to permutation of its columns), and in Sim i the number of latent variables (non-zero columns) is shrunk to the true number.
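Because the columns can come back in any order, comparing an estimate against the truth requires matching columns first. The sketch below does this by brute force over permutations, scoring each by the absolute inner products between matched columns. It assumes both matrices have the same small number of columns; the function name is hypothetical, and this is not necessarily how the figures were produced.

```python
import itertools
import numpy as np

def align_columns(est, true):
    """Reorder the columns of `est` to best match the columns of `true`."""
    k = true.shape[1]
    sim = np.abs(true.T @ est)   # sim[i, j] = |<true column i, est column j>|
    best = max(
        itertools.permutations(range(k)),
        key=lambda perm: sum(sim[i, perm[i]] for i in range(k)),
    )
    return est[:, list(best)], best
```

The search visits k! permutations, so it is only practical for a handful of latent variables; an assignment solver would be the natural replacement for larger k.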

Additionally, we apply our method to the NO.MS clinical dataset, which includes binary, count and continuous covariates, and obtain the loading matrix structure shown in Fig 2e and 2f.

Conclusions:

We propose a scalable factor analysis method designed to handle the complexities of large-scale epidemiological studies and imaging data. The method efficiently compresses high-dimensional, mixed-modality data using an approach inspired by probabilistic PCA, and is evaluated in several simulation studies and on real data.

Modeling and Analysis Methods:

Bayesian Modeling ¹

Methods Development ²

Keywords:

Statistical Methods

¹|² Indicates the priority used for review

References:

Murray, J. S., Dunson, D. B., Carin, L., & Lucas, J. E. (2013). Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, 108(502), 656-665.

Ročková, V., & George, E. I. (2018). The spike-and-slab lasso. Journal of the American Statistical Association, 113(521), 431-444.

Ročková, V., & George, E. I. (2016). Fast Bayesian factor analysis via automatic rotations to sparsity. Journal of the American Statistical Association, 111(516), 1608-1622.