A Generative Model for Missing Data in Large Epidemiological Cohorts

Poster No:

1923 

Submission Type:

Abstract Submission 

Authors:

Lav Radosavljevic1, Thomas Nichols2, Stephen Smith1

Institutions:

1University of Oxford, Oxford, Oxfordshire, 2University of Oxford, Oxford, United Kingdom

First Author:

Lav Radosavljevic  
University of Oxford
Oxford, Oxfordshire

Co-Author(s):

Thomas Nichols  
University of Oxford
Oxford, United Kingdom
Stephen Smith  
University of Oxford
Oxford, Oxfordshire

Introduction:

Large-scale epidemiological cohorts are vital for identifying risk factors of disease and increasingly include neuroimaging data. Such data inevitably contain missing values, which take two forms: unstructured and structured missingness (Mitra et al., 2023). Unstructured missingness occurs for individual subjects on occasional variables and does not reflect a large-scale pattern. Structured missingness occurs when data are missing for a particular set of variables and subjects, often corresponding to follow-up studies that include only a subset of subjects. While much research exists on missing data, it has focused on relatively small datasets and generally omits structured missingness. Participation in studies is known to be correlated with some health and socioeconomic attributes of the participants (Fry et al., 2017). There is therefore a need to evaluate missing-data methods under the type of missingness found in practice, in order to obtain a more accurate picture of how different imputation methods perform on UK Biobank data.

In this work we propose a generative model for mixed-type synthetic data that includes structured missingness. We apply the method to non-Imaging Derived Phenotypes (nIDPs) from the UK Biobank brain imaging cohort and generate synthetic data sets to evaluate the performance of commonly used imputation methods.

Methods:

We assume that our real data set conforms to these three criteria:

1. There is structured missingness (blocks of missingness caused by non-participation in extension studies) as well as unstructured missingness.

2. There is an association between inter-variable correlation and inter-variable missingness similarity.

3. Missingness is informative in the sense of missing at random (MAR): there is a relationship between missingness in a given variable and the observed elements of other variables.

We further assume that the structured missingness is entirely determined by baseline variables through a logistic model that predicts sub-study participation. For simplicity, we also assume that the unstructured missingness is missing completely at random (MCAR). Using hierarchical clustering over variable missingness patterns, we identify the sub-studies and estimate the parameters for inducing both types of missingness, which gives us our generative model.
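
As an illustration of the sub-study identification step, the following is a minimal sketch rather than the pipeline used in this work. It assumes a Jaccard distance between per-variable missingness indicators, average linkage, and a distance cut-off of 0.2, none of which are specified in the abstract; the function name and the toy data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_variables_by_missingness(X, threshold=0.2):
    """Label columns of X so that columns with near-identical missingness share a label."""
    mask = np.isnan(X)  # shape (subjects, variables); True where a value is missing
    # Jaccard distance between the missingness indicator vectors of each pair of variables
    dist = pdist(mask.T, metric="jaccard")
    tree = linkage(dist, method="average")
    # Cut the tree so that variables merged below `threshold` share a cluster label
    return fcluster(tree, t=threshold, criterion="distance")


# Toy data (hypothetical): columns 0-2 act as baseline variables,
# columns 3-5 as the variables of a single sub-study.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))
non_participants = rng.random(1000) < 0.4
X[np.ix_(non_participants, [3, 4, 5])] = np.nan  # structured block
X[rng.random(X.shape) < 0.02] = np.nan           # sparse unstructured missingness

labels = cluster_variables_by_missingness(X)
print(labels)
```

In this toy setting the three sub-study columns receive a common label, while each sparsely missing baseline column forms its own cluster. Multi-variable clusters with a high missing rate are the natural candidates for sub-studies.
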
Crucially, we will have access to ground truth for simulated data, which gives us the ability to evaluate the performance of different methods of handling missing data for analytical tasks.
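
A minimal sketch of how both kinds of missingness could be induced on a complete ground-truth matrix is given below. It is illustrative only: the Gaussian toy data stand in for mixed-type nIDPs, the coefficients, intercept, and MCAR rate are made-up values rather than estimates from UK Biobank, and the function name `induce_missingness` is hypothetical. As a simplification, participation is computed from the complete baseline values before any unstructured missingness is applied.

```python
import numpy as np


def induce_missingness(X_true, baseline_cols, substudy_cols, coef, intercept, p_mcar, rng):
    """Return a copy of X_true with structured and unstructured missingness, plus the participation flags."""
    X = X_true.astype(float).copy()
    # Logistic model: probability of sub-study participation given baseline variables
    logits = intercept + X_true[:, baseline_cols] @ coef
    p_participate = 1.0 / (1.0 + np.exp(-logits))
    participates = rng.random(X.shape[0]) < p_participate
    # Structured missingness: non-participants lose every sub-study variable
    X[np.ix_(~participates, substudy_cols)] = np.nan
    # Unstructured missingness: each cell is dropped independently (MCAR)
    X[rng.random(X.shape) < p_mcar] = np.nan
    return X, participates


rng = np.random.default_rng(0)
X_true = rng.normal(size=(2000, 5))  # ground truth, fully observed
baseline_cols = [0, 1]               # assumed baseline variables
substudy_cols = [3, 4]               # assumed variables of one sub-study
X_miss, participates = induce_missingness(
    X_true, baseline_cols, substudy_cols,
    coef=np.array([1.5, -1.0]), intercept=-0.5, p_mcar=0.02, rng=rng,
)
print("participation rate:", participates.mean())
print("missing fraction per variable:", np.isnan(X_miss).mean(axis=0))
```

Because `X_true` is kept alongside `X_miss`, any imputed value can be scored directly against its true counterpart.
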
Supporting Image: flowchart_analysis_OHBM_2.png
   ·Analysis Pipeline
Supporting Image: flowchart_generation_OHBM_1.png
   ·Generation Pipeline
 

Results:

We evaluated the imputation accuracy of different methods from standard libraries over 20 synthetic data sets. Iterative imputation methods had the best overall performance. However, structurally missing data proved very difficult to impute, likely because highly correlated variables tend to come from the same sub-study and are therefore mostly jointly missing. This was further demonstrated by comparing the results for our synthetic data sets with those for data sets that have exactly the same ground truth but completely unstructured missingness. The simulation results indicate that standard methods of handling missing data for analysis might give results similar to those of mean or median imputation.
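
The comparison between structured and unstructured missingness on the same ground truth can be sketched as follows. This is a toy illustration under stated assumptions, not the evaluation reported here: the abstract does not name the libraries used, so scikit-learn's `SimpleImputer` and `IterativeImputer` are assumed as examples of standard methods, and the two-block Gaussian data (an independent "baseline" block and "sub-study" block) are invented for the example.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (exposes IterativeImputer)
from sklearn.impute import IterativeImputer, SimpleImputer

rng = np.random.default_rng(0)
n = 600
# Two independent latent factors: columns 0-2 mimic baseline variables,
# columns 3-5 mimic highly correlated variables from one sub-study.
za, zb = rng.normal(size=(2, n, 1))
X_true = np.hstack([za + 0.3 * rng.normal(size=(n, 3)),
                    zb + 0.3 * rng.normal(size=(n, 3))])
sub_cols = [3, 4, 5]

# Structured: the whole sub-study block is missing for about 40% of subjects
structured = np.zeros(X_true.shape, dtype=bool)
structured[np.ix_(rng.random(n) < 0.4, sub_cols)] = True
# Unstructured: the same cells are candidates, but each is dropped independently
unstructured = np.zeros(X_true.shape, dtype=bool)
unstructured[:, 3:] = rng.random((n, 3)) < 0.4


def imputation_rmse(imputer, X_true, mask):
    """Mask X_true, impute, and score the imputed cells against the ground truth."""
    X_obs = X_true.copy()
    X_obs[mask] = np.nan
    X_imp = imputer.fit_transform(X_obs)
    return float(np.sqrt(np.mean((X_imp[mask] - X_true[mask]) ** 2)))


results = {}
for pattern, mask in [("structured", structured), ("unstructured", unstructured)]:
    results[(pattern, "mean")] = imputation_rmse(SimpleImputer(strategy="mean"), X_true, mask)
    results[(pattern, "iterative")] = imputation_rmse(
        IterativeImputer(max_iter=10, random_state=0), X_true, mask)

for key, value in results.items():
    print(key, round(value, 3))
```

In this construction the iterative imputer can borrow strength from observed sub-study variables only when the missingness is unstructured; when the block is jointly missing, the only observed predictors are the unrelated baseline variables.
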
We illustrated the conclusions of our simulation study by comparing the results of variable selection among approximately 15,000 nIDPs for predicting gray matter volume, with imputation as a pre-processing step. Here, iterative imputation again gave the best results, but they were not much better than those obtained with mean imputation.

Conclusions:

The results from our simulation study as well as our illustrative example show that for analytical tasks performed on data with a highly structured missingness pattern, simple methods such as mean imputation will often give results that are similar to more advanced methods found in standard libraries. This means that further research is needed to develop methods that will handle this type of data.

Modeling and Analysis Methods:

Methods Development 1
Multivariate Approaches 2

Keywords:

Data analysis
Statistical Methods
Other - Missing Data

1|2 Indicates the priority used for review

References:

Mitra, R., McGough, S. F., Chakraborti, T., Holmes, C., Copping, R., Hagenbuch, N., Biedermann, S., Noonan, J., Lehmann, B., Shenvi, A., et al. (2023). Learning from data with structured missingness. Nature Machine Intelligence, 5(1), 13–23.
Fry, A., Littlejohns, T. J., Sudlow, C., Doherty, N., Adamska, L., Sprosen, T., Collins, R., Allen, N. E. (2017). Comparison of sociodemographic and health-related characteristics of UK Biobank participants with those of the general population. American Journal of Epidemiology, 186(9), 1026–1034.