Evaluation of imputation methods for missing values in large-scale neuroimaging data

Poster No:

1985 

Submission Type:

Abstract Submission 

Authors:

YiFan Li1, Meng Liang1

Institutions:

1School of Medical Technology, Tianjin Medical University, Tianjin, Tianjin

First Author:

YiFan Li  
School of Medical Technology, Tianjin Medical University
Tianjin, Tianjin

Co-Author:

Meng Liang  
School of Medical Technology, Tianjin Medical University
Tianjin, Tianjin

Introduction:

Advances in data acquisition and analysis techniques have made it possible to establish large-scale neuroimaging datasets. However, missing values are inevitable in such datasets and can adversely affect subsequent analyses. Although many imputation methods have been proposed to address missing values, there is no consensus on which method performs best, especially for neuroimaging data. Existing comparisons of imputation methods for neuroimaging data have focused on a few specific situations, and evidence from large-scale neuroimaging data is still lacking. We therefore compared the performance of complete case analysis (CCA) and four imputation methods in multiple neuroimaging scenarios to provide a reference for selecting imputation methods in neuroimaging studies.

Methods:

The Chinese Imaging Genetics (CHIMGEN) cohort (Xu et al., 2020) was used. This study was approved by the local ethics committee, and written informed consent was obtained from each participant. We compared the efficacy of five methods for handling missing values in two application scenarios: first, prediction of structural/functional brain imaging measures from non-brain (behavioral) data (Scenario 1); second, prediction of non-brain data from brain imaging data (Scenario 2). In Scenario 1, 46 California Verbal Learning Test (CVLT) scores were used to predict the total gray matter volume (TGMV) obtained from structural MRI (CVLT_TGMV dataset). This dataset included 6962 subjects, of whom 924 had missing CVLT values with different missing patterns. In Scenario 2, regional homogeneity (ReHo) measures of 116 brain regions obtained from resting-state functional MRI were used to predict subjects' gender (ReHo_Gender dataset). This dataset included 6953 subjects, of whom 763 had missing ReHo values in eight cerebellar regions. These two datasets are referred to as the real datasets, and the two models above are referred to as the "prediction models".
For each dataset, 80% of the subjects without missing values were randomly sampled, and this sampling was repeated 100 times. For each sample, a series of simulated datasets with different percentages of missing values was created by deleting different numbers of values according to the missing patterns of the real data. The missing values in the simulated data were then handled with the following five methods: CCA, regression imputation, mean imputation, expectation-maximization (EM) imputation, and multiple imputation (MI) (van Buuren et al., 2011).
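The simulation step can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy data, and the choice to express the missing percentage as a fraction of subjects are all assumptions made for illustration, and only two of the five methods (mean imputation and CCA) are shown.

```python
import numpy as np

def apply_missing_patterns(complete, patterns, frac_missing, rng):
    """Copy `complete` and give a fraction of its rows a missingness pattern.

    `patterns` is a boolean array (True = missing) holding patterns observed
    in the real data; each affected row receives one pattern drawn at random.
    Treating `frac_missing` as a fraction of subjects is an assumption.
    """
    simulated = complete.astype(float).copy()
    n_rows = complete.shape[0]
    n_hit = int(round(frac_missing * n_rows))
    rows = rng.choice(n_rows, size=n_hit, replace=False)
    picks = rng.integers(0, patterns.shape[0], size=n_hit)
    mask = np.zeros(complete.shape, dtype=bool)
    mask[rows] = patterns[picks]
    simulated[mask] = np.nan
    return simulated, mask

def mean_impute(data):
    """Replace each missing value with the mean of its column's observed values."""
    col_means = np.nanmean(data, axis=0)
    return np.where(np.isnan(data), col_means, data)

def complete_cases(data):
    """Complete case analysis: keep only rows without any missing value."""
    return data[~np.isnan(data).any(axis=1)]

# Toy stand-in for the complete subjects (hypothetical data, not CHIMGEN).
rng = np.random.default_rng(0)
complete = rng.normal(size=(200, 5))
# Two hypothetical missingness patterns over the five variables.
patterns = np.array([[True, False, False, False, False],
                     [False, True, True, False, False]])

simulated, mask = apply_missing_patterns(complete, patterns, 0.2, rng)
imputed = mean_impute(simulated)
cca = complete_cases(simulated)
```

Because the deleted values are known, `imputed[mask]` can be compared directly with `complete[mask]`, which is what makes the evaluation possible.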
The performance of the five methods was assessed in three respects: (1) differences between the imputed and the real values, (2) differences between the coefficients of the prediction models estimated from the imputed data and those estimated from the real data, and (3) differences in prediction accuracy between the imputed data and the real data. These differences were quantified with three error measures: normalized root mean square error (NRMSE), percent bias (PB), and mean absolute error (MAE). The methods were then ranked on each error measure.
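The three error measures can be sketched in Python as follows. The abstract does not give the exact formulas, so the definitions below are assumptions: NRMSE is normalized by the standard deviation of the true values, and PB is taken as the mean absolute relative difference expressed in percent. Other conventions exist, such as normalizing by the range or the mean.

```python
import numpy as np

def nrmse(estimate, truth):
    """Root mean square error divided by the standard deviation of `truth`
    (the normalization is an assumed convention)."""
    rmse = np.sqrt(np.mean((estimate - truth) ** 2))
    return rmse / np.std(truth)

def percent_bias(estimate, truth):
    """Mean absolute relative difference in percent (assumed definition);
    requires that `truth` contains no zeros."""
    return 100.0 * np.mean(np.abs((estimate - truth) / truth))

def mae(estimate, truth):
    """Mean absolute error."""
    return np.mean(np.abs(estimate - truth))

# Toy values for illustration only.
truth = np.array([2.0, 4.0, 5.0, 10.0])
estimate = np.array([2.5, 3.0, 5.0, 12.0])

scores = {
    "NRMSE": nrmse(estimate, truth),
    "PB": percent_bias(estimate, truth),
    "MAE": mae(estimate, truth),
}
```

The same functions apply to all three comparisons described above: `estimate` and `truth` can hold imputed versus real values, model coefficients, or prediction accuracies.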

Results:

The performance of the five methods is shown in Figs. 1 & 2. Overall, MI performed well across error measures and datasets: according to NRMSE, MI ranked second in all scenarios; according to PB, MI ranked first on the CVLT_TGMV dataset and second on the ReHo_Gender dataset; according to MAE, MI ranked second on both datasets. CCA generally performed poorly in all scenarios, and the performance of the other methods varied across error measures.
Supporting Image: Fig1_RealDataset_line_3-01.png
Supporting Image: Fig2_Rank_radar_3.png
 

Conclusions:

In large-scale neuroimaging studies, imputing missing values can improve model correctness compared with CCA (i.e., simply deleting subjects with missing values). In general, MI was the most robust method and outperformed most other imputation methods regardless of data type and error measure, and is therefore recommended for handling missing values in neuroimaging studies.

Modeling and Analysis Methods:

Motion Correction and Preprocessing 2
Univariate Modeling
Other Methods 1

Keywords:

Data analysis
MRI
Other - imputation comparison; missing data; multiple imputation; big data

1|2 Indicates the priority used for review

References:

van Buuren, S., et al. (2011), 'mice: Multivariate Imputation by Chained Equations in R', Journal of Statistical Software, vol. 45, no. 3, pp. 1-67.
Xu, Q., et al. (2020), 'CHIMGEN: a Chinese imaging genetics cohort to enhance cross-ethnic and cross-geographic brain research', Molecular Psychiatry, vol. 25, no. 3, pp. 517-529.