How to Properly Select Covariates in Neuroimaging Data Analysis

Poster No:

1887 

Submission Type:

Abstract Submission 

Authors:

Gang Chen (1), Zhengchen Cai (2), Paul Taylor (1)

Institutions:

(1) National Institutes of Health, Bethesda, MD; (2) The Neuro (Montreal Neurological Institute-Hospital), McGill University, Montreal, Québec

First Author:

Gang Chen  
National Institutes of Health
Bethesda, MD

Co-Author(s):

Zhengchen Cai  
The Neuro (Montreal Neurological Institute-Hospital), McGill University
Montreal, Québec
Paul Taylor  
National Institutes of Health
Bethesda, MD

Introduction:

Covariates play a pivotal role in data analysis, yet their inclusion often lacks a clear rationale. Recent developments have shown how easily covariates can lead to biased or spurious estimates, harming interpretation, understanding, and reproducibility. We address three common issues: indiscriminate inclusion of variables, lack of justification, and overlooked nuances in reporting. These problems arise when variables such as reaction time, height, weight, head size, and cortical thickness are added to a model without scrutiny, potentially distorting results.

Consider an illustrative scenario involving short-term memory (STM) as the response variable and gray matter density (GMD) as the predictor. Sex, age, intracranial volume (ICV), APOE genotype, and body weight are considered as covariates, leading to four questions:

1) Predictor vs. Response Variable: Is it justifiable to invert the roles of GMD and STM, making GMD the voxel-level response variable for ease of implementation?
2) Covariates: Should all five covariates be included?
3) Result Reporting/Interpretability: Is it appropriate to report all parameter estimates from a single GLM?
4) Experimental Design: What variables could be omitted, and what other variables might have been incorporated to improve estimation?

Methods:

Common justifications for including covariates are availability, precedent, and statistical evidence, the last often quantified through metrics such as p-values or R². These rationales deserve scrutiny. For instance, when the focus is on statistical inference, can the statistical evidence for a covariate justify its inclusion? Furthermore, adding a covariate might either strengthen or weaken the statistical evidence for a predictor, so what criteria should guide the selection?
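The question of whether a covariate's own statistical evidence can justify its inclusion can be probed with a small simulation. The sketch below uses simulated data only, not the study data; the effect sizes, sample size, random seed, and the helper `ols` are assumptions chosen for illustration. A weight-like variable is generated as a shared effect of GMD and STM and is then added to the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

def ols(y, *columns):
    """Least-squares fit with an intercept (illustrative helper).

    Returns (slopes, t-values, R^2); the intercept is dropped from the
    first two outputs.
    """
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta[1:], (beta / se)[1:], r2

# Simulated (not real) data: GMD raises STM with an assumed true slope of 0.5.
gmd = rng.normal(size=n)
stm = 0.5 * gmd + rng.normal(size=n)
# Hypothetical shared effect of GMD and STM (a collider).
weight = gmd + stm + rng.normal(size=n)

slopes_1, t_1, r2_1 = ols(stm, gmd)
slopes_2, t_2, r2_2 = ols(stm, gmd, weight)

print(f"STM ~ GMD:          GMD slope = {slopes_1[0]:+.2f}, R^2 = {r2_1:.2f}")
print(f"STM ~ GMD + weight: GMD slope = {slopes_2[0]:+.2f}, R^2 = {r2_2:.2f}")
print(f"t-value of weight in the second model: {t_2[1]:.1f}")
```

Under these assumed coefficients, the population slope of GMD changes from 0.5 to -0.25 once the shared effect is added, while R² rises from 0.20 to 0.60 and the added covariate carries a very large t-value. Strong statistical evidence for a covariate therefore says nothing about whether it belongs in the model.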

Can a model be reliably appraised solely on the basis of its output? We advocate adopting principles that are independent of model output, which avoids circular reasoning and interpretational ambiguity. Prior information and domain knowledge should guide this decision.

A principled approach rests on understanding the causal relationships among variables, which can be depicted with directed acyclic graphs (DAGs)[1,2]. Covariates fall into three types (Fig 1A): confounders (shared causes of the predictor and the response variable), colliders (shared effects of both), and mediators (effects of the predictor that in turn cause the response variable). Three rules follow: include confounders, exclude colliders, and exclude mediators when the total effect is of interest or include them when the direct effect is of interest. For covariates associated with the predictor or the response variable, but not both, consider only a parent of the response variable, which improves precision (Fig 1B). These rules, rooted in causal inference, are supported by formal derivations and have been validated through simulations[3]. Neglecting them risks bias: underestimation, overestimation, sign reversal, effect suppression, or spurious effects.
Supporting Image: fig1.png
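The confounder and mediator rules, and the precision remark about parents of the response variable, can be illustrated with a minimal linear simulation. This is a sketch under stated assumptions: all data are simulated, the variable names (`c`, `m`, `a`, `z`), path coefficients, and the helper `ols` are hypothetical, and every relationship is taken to be linear with independent Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

def ols(y, *columns):
    """Fit y on the given columns plus an intercept (illustrative helper).

    Returns (slopes, standard errors) with the intercept excluded.
    """
    X = np.column_stack([np.ones_like(y), *columns])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta[1:], se[1:]

# Confounder: c -> x and c -> y; the assumed true effect of x on y is 0.5.
c = rng.normal(size=n)
x = c + rng.normal(size=n)
y = 0.5 * x + c + rng.normal(size=n)
b_conf_omitted = ols(y, x)[0][0]      # biased; population value is 1.0
b_conf_included = ols(y, x, c)[0][0]  # population value is 0.5

# Mediator: x -> m -> y plus a direct path x -> y.
x = rng.normal(size=n)
m = 0.8 * x + rng.normal(size=n)
y = 0.3 * x + 0.5 * m + rng.normal(size=n)
b_total = ols(y, x)[0][0]      # total effect: 0.3 + 0.8 * 0.5 = 0.7
b_direct = ols(y, x, m)[0][0]  # direct effect: 0.3

# Precision: a is a parent of y only; z is a parent of x only.
a = rng.normal(size=n)
z = rng.normal(size=n)
x = z + rng.normal(size=n)
y = 0.5 * x + a + rng.normal(size=n)
se_base = ols(y, x)[1][0]
se_with_a = ols(y, x, a)[1][0]  # smaller: a absorbs residual variance in y
se_with_z = ols(y, x, z)[1][0]  # larger: z removes usable variation in x

print(f"confounder omitted / included: {b_conf_omitted:.2f} / {b_conf_included:.2f}")
print(f"mediator omitted (total) / included (direct): {b_total:.2f} / {b_direct:.2f}")
print(f"SE of x slope: base {se_base:.4f}, +a {se_with_a:.4f}, +z {se_with_z:.4f}")
```

In the third scenario the slope of `x` is unbiased in all three fits; only its standard error changes, which is why a parent of the response variable is worth including and a parent of the predictor alone is not.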
 

Results:

Using the causal relationships among variables (Fig 2), the four example questions above can be addressed:

1) It is more plausible that GMD influences STM than the reverse, so STM should remain the response variable.
2) Age and sex (confounders) are included; APOE (influencing only the response variable) is included to improve precision; weight (a collider) is excluded; ICV (influencing only GMD) is left out.
3) Estimates of the age and sex effects should not be reported unless their direct effects are the focus.
4) Weight (a collider) could be omitted from data collection when examining the GMD-STM relationship; additional variables such as sleep hours would improve the experimental design.
Supporting Image: fig2.png
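The covariate decisions listed above can be written down as a small lookup over a DAG. The sketch below is illustrative only: the edge list is an assumption reconstructed from the roles stated in the list (the figure itself is not reproduced here), the function `classify` is hypothetical, and it inspects direct edges only, which suffices for this example but is not a substitute for a full back-door analysis.

```python
# Assumed edges (cause, effect), inferred from the roles stated in the text.
edges = {
    ("age", "GMD"), ("age", "STM"),
    ("sex", "GMD"), ("sex", "STM"),
    ("GMD", "STM"),
    ("APOE", "STM"),
    ("ICV", "GMD"),
    ("GMD", "weight"), ("STM", "weight"),
}

def classify(covariate, predictor, response, edges):
    """Label a covariate by its direct edges to the predictor and response."""
    causes_x = (covariate, predictor) in edges
    causes_y = (covariate, response) in edges
    caused_by_x = (predictor, covariate) in edges
    caused_by_y = (response, covariate) in edges
    if causes_x and causes_y:
        return "confounder: include"
    if caused_by_x and caused_by_y:
        return "collider: exclude"
    if caused_by_x and causes_y:
        return "mediator: include for direct effect, exclude for total effect"
    if causes_y:
        return "parent of response only: include for precision"
    if causes_x:
        return "parent of predictor only: omit"
    return "no direct edge to predictor or response: omit"

covariates = ["sex", "age", "ICV", "APOE", "weight"]
roles = {v: classify(v, "GMD", "STM", edges) for v in covariates}
for name, role in roles.items():
    print(f"{name:>6}: {role}")
```

Writing the assumed DAG down explicitly before fitting any model keeps the selection independent of the model output, and makes the assumptions open to critique.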
 

Conclusions:

A model reveals associations, but interpreting them requires an understanding of the underlying causal relationships. DAGs aid covariate selection, guide experimental design, and promote analytical rigor. Causal thinking fosters theoretical hypotheses, transparency, and reproducibility in neuroimaging data analysis. Emphasizing these aspects refines covariate selection and supports more robust and meaningful neuroimaging studies.

Modeling and Analysis Methods:

Methods Development 1
Other Methods 2

Keywords:

Data analysis
Design and Analysis
Modeling
Statistical Methods

1|2 Indicates the priority used for review

[1] Pearl, J., 2009. Causal inference in statistics: An overview. Statistics Surveys 3, 96–146.
[2] Pearl, J., Mackenzie, D., 2018. The Book of Why: The New Science of Cause and Effect. Basic Books, New York.
[3] Wysocki, A.C., Lawson, K.M., Rhemtulla, M., 2022. Statistical Control Requires Causal Justification. Advances in Methods and Practices in Psychological Science 5, 25152459221095823.