Clarifying the reliability paradox: poor test-retest reliability attenuates group differences

Poster No:

1851 

Submission Type:

Abstract Submission 

Authors:

Povilas Karvelis1, Andreea Diaconescu1,2,3,4

Institutions:

1CAMH, Toronto, ON, 2Department of Psychology, University of Toronto, Toronto, ON, 3Institute of Medical Sciences, University of Toronto, Toronto, ON, 4Department of Psychiatry, University of Toronto, Toronto, ON

First Author:

Povilas Karvelis  
CAMH
Toronto, ON

Co-Author:

Andreea Diaconescu  
CAMH|Department of Psychology, University of Toronto|Institute of Medical Sciences, University of Toronto|Department of Psychiatry, University of Toronto
Toronto, ON|Toronto, ON|Toronto, ON|Toronto, ON

Introduction:

Cognitive tasks that produce robust group effects tend to have poor test-retest reliability – a phenomenon known as the reliability paradox (Hedge et al., 2018). This holds for simple summary statistics of task behavior (Hedge et al., 2018), for computational measures obtained by modelling task behavior (Karvelis et al., 2023), and for task-based fMRI activations (Elliott et al., 2020). Most of the literature on this issue highlights how poor test-retest reliability undermines correlational individual-differences research as well as translational personalized and precision psychiatry efforts. Our aim here is to demonstrate that poor test-retest reliability is detrimental not only for studying individual differences, but also for studying group differences (e.g., patients vs. controls).

Methods:

To illustrate our argument, we ran model simulations. We generated synthetic datasets with varying levels of between-subject and error variance and investigated how these affected test-retest reliability, individual differences, within-subject effects, and between-subject effects. We used the intra-class correlation coefficient (ICC) to estimate test-retest reliability, Cohen's d to estimate group-difference effect sizes, and Pearson's r to estimate correlations. While our analysis is general and applies to any comparison of groups, to make the demonstration more intuitive we considered two illustrative cases: 1) comparing patients vs. controls and 2) comparing two groups created via a median split.
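For concreteness, a minimal Python sketch of one such simulation is given below. This is an illustrative reimplementation rather than our exact analysis code: the function names and parameter values are assumptions, and the two-session ICC is approximated here by the between-session Pearson correlation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_sessions(n=200, var_between=1.0, var_error=1.0, group_shift=0.5):
    """Simulate two test-retest sessions for two equal-sized groups."""
    # Stable per-subject 'true' score: between-subject variance plus a
    # constant shift for the patient group (group_shift in true-score units).
    group = np.repeat([0, 1], n // 2)  # 0 = controls, 1 = patients
    true_score = rng.normal(0.0, np.sqrt(var_between), n) + group_shift * group
    # Independent measurement noise in each session (error variance).
    session1 = true_score + rng.normal(0.0, np.sqrt(var_error), n)
    session2 = true_score + rng.normal(0.0, np.sqrt(var_error), n)
    return group, session1, session2

def cohens_d(x, y):
    # Pooled-SD standardized mean difference.
    pooled_sd = np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2.0)
    return (x.mean() - y.mean()) / pooled_sd

group, s1, s2 = simulate_sessions(var_between=1.0, var_error=1.0)
reliability = stats.pearsonr(s1, s2)[0]           # two-session reliability proxy
d_obs = cohens_d(s1[group == 1], s1[group == 0])  # observed group difference
print(f"reliability ~ {reliability:.2f}, observed d = {d_obs:.2f}")
```

Sweeping var_between and var_error over a grid in this way traces out how test-retest reliability, observed correlations, and observed group effect sizes co-vary.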

Results:

First, our simulations reproduce the reliability paradox and clarify the intuition behind it: robust group effects are achieved by minimizing overall variance, not just between-subject variance (Fig. 1). Second, and most importantly, our simulations show that poor test-retest reliability attenuates observed between-subject effects just as much as it attenuates observed correlations – this was equally true in both cases under consideration (patients vs. controls and median split; Fig. 2).
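These attenuation effects are consistent with classical psychometric results and can be summarized analytically. As a hedged sketch, assuming the standard decomposition of observed variance into between-subject and error variance:

$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_e^2}, \qquad r_{\mathrm{obs}} = r_{\mathrm{true}}\sqrt{\mathrm{ICC}_x\,\mathrm{ICC}_y}, \qquad d_{\mathrm{obs}} \approx d_{\mathrm{true}}\sqrt{\mathrm{ICC}}$$

When one of the two variables (e.g., group membership) is measured without error, the observed correlation also shrinks by a factor of $\sqrt{\mathrm{ICC}}$, which is why the attenuation of r and d is identical in the two cases.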
Supporting Image: Fig1.png
Supporting Image: Fig2.png

Conclusions:

Our work highlights that the reliability paradox has even wider implications than originally stated: low test-retest reliability undermines not only individual differences but also group differences research. Note that this applies not only to studying patient groups but also to many other areas of research: sex differences, ethnic differences, age differences, etc. Overall, our findings further stress that improving test-retest reliability of cognitive measures is of paramount importance for improving the quality of research.

Modeling and Analysis Methods:

Activation (e.g., BOLD task-fMRI) 2
Methods Development 1
Other Methods

Keywords:

Cognition
Computational Neuroscience
Modeling
Psychiatric Disorders
Other - Reliability

1|2 indicates the priority used for review

References:

Elliott, M. L., et al. (2020). What is the test-retest reliability of common task-functional MRI measures? New empirical evidence and a meta-analysis. Psychological Science, 31(7), 792-806.

Hedge, C., et al. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50(3), 1166-1186.

Karvelis, P., et al. (2023). Individual differences in computational psychiatry: A review of current challenges. Neuroscience & Biobehavioral Reviews, 105137.