The Hypothesis Race Model for evaluation of research findings

Poster No:

1364 

Submission Type:

Abstract Submission 

Authors:

Robert Kelly1, Matthew Hoptman2

Institutions:

1Weill Cornell Medicine, White Plains, NY, 2Clinical Research Division, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, NY

First Author:

Robert Kelly, M.D.  
Weill Cornell Medicine
White Plains, NY

Co-Author:

Matthew Hoptman, Ph.D.  
Clinical Research Division, Nathan S. Kline Institute for Psychiatric Research
Orangeburg, NY

Introduction:

Empirical results from individual research experiments are commonly evaluated with Null Hypothesis Significance Testing (NHST). The "replication crisis" refers to concerns that statistically significant findings derived from NHST often are not replicable. Benjamin et al. (2018) suggested that we "improve reproducibility" for "claims of new discoveries" by reducing the conventional α-level from 0.05 to 0.005, while increasing sample sizes to maintain study power. However, doing so can increase the total cost of "confirming" hypotheses over multiple studies. We demonstrate this problem using a Bayesian model called the Hypothesis Race Model (HRM, Fig. 1) for the case where many hypotheses are tested simultaneously, each with a small initial probability of being true. For this "Horse Race" scenario, cost-efficiency improves by reducing sample sizes and focusing testing on hypotheses that best progress toward "confirmation."

Methods:

From Bayes' theorem, R1 = BF1*R0, where R0 is the prior odds that our hypothesis is true (Ht; Hf if false), R1 is the posterior odds that our hypothesis is true after obtaining the results, T1, of an experiment, and BF1 = P(T1 | Ht)/P(T1 | Hf). For the binary outcomes of statistically significant (+) or not (-), BF1+ = (1-β)/α and BF1- = β/(1-α), where 1-β is the power of the study and α is the α-level for statistical significance. In the nonbinary approach, T1 is a measured continuous test statistic, and P(T1 | Ht) and P(T1 | Hf) are the probability densities for obtaining T1 given Ht or Hf, respectively.
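As a minimal sketch, the binary odds update above can be written as follows; the α, power, and prior-odds values here are illustrative assumptions, not the poster's fitted numbers:

```python
# Binary-outcome Bayes factor and odds update (sketch; alpha, power,
# and the prior odds below are illustrative assumptions).

def binary_bayes_factor(alpha, power, significant):
    """BF1+ = (1 - beta)/alpha if significant; BF1- = beta/(1 - alpha) otherwise."""
    return power / alpha if significant else (1.0 - power) / (1.0 - alpha)

def update_odds(prior_odds, bayes_factor):
    """Bayes' theorem in odds form: R1 = BF1 * R0."""
    return bayes_factor * prior_odds

alpha, power = 0.05, 0.80
r0 = 1 / 999  # prior odds for a 1-in-1000 hypothesis
r_plus = update_odds(r0, binary_bayes_factor(alpha, power, True))
r_minus = update_odds(r0, binary_bayes_factor(alpha, power, False))
```

A single significant result at α = 0.05 with 80% power multiplies the odds by only 16, which is why a hypothesis with low prior odds needs several favorable results before R can exceed a confirmation threshold.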

Our illustrative "experiment": measuring tumor-cell reduction after exposure to a chemical in a petri dish, testing 10,000 chemicals, each with a 1/1000 chance of being a "winner," until we find 5 winners. The 10,000 hypotheses were simulated with a database of candidate chemicals having effect sizes of Cohen's d = 0, except for 10 cases (the winners) with d = 0.2, 0.4, or 0.8. We assessed statistical significance with one-tailed paired t-tests comparing the mean before-versus-after difference in cell counts to 0, with df = N-1, where N was the number of petri dishes in each sample. We recorded the total number of petri dishes required over multiple trials with sample sizes of 5, 10, 20, 40, 80, or 160.
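One trial of this setup can be sketched as a simulation on standardized before-minus-after differences; the sample size N = 5, the unit noise scale, and the hard-coded one-tailed critical value t(0.95, df = 4) ≈ 2.132 are illustrative assumptions:

```python
import math
import random
import statistics

# Sketch of one simulated petri-dish trial: a one-tailed paired t-test
# on standardized before-minus-after differences. N = 5 dishes and the
# critical value t(0.95, df=4) ~ 2.132 are illustrative assumptions.

def simulate_trial(d, n=5, t_crit=2.132):
    """Return (t statistic, significant?) for a chemical with true effect size d."""
    diffs = [random.gauss(d, 1.0) for _ in range(n)]  # standardized differences
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, t > t_crit

random.seed(0)
hits_null = sum(simulate_trial(0.0)[1] for _ in range(200))   # should be near alpha * 200
hits_large = sum(simulate_trial(1.5)[1] for _ in range(200))  # high power for a large d
```

For d = 0 the significance rate stays near the α-level, while a large effect (d = 1.5) is detected in most trials even with only 5 dishes, illustrating why small-N trials can still move a hypothesis along the race.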

A Monte Carlo simulation found the winners by repeatedly picking and testing the hypothesis with the highest R-value, confirming it as a winner when R > 100. A computer program generated the test statistic Tj (for the jth trial) corresponding to a randomly selected point on the cumulative t-distribution with df = N-1, centered on d for the tested hypothesis. For the binary HRM, we conservatively estimated BFj from the effect-size estimate de = 0.2, calculating a fixed value of α for each N such that the average distance traversed toward confirmation was optimized (work not shown). The estimated value of β followed from the t-distribution corresponding to de = 0.2. The nonbinary HRM followed the same pattern, except that α did not need to be specified.
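The race loop for the nonbinary HRM can be sketched as follows, with normal densities standing in for the t densities (an approximation), and with the number of hypotheses, sample size, prior odds, and seed chosen only for illustration:

```python
import math
import random

# Toy nonbinary Hypothesis Race: repeatedly test the hypothesis with the
# highest posterior odds R, update R with a density-ratio Bayes factor,
# and confirm a winner when R > 100. Normal densities approximate the
# t densities; all numeric settings are illustrative assumptions.

N = 20                   # petri dishes per trial
DE = 0.2                 # assumed effect size d_e under Ht
SE = 1.0 / math.sqrt(N)  # standard error of the mean standardized difference

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def run_race(true_d, n_hyp=50, prior_odds=1/999, threshold=100.0, max_trials=10000):
    """Return (confirmed index, trials used), or (-1, max_trials) if none confirmed."""
    d = [0.0] * n_hyp
    d[0] = true_d                                  # one winner hidden among nulls
    R = [prior_odds] * n_hyp
    for trial in range(1, max_trials + 1):
        j = max(range(n_hyp), key=lambda k: R[k])  # test the current front-runner
        x = random.gauss(d[j], SE)                 # observed standardized effect
        R[j] *= normal_pdf(x, DE, SE) / normal_pdf(x, 0.0, SE)
        if R[j] > threshold:
            return j, trial
    return -1, max_trials

random.seed(1)
winner, trials = run_race(true_d=0.8)
```

Because each trial's Bayes factor is a density ratio, no α needs to be specified in this nonbinary version; a hypothesis whose R drops after an unfavorable result simply loses its turn to the new front-runner.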

For each sample size, 500 iterations were performed for the binary and nonbinary applications. Mean values for the total number of petri dishes needed were compared using two-tailed, unpaired t-tests, with p < 0.001, uncorrected, considered statistically significant. Each of these 6 values was compared pairwise within each group (binary or nonbinary) and across groups, for a total of 36 comparisons.

Results:

Smaller trial sizes (N = 5, 10 or 20) were significantly more cost-effective in terms of total number of petri dishes needed than larger trial sizes (N = 40, 80 or 160) for both the binary and nonbinary HRM (Fig. 2). The percentage of incorrectly identified winners per iteration ranged from 0.04% to 0.68% for the 12 cases.

Conclusions:

A Bayesian perspective that considers results from multiple trials can help to estimate and reduce costs of evaluating hypotheses through empirical testing.

Modeling and Analysis Methods:

Bayesian Modeling 1
Methods Development 2

Keywords:

Data analysis
Design and Analysis
Modeling
Statistical Methods

1|2 Indicates the priority used for review
Supporting Image: OHBM_2024_Poster_figs_draft3_fig1_100dpi.png
   ·Introduction
Supporting Image: OHBM_2024_Poster_figs_draft3_fig2_100dpi.png
   ·Results
 

References:

Aarts, A. A. et al. (2015), ‘Estimating the reproducibility of psychological science’, Science, vol. 349, no. 6251.
Benjamin, D. J. et al. (2018), ‘Redefine statistical significance’, Nature Human Behaviour, vol. 2, no. 1, pp. 6–10.
Ioannidis, J. P. A. (2005), ‘Why Most Published Research Findings Are False’, PLoS Medicine, vol. 2, no. 8, e124.
Kelly, R. E. & Hoptman, M. J. (2022), ‘Replicability in Brain Imaging’, Brain Sciences, vol. 12, no. 3, 397.
Wagenmakers, E. J. (2007), ‘A practical solution to the pervasive problems of p values’, Psychonomic Bulletin & Review, vol. 14, no. 5, pp. 779–804.