Power Contours: Optimising Sample Size and Precision in Experimental Psychology and Human Neuroscience

[!info]- Info Zotero

Authors:: Daniel H. Baker, Greta Vilidaite, Freya A. Lygo, Anika K. Smith, Tessa R. Flack, André D. Gouws, Timothy J. Andrews

[!abstract]- When designing experimental studies with human participants, experimenters must decide how many trials each participant will complete, as well as how many participants to test. Most discussion of statistical power (the ability of a study design to detect an effect) has focused on sample size, and assumed sufficient trials. Here we explore the influence of both factors on statistical power, represented as a 2-dimensional plot on which iso-power contours can be visualized. We demonstrate the conditions under which the number of trials is particularly important, that is, when the within-participant variance is large relative to the between-participants variance. We then derive power contour plots using existing data sets for 8 experimental paradigms and methodologies (including reaction times, sensory thresholds, fMRI, MEG, and EEG), and provide example code to calculate estimates of the within- and between-participants variance for each method. In all cases, the within-participant variance was larger than the between-participants variance, meaning that the number of trials has a meaningful influence on statistical power in commonly used paradigms. An online tool is provided (https://shiny.york.ac.uk/powercontours/) for generating power contours, from which the optimal combination of trials and participants can be calculated when designing future studies.

Many studies in neuroscience and experimental psychology involve testing human participants multiple times in a given condition, and averaging across these repetitions to get a more accurate estimate of the true response. Yet most researchers do not have a principled way to decide how many trials they should conduct, and decisions are often made using arbitrary criteria. This is an important issue because the number of trials has a direct effect on the statistical power of a study: the likelihood that it is able to detect a real effect. In the context of the recent “replication crisis” in psychology, researchers need tools to optimize the quality of their research designs to increase power. Here we propose a way to visualize the combined effect of sample size (the number of participants tested) and number of trials per participant on statistical power, using a two-dimensional contour plot. We show by subsampling eight existing data sets from a range of widely used methods (including reaction times, EEG, MEG, and fMRI) that these contours are curved, and permit estimation of an optimal number of participants and trials at the study design stage. All of the analysis scripts, as well as an online tool, are provided to permit others to tailor our methods to their own experimental paradigms. We anticipate that this approach will facilitate the design of experimental studies that are more efficient, and more likely to report real effects.

Annotations: Power Contours: Optimising Sample Size and Precision in Experimental Psychology and Human Neuroscience

Key Points

Two important points here: [@bakerPowerContoursOptimising2021, p. 295]
1. how many trials
2. how many participants

experimenters must decide how many trials each participant will complete, as well as how many participants to test. [@bakerPowerContoursOptimising2021, p. 295]
This is key. Power contours are useful [@bakerPowerContoursOptimising2021, p. 298]

power contours offer a useful summary of the effect of possible experimental designs on statistical power [@bakerPowerContoursOptimising2021, p. 298]
This is key. Too many trials is not really advantageous. There is a cut-off point. [@bakerPowerContoursOptimising2021, p. 298]

If relatively few participants are available (perhaps because of financial constraints, or testing of a clinical population) then the number of trials can be increased. Note, however, that beyond a particular number of trials (around k = 50 in Figure 1h), the function asymptotes and further trials are not beneficial. [@bakerPowerContoursOptimising2021, p. 298]
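
The asymptote can be seen algebraically. This is only a sketch: it assumes the sample standard deviation combines the between- and within-participant sources as written below, and the symbols $M$ (mean effect), $\sigma_b$, $\sigma_w$ and $\sigma_s$ are chosen here for illustration.

$$
\sigma_s(k) = \sqrt{\sigma_b^2 + \frac{\sigma_w^2}{k}}, \qquad d(k) = \frac{M}{\sigma_s(k)}, \qquad \lim_{k \to \infty} d(k) = \frac{M}{\sigma_b}
$$

Once $\sigma_w^2 / k$ is small relative to $\sigma_b^2$, further trials barely change $d$, so only a larger $N$ can raise power any further.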

Background

Cohen’s d is basically the standard effect size measure in psychology [@bakerPowerContoursOptimising2021, p. 296]

Power is typically derived using effect size measures such as Cohen’s d (Cohen, 1988), which depends on the sample mean (or difference in means), and also the sample standard deviation (formally $d = M/\sigma_s$). [@bakerPowerContoursOptimising2021, p. 296]
Statistical power is the ability to detect an effect [@bakerPowerContoursOptimising2021, p. 296]

Statistical power is the ability of a study design with a given sample size to detect an effect of a particular magnitude. [@bakerPowerContoursOptimising2021, p. 296]
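
To make the link between effect size, sample size and power concrete, here is a minimal sketch. It is not the paper's code: the function name `approx_power` is hypothetical, and it uses a normal approximation to a two-sided one-sample test, which is slightly optimistic for small samples.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample test of size alpha.

    Assumption: the test statistic is treated as normal rather than t
    distributed, so the result is slightly optimistic for small n.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)  # expected test statistic when the true effect size is d
    # Probability of landing in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)


for n in (10, 20, 50):
    print(f"d = 0.5, N = {n}: power ~ {approx_power(0.5, n):.2f}")
```

With d = 0 the function returns alpha, which is a quick sanity check that the rejection regions are set up correctly.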

Low powered studies are less able to detect a true effect (and so make more Type II errors) [@bakerPowerContoursOptimising2021, p. 296]

So basically there is a massive problem in psychology: the reproducibility (or replicability) problem. Many published studies that report significant results are low-powered. [@bakerPowerContoursOptimising2021, p. 296]

because of publication bias (whereby significant findings are more likely to be published than nonsignificant ones) published low powered studies will also have a high Type I error (false positive) rate. [@bakerPowerContoursOptimising2021, p. 296]

and estimates of power across studies in the neurosciences (Button et al., 2013) yield power values in the range 8–30%, far below the desired level of 80% [@bakerPowerContoursOptimising2021, p. 296]

It’s true that there may be high variability within participants. To reduce this variability you can increase the number of trials. This in turn increases the overall power of the study, because you are better able to detect an effect if there is one to detect. [@bakerPowerContoursOptimising2021, p. 296]

In domains where the dependent variable is subject to high within-participant variance (as is potentially the case in psychology and neuroscience studies), increasing the precision of the per-participant estimate can therefore greatly increase overall power, perhaps reducing the number of participants required for a study (see Cleary & Linn, 1969; Phillips & Jiang, 2016). [@bakerPowerContoursOptimising2021, p. 296]

When the dependent variable of interest can be estimated with high precision, repeated measurements provide little benefit, and the main source of variance is between participants. [@bakerPowerContoursOptimising2021, p. 296]

Within-participant variance can be large in an experiment, for example when the tool used to measure the participant is imprecise. [@bakerPowerContoursOptimising2021, p. 296]

A more realistic situation for many experimental paradigms is shown in Figure 1b, where the variance of each individual estimate is large, as indicated by the horizontal standard error bars. [@bakerPowerContoursOptimising2021, p. 296]

Hypothesis / Positive

Something not normally considered is that the number of trials influences the power of a study, not just the sample size. [@bakerPowerContoursOptimising2021, p. 296]

However there is a second degree of freedom available to many experimenters at the study design stage—the number of repetitions (or trials) of a given experimental condition by each participant. [@bakerPowerContoursOptimising2021, p. 296]

there is no widely used procedure for quantitatively determining the appropriate number of trials to run. [@bakerPowerContoursOptimising2021, p. 296]

we advocate a useful representation, the power contour plot—a two-dimensional representation of power as the joint function of sample size (N) and number of trials (k). [@bakerPowerContoursOptimising2021, p. 296]

So they also show the power contour plots in action [@bakerPowerContoursOptimising2021, p. 296]

We then use a subsampling method to explore the joint effects of sample size and number of trials on real data sets using common methodologies and paradigms in psychology and neuroscience research. [@bakerPowerContoursOptimising2021, p. 296]
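
A subsampling estimate of power can be sketched as follows. This is an illustration under stated assumptions, not the authors' scripts: `subsample_power`, `t_critical` and `fake_dataset` are hypothetical names, participants and trials are drawn without replacement, the test is a one-sample t-test against zero, and the critical value comes from a series approximation to the t quantile rather than an exact routine. The data set here is simulated purely so the example runs.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def t_critical(df: int, alpha: float = 0.05) -> float:
    """Approximate two-sided t critical value (series expansion around the normal quantile)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z + (z**3 + z) / (4 * df) + (5 * z**5 + 16 * z**3 + 3 * z) / (96 * df**2)


def subsample_power(data, n, k, rng, n_resamples=300, alpha=0.05):
    """Share of random subsamples (n participants, k trials each) that reach significance."""
    crit = t_critical(n - 1, alpha)
    hits = 0
    for _ in range(n_resamples):
        people = rng.sample(data, n)  # participants, without replacement
        means = [mean(rng.sample(trials, k)) for trials in people]
        sd = stdev(means)
        if sd > 0 and abs(mean(means)) / (sd / sqrt(n)) > crit:
            hits += 1
    return hits / n_resamples


def fake_dataset(n_people, n_trials, mu, sigma_b, sigma_w, rng):
    """Simulated stand-in for a real data set: one list of trial values per participant."""
    data = []
    for _ in range(n_people):
        true_value = rng.gauss(mu, sigma_b)  # this participant's true score
        data.append([rng.gauss(true_value, sigma_w) for _ in range(n_trials)])
    return data


rng = random.Random(0)
data = fake_dataset(60, 100, mu=2.0, sigma_b=2.0, sigma_w=10.0, rng=rng)
for n, k in ((10, 5), (10, 80), (40, 5), (40, 80)):
    print(f"N = {n:>2}, k = {k:>2}: power ~ {subsample_power(data, n, k, rng):.2f}")
```

Evaluating the function over a grid of N and k values gives the surface from which a power contour plot is drawn.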

Methods / Process

Both factors are explored in this paper [@bakerPowerContoursOptimising2021, p. 295]

Here we explore the influence of both factors on statistical power, represented as a 2-dimensional plot on which iso-power contours can be visualized. [@bakerPowerContoursOptimising2021, p. 295]
They show that the number of trials matters most when the within-participant variance is large compared with the between-participants variance [@bakerPowerContoursOptimising2021, p. 295]

demonstrate the conditions under which the number of trials is particularly important, that is, when the within-participant variance is large relative to the between-participants variance [@bakerPowerContoursOptimising2021, p. 295]
Description of power contours [@bakerPowerContoursOptimising2021, p. 298]

Consider first the situation described above, in which the dependent variable of interest can be estimated accurately from a single trial, but individuals all express different true values of the variable (formally, the within-participant variance is low, but the between-participants variance is high, $\sigma_w \ll \sigma_b$). [@bakerPowerContoursOptimising2021, p. 298]
Combinations of sample size and number of trials that give the same statistical power show power equivalence. In the special case where measurement is already accurate, like the age and height example, extra trials add nothing and power depends only on sample size. [@bakerPowerContoursOptimising2021, p. 298]

Examples might include variables such as age and height, for which there is low measurement error and minimal variation from moment to moment, or for which tools exist (such as tape measures) to facilitate accurate measurement. In these situations, statistical power is a function of sample size and effect size (Figure 1d), where effect size is Cohen’s d. Clearly, in such a situation, testing each participant multiple times should confer no advantage. We can represent the power as a function of both sample size and number of trials using a two-dimensional plot such as the one shown in Figure 1g. Here the lines trace iso-power contours: combinations of sample size and number of trials that result in the same statistical power (this property is sometimes referred to as power equivalence, see von Oertzen, 2010). [@bakerPowerContoursOptimising2021, p. 298]
So repeated measurements with a noisy measurement tool will result in better power. Here the number of trials absolutely does matter! [@bakerPowerContoursOptimising2021, p. 298]

a situation where the individual measurements are very noisy (high within-participant variance relative to the between-participants variance, $\sigma_w \gg \sigma_b$). The sample standard deviation decreases as a function of the number of trials (Figure 1e), as the estimated mean for each participant becomes more accurate with repeated measurements. [@bakerPowerContoursOptimising2021, p. 298]
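
The shrinking sample standard deviation can be checked with a small simulation. This is a sketch with made-up parameter values, assuming independent Gaussian trial noise; `simulate_sample_sd` and `predicted_sample_sd` are hypothetical helper names, and the prediction uses the relation sqrt(sigma_b^2 + sigma_w^2 / k).

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_sd(n, k, mu, sigma_b, sigma_w, rng):
    """SD across participants of each participant's mean over k noisy trials."""
    participant_means = []
    for _ in range(n):
        true_value = rng.gauss(mu, sigma_b)  # participant's true score
        trials = [rng.gauss(true_value, sigma_w) for _ in range(k)]
        participant_means.append(mean(trials))
    return stdev(participant_means)


def predicted_sample_sd(k, sigma_b, sigma_w):
    """Expected sample SD if trial noise is independent and averages out as 1/k."""
    return sqrt(sigma_b**2 + sigma_w**2 / k)


rng = random.Random(1)
for k in (1, 4, 16, 64):
    simulated = simulate_sample_sd(n=2000, k=k, mu=1.0, sigma_b=1.0, sigma_w=4.0, rng=rng)
    predicted = predicted_sample_sd(k, sigma_b=1.0, sigma_w=4.0)
    print(f"k = {k:>2}: simulated SD = {simulated:.2f}, predicted SD = {predicted:.2f}")
```

The simulated and predicted values should track each other closely, and both flatten out toward sigma_b as k grows.
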
So if you want to achieve power of 80% you can choose among several combinations of sample size and number of trials [@bakerPowerContoursOptimising2021, p. 298]

80%, indicated by the thick blue curves on the power contour plots) can be obtained from multiple combinations of sample size and trial number. [@bakerPowerContoursOptimising2021, p. 298]
If the number of trials has to be low (for example, when testing children), you just have to increase the sample size to achieve the same power. [@bakerPowerContoursOptimising2021, p. 298]

if each participant must be tested very rapidly (e.g., for studies involving children), but many participants are available, the number of trials could be kept relatively low (here around k = 20), and a larger sample size tested. [@bakerPowerContoursOptimising2021, p. 298]
The null condition used randomly chosen event times within the experiment time course (not the true event timings) and then extracted the beta weights [@bakerPowerContoursOptimising2021, p. 304]

To provide a null condition, we repeated the analysis using randomly determined events within the experiment time-course (i.e., not using the true event timings). This generated the sample distributions of beta weights shown in Figure 7d, and resulted in an effect size of d 0.9 for the full data set. [@bakerPowerContoursOptimising2021, p. 304]
So they used the beta estimates from GLMs fitted with differing numbers of trials to build power contours, simulating experiments with different numbers of trials [@bakerPowerContoursOptimising2021, p. 304]

We then fit the GLM to determine a regression (beta) weight for the target condition to use as our dependent variable. By varying the number of trials allocated to the target and nontarget conditions, we were able to simulate experiments with different numbers of trials, [@bakerPowerContoursOptimising2021, p. 304]

Results / Data

In all cases examined, the within-participant variance is larger than the between-participants variance [@bakerPowerContoursOptimising2021, p. 295]

the within-participant variance was larger than the between-participants variance [@bakerPowerContoursOptimising2021, p. 295]
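
One way to run the same comparison on your own data is a method-of-moments estimate of the two variance components. This is a sketch, not the example code the authors provide: `variance_components` is a hypothetical helper, it assumes every participant has the same number of trials, and it treats trials as independent. The data below are simulated so the example is runnable.

```python
import random
from math import sqrt
from statistics import mean, variance


def variance_components(data):
    """Estimate (sigma_b, sigma_w) from per-participant lists of trial values.

    Assumes a balanced design (same k for everyone) and independent trials.
    """
    k = len(data[0])
    if any(len(trials) != k for trials in data):
        raise ValueError("all participants need the same number of trials")
    within_var = mean([variance(trials) for trials in data])
    var_of_means = variance([mean(trials) for trials in data])
    # Participant means still carry within_var / k of trial noise; remove it.
    between_var = max(var_of_means - within_var / k, 0.0)
    return sqrt(between_var), sqrt(within_var)


rng = random.Random(3)
data = []
for _ in range(400):
    true_value = rng.gauss(0.0, 2.0)  # simulated sigma_b = 2
    data.append([rng.gauss(true_value, 10.0) for _ in range(50)])  # simulated sigma_w = 10

sigma_b, sigma_w = variance_components(data)
print(f"between ~ {sigma_b:.2f}, within ~ {sigma_w:.2f}")
```

The between-participants estimate is clipped at zero because, with few participants, the subtraction can go slightly negative by chance.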

Implications / ToDo

Therefore the number of trials meaningfully affects statistical power! [@bakerPowerContoursOptimising2021, p. 295]

meaning that the number of trials has a meaningful influence on statistical power in commonly used paradigms. [@bakerPowerContoursOptimising2021, p. 295]

This is an important issue because the number of trials has a direct effect on the statistical power of a study [@bakerPowerContoursOptimising2021, p. 296]

We anticipate that this approach will facilitate the design of experimental studies that are more efficient, and more likely to report real effects. [@bakerPowerContoursOptimising2021, p. 296]

A typical situation is where the experimenter wants minimal sample size and testing time and the knee-point of the power contour permits the optimization of both. [@bakerPowerContoursOptimising2021, p. 298]

A more typical situation is one in which an experimenter wishes to minimize both sample size and testing time— here values around the knee-point of the power contour permit joint optimization of both parameters. [@bakerPowerContoursOptimising2021, p. 298]
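
A rough way to locate a design near the knee is to attach a cost to participants and trials and pick the cheapest design on the 80% contour. The cost model below (a fixed cost per participant plus a cost per trial) and all numeric values are assumptions for illustration, not something the paper prescribes; `required_n` and `cheapest_design` are hypothetical names, and the required N uses the usual normal-approximation formula.

```python
from math import ceil, sqrt
from statistics import NormalDist


def required_n(k, effect, sigma_b, sigma_w, power=0.8, alpha=0.05):
    """Approximate N needed for the target power when each participant does k trials."""
    z = NormalDist()
    d = effect / sqrt(sigma_b**2 + sigma_w**2 / k)
    return ceil(((z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) / d) ** 2)


def cheapest_design(max_k, per_participant_cost, per_trial_cost, effect, sigma_b, sigma_w):
    """Scan k = 1..max_k and return the (N, k, cost) with the lowest total cost."""
    best = None
    for k in range(1, max_k + 1):
        n = required_n(k, effect, sigma_b, sigma_w)
        cost = n * (per_participant_cost + k * per_trial_cost)
        if best is None or cost < best[2]:
            best = (n, k, cost)
    return best


n, k, cost = cheapest_design(
    max_k=500, per_participant_cost=20.0, per_trial_cost=0.1,
    effect=1.0, sigma_b=2.0, sigma_w=10.0,
)
print(f"cheapest design on the 80% contour: N = {n}, k = {k}, cost = {cost:.1f}")
```

Changing the two cost parameters moves the chosen point along the contour, which is why any optimum depends on the practical constraints of the study.
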
Event-related designs have higher power [@bakerPowerContoursOptimising2021, p. 304]

such that 80% power could be maintained for sample sizes from N = 20 to N = 600, simply by varying the number of trials. This flexibility allows event-related designs to achieve high statistical power even with relatively modest sample sizes, [@bakerPowerContoursOptimising2021, p. 304]
This makes sense because adding more trials in a block design doesn’t increase the power as much as it does for an event-related design [@bakerPowerContoursOptimising2021, p. 306]

This pattern is somewhat different from the event-related fMRI results discussed previously (see Figure 7), where adding more trials continued to increase power across the entire range. [@bakerPowerContoursOptimising2021, p. 306]
They used V1, so of course the effect size is large here compared to other regions! [@bakerPowerContoursOptimising2021, p. 306]

For the larger effects (Figure 8h–j), power was high even with the relatively small samples (N 20) typical of many neuroimaging studies (Button et al., 2013). Of course looking for responses to visual stimuli in V1 is guaranteed to produce large effect sizes—most fMRI studies are designed to test subtler effects which will inevitably be smaller than in the examples here. [@bakerPowerContoursOptimising2021, p. 306]

The “take-home” message is that measurement precision is as vital as sample size. Instead of following “rules of thumb” (e.g., “we always run 30 participants”), researchers should use power contours to quantitatively determine the most efficient combination of trials and participants for their specific experimental goals.

Main takeaway

The paper argues that statistical power in psychology and neuroscience depends on BOTH:

- How many participants you test (N)
- How many trials each participant completes (k)

Most researchers only think about sample size (N). This paper shows that adding more trials per participant can dramatically increase power, especially when measurements are noisy.

The authors propose a way to visualize this tradeoff using power contour plots.

What are “power contours”?

Power contours are 2D plots showing combinations of:

- participant number (N)
- trial number per participant (k)

that produce the same statistical power.

Think of them like a topographic map:

- each contour line = same power level (e.g. 80%)
- you can move along the curve (fewer participants + more trials, OR more participants + fewer trials) and still maintain equivalent power.

The paper calls these iso-power contours.
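
An iso-power contour can be traced numerically by asking, for each trial count, what the smallest N is that reaches the target power. The sketch below is an illustration only: it uses a normal approximation to the test, made-up values for the effect and the two standard deviations, and hypothetical function names (`power`, `iso_power_contour`).

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()


def power(n, k, effect, sigma_b, sigma_w, alpha=0.05):
    """Approximate two-sided power for n participants with k trials each."""
    d = effect / sqrt(sigma_b**2 + sigma_w**2 / k)  # effect size after averaging k trials
    z_crit = Z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n)
    return (1 - Z.cdf(z_crit - shift)) + Z.cdf(-z_crit - shift)


def iso_power_contour(trial_counts, target, effect, sigma_b, sigma_w, max_n=5000):
    """For each k, the smallest N whose power reaches the target (None if not reached)."""
    contour = {}
    for k in trial_counts:
        contour[k] = next(
            (n for n in range(2, max_n + 1)
             if power(n, k, effect, sigma_b, sigma_w) >= target),
            None,
        )
    return contour


trial_counts = (1, 5, 10, 20, 50, 100, 200)
contour = iso_power_contour(trial_counts, 0.8, effect=1.0, sigma_b=2.0, sigma_w=10.0)
for k in trial_counts:
    print(f"k = {k:>3}: N >= {contour[k]}")
```

Plotting N against k from this table gives the 80% contour; repeating it for other targets gives the full set of contour lines.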

The central theoretical idea

The key equation is:

$$
\sigma_s = \sqrt{\sigma_b^2 + \frac{\sigma_w^2}{k}}
$$

Where:

- $\sigma_s$ = overall sample variability
- $\sigma_b$ = between-participant variability
- $\sigma_w$ = within-participant variability (noise)
- $k$ = number of trials

The important insight:

If within-participant noise ($\sigma_w$) is large, then increasing trials ($k$) reduces noise substantially. That improves effect size and statistical power.

Why this matters

The paper is framed in the context of the:

- replication crisis
- underpowered studies
- inflated effect sizes
- false positives in psychology/neuroscience.

The authors argue many studies are inefficient because researchers:

recruit too few participants AND do not optimize trial counts.

They show that smarter balancing of trials and participants can produce:

- higher power
- more reliable findings
- more efficient studies.

The biggest conceptual point

The paper demonstrates that in many neuroscience paradigms:

Within-subject noise is much larger than between-subject variability.

That means:

repeated measurements matter A LOT.

This was true across:

- reaction times
- psychophysics
- EEG
- MEG
- fMRI
- Iowa Gambling Task.

This is probably the single most important empirical conclusion.

Key findings across methods

1. More trials can substitute for more participants

Example:

you might maintain 80% power with:
- 20 participants + many trials, OR
- 50 participants + fewer trials.

The tradeoff depends on the paradigm.

2. There is usually a “sweet spot”

Power contours often have a “knee” shape:

- initially, adding trials helps a lot
- after some point, extra trials give diminishing returns.

So:

there’s often an optimal balance between:
- participant burden
- recruitment burden.

3. Different paradigms behave differently

Event-related fMRI

Needed many trials because single events are noisy.

Blocked fMRI

Reached asymptote quickly — extra trials helped less.

EEG/MEG

Strong dependence on trial count.

Psychophysics

Trials greatly improved threshold estimation precision.

Why the paper is influential

Before this paper, many researchers:

- chose trial numbers arbitrarily
- copied previous studies
- or used “rules of thumb.”

This paper provides:

- a formal framework
- visualization tools
- code
- practical guidance for study design.

Practical message for researchers

The practical recommendation is:

Don’t do power analysis using only participant number.

Instead:

estimate both:
- between-subject variance
- within-subject variance

then optimize BOTH:
- number of participants
- number of trials.

What the authors are NOT saying

They are NOT saying:

“just run tons of trials instead of participants.”

Important nuance:

- after enough trials, gains plateau.
- some paradigms still primarily benefit from larger N.
- participant variability still matters.

So the message is:

optimize both dimensions jointly.

One-sentence summary

This paper shows that statistical power in psychology and neuroscience is jointly determined by participant number and trial number, and that many studies could become substantially more powerful and efficient by optimizing both using “power contour” analyses rather than focusing only on sample size.

Disadvantages

Yes, your intuition is correct, and the paper itself acknowledges several limitations. The framework is very useful, but it is somewhat simplified and can easily be overinterpreted if used mechanically.

The biggest limitation is:

Power contours are only as good as the assumptions and pilot data used to generate them.

The method works best when:

the experiment is relatively stable, the measurement process is well understood, and within-/between-subject variance are reasonably estimable.

That is often not true in real neuroscience or psychology experiments.

Here are the major disadvantages and caveats.

1. It can oversimplify experimental reality

The framework compresses a complex experiment into basically:

effect size, within-subject variance, between-subject variance, trial count, sample size.

But many experiments have additional complexities:

- learning effects
- fatigue
- adaptation
- boredom
- strategy changes
- scanner drift
- nonstationarity
- stimulus heterogeneity
- temporal autocorrelation
- individual differences in trajectories

These violate the assumption that “more trials = cleaner estimate.”

For example:

- in a long fMRI task, later trials may actually become worse because participants disengage.
- in EEG, adding more trials may increase movement artifacts.
- in learning tasks, early and late trials are not interchangeable.

The paper briefly discusses this for the Iowa Gambling Task because behavior changes across time.

So:

More trials are not always independent, equivalent observations.

That’s a major practical limitation.

2. It assumes the pilot dataset generalizes

This is probably the biggest statistical weakness.

Power contours depend heavily on estimating:

- within-subject variance (w)
- between-subject variance (b)

But these can vary enormously across:

- labs
- scanners
- EEG systems
- preprocessing pipelines
- task versions
- participant populations
- experimenters

The paper explicitly warns against blindly generalizing these estimates.

A pilot dataset might dramatically underestimate noise.

That means:

your power contour could be overly optimistic, leading to underpowered replication attempts.

This is a classic problem in all power analysis, but here it becomes even more sensitive because you estimate two variance structures.

3. The framework mainly targets low-level repeated-measures paradigms

The approach works especially well for:

- psychophysics
- EEG averaging
- reaction times
- sensory neuroscience
- repeated event paradigms

Why?

Because these paradigms naturally involve:

many repeated measurements of the same process.

But in:

social psychology, developmental work, naturalistic cognition, clinical interviews, longitudinal designs, complex behavioral paradigms,

the concept of “trial repetition” may be much less meaningful.

Sometimes:

additional trials fundamentally change the psychological process itself.

4. It can encourage “cheap power”

This is an important conceptual criticism.

Power contours can make it seem like:

“If I can’t recruit more participants, I’ll just run tons more trials.”

But more trials do NOT solve:

poor population sampling, lack of generalizability, individual difference limitations, demographic biases, small-N external validity problems.

For example:

8 participants × 1000 trials is still only 8 nervous systems.

You may get:

high statistical precision, but poor population inference.

This is especially important in cognitive neuroscience.

5. It ignores many higher-level modeling issues

The paper mostly uses:

- t-tests
- ANOVAs
- relatively simple repeated-measures structures.

Modern neuroscience often uses:

mixed-effects models, hierarchical Bayesian models, multilevel covariance structures, cross-classified random effects, latent-variable models, representational similarity analyses, decoding models, permutation statistics.

In these settings:

“effective sample size” becomes much harder to define.

The authors do acknowledge this and say the approach can be extended via subsampling.

But the simplicity becomes less convincing in highly structured models.

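6. Trials are often not independent

This is huge.

The method implicitly benefits from averaging, assuming the noise in each participant's mean (within-participant standard deviation $\sigma_w$, $k$ trials) shrinks roughly like:

$$
\frac{\sigma_w}{\sqrt{k}}
$$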

But in real experiments:

adjacent trials are correlated, attention fluctuates, physiological state drifts, responses cluster.

So adding 100 correlated trials may not help nearly as much as adding 100 independent observations.
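
How much correlated trials are "worth" can be sketched with an effective trial count. The model here is an assumption chosen for illustration: trial noise follows a stationary first-order autoregressive pattern, so trials `lag` steps apart correlate as `rho ** lag`. The function name `effective_trials` is hypothetical.

```python
def effective_trials(k: int, rho: float) -> float:
    """Number of independent trials giving the same precision as k correlated ones.

    Assumes stationary noise whose lag-j autocorrelation is rho ** j.
    """
    # Variance of the mean is inflated by this factor relative to independent trials.
    inflation = 1 + 2 * sum((1 - lag / k) * rho**lag for lag in range(1, k))
    return k / inflation


for rho in (0.0, 0.2, 0.5, 0.8):
    print(f"rho = {rho:.1f}: 100 trials behave like ~{effective_trials(100, rho):.0f} independent trials")
```

Under this model, the within-participant term in a power calculation should use the effective count rather than the raw number of trials, which shifts the contours toward needing more participants.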

This especially matters for:

- fMRI
- learning tasks
- resting-state paradigms
- adaptive psychophysics

7. It focuses on statistical significance, not scientific usefulness

Another conceptual issue:

maximizing power ≠ maximizing scientific insight.

You can optimize a study to detect:

a tiny, highly reliable, but theoretically trivial effect.

Power contours say nothing about:

construct validity, interpretability, ecological validity, mechanistic insight, causal inference.

8. The “optimal point” depends on practical constraints

The paper discusses finding an “optimal” balance between N and trials.

But “optimal” depends on:

recruitment difficulty, scanner cost, participant fatigue, dropout, ethics, clinical population constraints, preprocessing time, data storage, analysis burden.

So there is no universal optimum.

A study on:

autism, rare lesions, intracranial EEG, 7T fMRI, infants,

will have entirely different tradeoffs.

Where power contours ARE genuinely useful

Despite the limitations, the paper is genuinely valuable because it corrected a widespread misconception:

That only participant number matters for power.

That misconception was especially common in neuroscience.

The paper’s strongest contribution is probably:

conceptual, not mathematical.

It reframed study design as:

a joint optimization problem.

That’s an important advance.

My overall interpretation

Power contours are best viewed as:

a design heuristic, an exploratory planning tool, a way to visualize tradeoffs,

NOT:

a definitive prescription for sample size.

For many neuroscience paradigms, especially repeated-measures designs, they are extremely useful.

But they become increasingly approximate when:

cognition changes across trials, data are hierarchical, participants are heterogeneous, trials are non-independent, or the experimental structure is complex.

Your skepticism is well justified.
