
Power analysis – A flexible simulation approach using R

In this post we discuss how to calculate the statistical power of an experiment using a Monte Carlo simulation in R.
We’ll see how this can help refine our choice of experimental design.

Below, we will introduce the concept of statistical power and use R to perform a power analysis through simulation.

I know what power analysis is! Just show me the code!


    Introduction – what is power?

    In science, we often want to assess the effect of a certain factor or set of factors on an outcome. For example, we could ask whether a drug lowers blood pressure or whether a new teaching method improves students’ scores. We can then design an experiment in which we give the drug to one group of patients and a placebo to another, and compare their mean blood pressure using a statistical method.

    In doing so, we make a major assumption: the effect we observe in our experiment is a truthful representation of what will happen to future patients who take the drug. But how do we know that the effect we saw was not just a fluke? In other words, how can we be confident that the result wasn’t simply due to random variation in our sample?

    Sampling error and sampling bias

    One reason an observed effect might be misleading is that our sample doesn’t reflect the population we care about. If our sample (e.g., the group of patients we give the drug to) consists only of young, healthy male students from a European university campus, it wouldn’t be surprising if the drug acted somewhat differently in, for example, elderly Japanese women with hypertension, who are actually more likely to take it! In this case, the sample is not a good representation of the population, an issue known as sampling bias. This is a serious problem rooted in how the study is designed, and it should be minimised as much as possible through careful planning. There are many causes of sampling bias, and many strategies to mitigate it, but these are beyond the scope of this post. Importantly, power analysis does not address sampling bias; it only addresses sampling error (see below).

    However, even when the sample is representative of the population, we can still observe effects that do not generalise. We could, entirely by chance, happen to randomly select patients who are very sensitive to the drug, and when we test the drug at a larger scale, we might find that the average effect is much smaller, or even non-existent. This is known as sampling error, and it’s a fundamental issue in statistics; importantly, we can never completely eliminate it, unless we measure the entire population (which is often impossible).

    The other important issue is that we are using statistical methods to estimate the effect. These methods always involve some uncertainty, meaning that even if there is a true effect in the population, our estimate from the sample might differ simply due to random variation. To quantify this, scientists use the concept of statistical power, which is defined as “the probability of correctly detecting a true effect when it exists”. High power means we are more likely to identify the effect, while low power increases the risk of missing it, even if the effect is real. For example, a power of 0.8 (or 80%) means that if a true effect exists in the population, our study design has an 80% chance of detecting it. Crucially, power is a property of the study design, not of the data after they are collected. It must therefore be considered before running the experiment.

    Factors influencing power

    The most obvious factor influencing power, and the one that most people focus on, is sample size. The larger the sample size, the more precise our estimates become, because random variation tends to cancel out, and we are more likely to detect true effects.

    So, is it just a matter of recruiting as many participants as possible? Not quite.

    While sample size is important, I would argue that often it is more effective to focus first on experimental design. A well-designed experiment that minimises variability and controls for confounding factors can greatly enhance power, even with a smaller sample size. Furthermore, in fields like biomedical sciences and medicine, increasing sample size can be costly, time-consuming, and sometimes ethically challenging.

    Let’s consider this example: I am interested in growing pepper plants and would like to measure how compound X affects the plants’ growth over time. I could either:

    1. Grow three groups of plants in standard soil, and three other groups with X, measuring the growth of group 1 at the start, group 2 after 2 weeks, and group 3 after 4 weeks.
    2. Grow a single group of plants in standard soil, and another group with the compound, and measure the growth of plants multiple times at 0, 2, and 4 weeks.

    The second design is better and likely to yield higher power, because it controls for variability between plants by measuring the same plants over time (a repeated-measures design). So, even if I randomly picked taller plants and shorter plants, the comparison will be less affected by this variability, because I would be looking at how much each plant grows relative to its own baseline (an internal control).

    In the first design, the plants I measure at week 4 are different plants from those I measure at week 2, so if by pure chance I happened to pick taller plants at week 2 and shorter plants at week 4 in the treatment group, I might incorrectly conclude that the compound stunts growth, when in reality it might have no effect or even promote growth.

    Importantly, we need to ensure that we use statistical methods appropriate for the design. In the example above, for design 1 we would probably use a linear model with time, treatment, and their interaction, to compare the means of the groups, while for design 2 we would use a linear mixed-effects model, which accounts for the correlation between measurements from the same plants, and is thus a more powerful approach.

    Crucially, we also need to think about effect size and variability. This is probably the hardest part in a power calculation, because we often need to estimate these parameters before we have any substantial data. The effect size is simply how big the effect we want to detect is (e.g., how much our drug lowers blood pressure on average). The bigger the effect size, the easier it would be to detect it, and thus the higher the power, given the same sample size.

    Variability, on the other hand, will have the opposite effect: the more variable our measurements, because of real biological variability, measurement error or any other source of noise, the harder it will be to detect an effect, and thus the lower the power.

    Power calculation

    As the name implies, a power calculation is a method to estimate the statistical power of a proposed study design, given certain assumptions about effect size, variability, and sample size. This can help researchers decide how many participants they need to recruit, or whether their proposed design is likely to yield meaningful results.

    A brief note on post-hoc power analysis. Power is sometimes computed after a study has been completed, using the observed effect size. This practice is generally discouraged, because post-hoc power is a direct function of the p-value and does not provide additional information beyond what the confidence interval already conveys. Power analysis is most useful before data collection, as a planning tool to guide study design and sample size.

    While there are analytical methods (i.e., mathematical formulas) to perform power calculations for simple designs, these methods often rely on strong assumptions and can be difficult to apply to more complex designs (e.g., mixed-effects models, non-normal data, etc.). The R package pwr provides several functions to perform power calculations for common statistical tests, for example pwr.t.test() for t-tests, pwr.anova.test() for ANOVA, and so on. In this post, we will not cover these analytical methods but rather use a simulation-based approach.
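    As a quick illustration of the analytical route, here is a sketch using base R’s power.t.test() (which works much like pwr.t.test() from pwr, without needing an extra package): we solve for the sample size needed to detect a standardised effect of 0.5 with 80% power.

```r
# Analytical power calculation with base R (stats::power.t.test).
# Solve for the per-group sample size needed to detect an effect of
# 0.5 standard deviations with 80% power at alpha = 0.05:
res <- power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)
res$n  # roughly 64 participants per group
```

    Leaving one of the arguments (here, n) unspecified tells power.t.test() which quantity to solve for; the same function can instead compute power for a given sample size.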

    A simulation-based approach involves generating synthetic data according to our proposed design, and then analysing the simulated data. By repeating this process many times (e.g., 1000 times; this use of repeated random sampling is known as a Monte Carlo simulation), we can estimate the proportion of simulations in which we correctly detect the effect. This proportion is our estimated power.
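    As a minimal sketch of this idea, using a simple two-group t-test rather than the plant experiment we will build below: we simulate many experiments in which a true effect exists, and count how often we detect it.

```r
set.seed(1)
n_sim <- 2000

# Simulate n_sim experiments in which the treatment truly shifts the
# outcome by one standard deviation, and collect the p-values
p_vals <- replicate(n_sim, {
    control   <- rnorm(15, mean = 0, sd = 1)
    treatment <- rnorm(15, mean = 1, sd = 1)
    t.test(control, treatment)$p.value
})

# Estimated power: the proportion of simulations detecting the effect
mean(p_vals < 0.05)
```

    With 15 plants (or patients) per group and an effect of one standard deviation, the estimated power comes out around 75%, matching the analytical answer from power.t.test(n = 15, delta = 1).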

    There are several advantages to a simulation-based approach:

    • It works for any type of experimental design, even those where it is difficult to define analytical formulas
    • It can incorporate realistic features of the study, like missing data, measurement error, or non-standard distributions.
    • It allows testing different scenarios to improve the design before collecting data.
    • The core of the simulation code can be reused for different scenarios with minimal changes.
    • Most importantly, it forces us to reflect carefully on our choices of experimental design, leading to better study planning.

    Of course, there are also some disadvantages to consider:

    • You need to have some (relatively basic) programming skills to implement the simulation. However, if you already use R (or another language) for data analysis, this should not be a big hurdle.
    • It can be computationally intensive, especially for complex designs or large datasets. In my experience, this has not been a major issue.
    • Results are dependent on assumptions used in the simulation; unrealistic assumptions lead to misleading power estimates. Of course, this is true for any power calculation method.

    Setting up a simulation for power calculation in R

    To demonstrate the power of a simulation-based approach, we will consider the moderately complex scenario described above, where we want to assess the effect of compound X on pepper plant growth over time. We will also compare the two experimental designs described earlier, to see how they affect power.

    General plan

    The general idea for the simulation is

    n_simulations = 1000
    n_detections = 0
    Repeat for sample size N in a range of interest:
         Repeat n_simulations times:
              Generate synthetic data with sample size N
              Run the statistical analysis
              If p-value < 0.05:
                 n_detections = n_detections + 1
         Power(N) = n_detections / n_simulations
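    Translated directly into R, this plan looks something like the sketch below, with a simple two-group t-test standing in for the data generation and analysis steps we develop later.

```r
set.seed(1)
n_simulations <- 500
alpha <- 0.05

for (N in c(5, 10, 20, 40)) {
    n_detections <- 0
    for (sim in 1:n_simulations) {
        # Stand-ins for "generate synthetic data" and "run the analysis":
        ctrl  <- rnorm(N, mean = 0, sd = 1)
        treat <- rnorm(N, mean = 1, sd = 1)
        if (t.test(ctrl, treat)$p.value < alpha) {
            n_detections <- n_detections + 1
        }
    }
    cat("N =", N, "-> estimated power:", n_detections / n_simulations, "\n")
}
```

    Note that n_detections is reset at the start of each sample size, so each power estimate is computed independently.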

    Defining simulation parameters

    In order to generate relevant data, we need to define several parameters and assumptions for our simulation. We start by asking what effect size we want to detect. This is the key question to answer when doing a power calculation, and it requires domain knowledge: it can be based on previous studies, pilot data, expert opinion or, less ideally, an arbitrary choice. For this example, we will all be plant growing experts (right!), so we will use our knowledge to make some assumptions.

    • At time zero, the height of all plants is normally distributed with a mean of 20 cm and a standard deviation of 5 cm.
    • A plant on its own (without X) grows on average 8 cm per week
    • Other compounds similar to X have been shown to increase growth by an average of 12 cm over 4 weeks compared to control plants. So, we want to be able to detect an effect size of at minimum 12 cm over 4 weeks, hoping that X has a larger effect. As plant experts we might know that this is a biologically meaningful increase which might, for example, result in higher yields.
    • The growth of plants is linear over time, so 3 cm per week (that is, 12 cm over 4 weeks). This is probably not the case in reality, but we can easily update our simulation to use a different growth curve later on.
    • We also assume some variability in plant growth. Let’s assume that the standard deviation of growth rates is 3 cm per week. For simplicity, we assume this is the same for plants treated with X and controls.

    Repeating steps

    In the steps above we need to repeat certain steps several times. We can do this using a for loop in R. A for loop allows us to execute a block of code multiple times, iterating over a sequence of values. For instance

    for (i in 1:5) {
        print(i)
    }

    This means that we assign the value 1 to the variable i, execute the code inside the curly braces (which in this trivial case just prints the value of i), then assign the value 2 to i, execute the code again, and so on, until we reach 5. The output of this code would be:

    [1] 1
    [1] 2
    [1] 3
    [1] 4
    [1] 5

    Putting it all together

    We’re now ready to put it all together and implement the simulation in R! You can download the full script here; below, I will guide you through the main parts.

    We start by defining the parameters

    # Power analysis parameters
    num_simulations <- 1000
    alpha <- 0.05 # significance level
    sample_sizes <- c(3, 5, 10, 15, 20, 30, 50) # sample sizes to test
    
    # Experimental design parameters
    measurement_weeks <- c(0, 2, 4)
    mean_initial_height <- 20
    sd_initial_height <- 5
    base_growth_rate <- 8 # cm/week
    sd_growth_rate <- 3
    X_growth_rate <- 3 # additional cm/week due to compound X

    Next, we create two functions to simulate data for the two designs. Creating functions makes our code more modular, easier to read and allows us to make changes more easily later on.

    Note: I have tried to keep the code simple, to help understand what is going on. In some places, this sometimes comes at the expense of speed and efficiency; while there are ways of writing this code that would make it faster, they are beyond the scope of this post.

    Because we assume normal distributions for plant heights and growth rates, we can use the rnorm() function in R to generate random numbers from a normal distribution with specified mean and standard deviation.
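    For example, drawing baseline heights under our assumptions (mean 20 cm, SD 5 cm) looks like this:

```r
set.seed(42)  # fix the random seed so the draw is reproducible
heights <- rnorm(1000, mean = 20, sd = 5)
mean(heights)  # close to 20
sd(heights)    # close to 5
```

    With a large enough draw, the sample mean and standard deviation closely match the parameters we specified.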

    Our functions will return a data frame containing plant ID (which we will need for design 2 only), time point in weeks, treatment group, and measured height.

    To simulate design 1 fairly, we first generate a population of plants with differences in baseline height and growth rate, and then sample new plants at each time point. Although these differences are never observed directly over time in design 1, they still contribute to variability and therefore affect power.

    generate_population <- function(N) {
        # Generate a population of plants with initial heights and growth rates
        # This incorporates natural variability among plants with the parameters we defined
        data.frame(
            PlantID = 1:N,
            InitialHeight = rnorm(N, mean_initial_height, sd_initial_height),
            GrowthRate = rnorm(N, base_growth_rate, sd_growth_rate)
        )
    }
    
    simulate_design1 <- function(n) {
        # Generate a large population of plants
        population <- generate_population(10000)
        # Design 1: Separate groups measured at different times
        
        data <- data.frame(ID = integer(),
                           Height = numeric(),
                           Time = integer(),
                           Treatment = factor())
    
        for (week in measurement_weeks) {
            # Sample n plants for control and treatment groups at each time point
            ctrl_plants <- population[sample(1:nrow(population), n), ]
            treat_plants <- population[sample(1:nrow(population), n), ]
    
            ctrl_heights <- ctrl_plants$InitialHeight + ctrl_plants$GrowthRate * week
            treat_heights <- treat_plants$InitialHeight + 
                (treat_plants$GrowthRate + X_growth_rate) * week
    
            data <- rbind(data,
                          data.frame(
                              # All IDs must be unique across time points,
                              # because we are sampling new plants each time
                              ID = 1:(n * 2) + nrow(data),
                              Height = c(ctrl_heights, treat_heights),
                              Time = rep(week, n * 2),
                              Treatment = factor(rep(c("Control", "X"), each = n))
                          ))
        }
    
        return(data)
    }

    The second function simulates data for design 2, in a very similar way. The main difference is that we need to account for repeated measurements on the same plants over time, so we assign unique IDs to each plant and ensure that their growth is correlated across time points.

    simulate_design2 <- function(n) {
        # Generate a large population of plants
        population <- generate_population(10000)
    
        # We sample plants once here because this is a repeated measures design    
        ctrl_plants  <- population[sample(1:nrow(population), n), ]
        treat_plants <- population[sample(1:nrow(population), n), ]
    
        data <- data.frame(ID = integer(),
                           Height = numeric(),
                           Time = integer(),
                           Treatment = factor())
    
        for (week in measurement_weeks) {
    
            ctrl_heights <- ctrl_plants$InitialHeight +
                ctrl_plants$GrowthRate * week
    
            treat_heights <- treat_plants$InitialHeight +
                (treat_plants$GrowthRate + X_growth_rate) * week
    
            data <- rbind(data,
                          data.frame(
                              # Same IDs reused across time → repeated measures
                              ID = c(ctrl_plants$PlantID, treat_plants$PlantID),
                              Height = c(ctrl_heights, treat_heights),
                              Time = rep(week, n * 2),
                              Treatment = factor(rep(c("Control", "X"), each = n))
                          ))
        }
    
        return(data)
    }

    Let’s check that everything is working as expected by calling these functions with, for example, 10 plants per group, and then plotting the results.

    simulated_data1 <- simulate_design1(10)
    simulated_data2 <- simulate_design2(10)
    
    library(ggplot2)
    
    # Design 1: boxplots per time point; Time is converted to a factor
    # so that geom_boxplot groups the observations correctly
    ggplot(simulated_data1,
        aes(x = factor(Time), y = Height, fill = Treatment)) +
        geom_boxplot() +
        geom_point(position = position_jitterdodge()) +
        ggtitle("Design 1") +
        xlab("Time (weeks)") +
        ylab("Plant Height (cm)") +
        ylim(0, 75) +
        theme_minimal()
    
    # Design 2: one line per plant, grouped by ID
    ggplot(simulated_data2,
        aes(x = Time, y = Height, group = ID)) +
        geom_line(linewidth = 0.4) +
        geom_point() +
        ggtitle("Design 2") +
        xlab("Time (weeks)") +
        ylab("Plant Height (cm)") +
        facet_wrap(~Treatment) +
        ylim(0, 75) +
        theme_minimal()
    Example of simulated data from design 1. We have 3 control groups and 3 groups treated with X. We clearly see the linear increase in growth over time, and the higher growth in the treatment groups.
    Example of simulated data from design 2. As above, we can see the linear increase in growth, but in this case the same plant is measured multiple times.

    Analyzing simulated data

    Now that we can simulate data, we need to analyse it to see whether we can detect the effect of compound X.

    For design 1, this is done using a linear model with Treatment, Time, and their interaction as predictors of plant height. In this case we will mostly be interested in the Treatment × Time interaction, which tests whether plant growth over time differs between control and treated plants. A significant interaction indicates that compound X changes the growth trajectory, rather than simply shifting heights by a constant amount.

    The p-value for this interaction can be extracted from the ANOVA summary, in the Pr(>F) column. If this doesn’t make sense, try print(summary(model)) after fitting the model to see what the output looks like!

    analyze_design1 <- function(sim_data) {
        model <- aov(Height ~ Treatment * Time, data = sim_data)
        anova_table <- summary(model)[[1]]
        # Note: row names in summary.aov tables are padded with spaces,
        # so we trim them before matching the interaction term
        p_value <- anova_table[trimws(rownames(anova_table)) == "Treatment:Time",
                               "Pr(>F)"]
        return(p_value)
    }

    For design 2, we use a linear mixed-effects model to account for the repeated measurements on the same plants, by including plant ID as a random effect. This explicitly models the correlation between observations taken from the same plant over time.

    As for design 1, the term of interest is the Treatment × Time interaction, which tests whether the growth trajectory differs between treated and control plants. The corresponding p-value can be extracted from the model summary. Here we use the nlme package to fit the model, although other packages (e.g. lme4) would work as well.

    For simplicity, we include only a random intercept, even though plants also differ in growth rate. Allowing random slopes would be more realistic but adds complexity without changing the main conclusion.
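    For reference, allowing a per-plant random slope for time would only require changing the random term. Here is a sketch on toy repeated-measures data (the variable names mirror those used in this post, but the data frame is built inline so the snippet is self-contained):

```r
library(nlme)
set.seed(123)

# Toy repeated-measures data: 2 groups of 15 plants, 3 time points each
n    <- 15
ids  <- rep(1:(2 * n), each = 3)
time <- rep(c(0, 2, 4), times = 2 * n)
trt  <- rep(c("Control", "X"), each = 3 * n)
base <- rep(rnorm(2 * n, 20, 5), each = 3)                      # per-plant baseline
rate <- rep(rnorm(2 * n, 8, 3), each = 3) +                     # per-plant growth rate
    ifelse(trt == "X", 3, 0)                                    # treatment effect

toy <- data.frame(ID = ids, Time = time,
                  Treatment = factor(trt),
                  Height = base + rate * time + rnorm(6 * n, 0, 1))

# Random intercept AND random slope for Time, per plant
model_slope <- lme(Height ~ Treatment * Time, random = ~ Time | ID, data = toy)
summary(model_slope)$tTable["TreatmentX:Time", "p-value"]
```

    The random intercept captures differences in baseline height, while the random slope captures differences in growth rate between plants.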

    library(nlme)
    
    analyze_design2 <- function(data) {
        model <- lme(Height ~ Treatment * Time, random = ~ 1 | ID, data = data)
        summary_model <- summary(model)
        # p-value for the interaction term
        p_value <- summary_model$tTable["TreatmentX:Time", "p-value"]   
        
        return(p_value)
    }

    The image below shows six samples of simulated data from design 2, with 5 plants per group. The lines represent the mean growth over time for each treatment group, and the shaded areas represent the standard deviation. The sixth panel shows an example where, due to random variation, we do not detect a significant effect of treatment. It is important to remember that in all simulations, there is a true effect of compound X; the fact that we sometimes fail to detect it is due to sampling error and variability in the data.

    Example of six different samples for our experiment. In five of them a significant difference could be found. Shaded areas show standard deviation

    Running the power simulation

    Finally, we can put everything together to run the power simulation. We will loop over a range of sample sizes, and for each sample size, we will run multiple simulations, generating data, analysing it, and checking whether we detect a significant effect (p < 0.05). We will then calculate the power as the proportion of simulations where we detected the effect. We will store the results in a data frame for easy plotting later.

    This bit of the code can be a bit slow: for loops are relatively slow in R, but they are easy to write and read. In practice, you might want to look into vectorised operations or parallel processing to speed things up, but that is beyond the scope of this post.

    power_results <- data.frame(
        SampleSize = integer(),
        Power_Design1 = numeric(),
        Power_Design2 = numeric()
    )
    
    for (n in sample_sizes) {
        print(paste("Simulating for sample size:", n))
        detections_design1 <- 0
        detections_design2 <- 0
    
        for (sim in 1:num_simulations) {
            # Design 1
            data1 <- simulate_design1(n)
            p_value1 <- analyze_design1(data1)
            if (p_value1 < alpha) {
                detections_design1 <- detections_design1 + 1
            }
    
            # Design 2
            data2 <- simulate_design2(n)
            p_value2 <- analyze_design2(data2)
            if (p_value2 < alpha) {
                detections_design2 <- detections_design2 + 1
            }
        }
    
        power_design1 <- detections_design1 / num_simulations
        power_design2 <- detections_design2 / num_simulations
    
        power_results <- rbind(power_results, data.frame(
            SampleSize = n,
            Power_Design1 = power_design1,
            Power_Design2 = power_design2
        ))
    }

    We can plot the results to visualise how power changes with sample size for the two designs.

    library(ggplot2)
    
    ggplot(power_results, aes(x = SampleSize)) +
        geom_line(aes(y = Power_Design1 * 100, color = "Design 1"), linewidth = 1) +
        geom_point(aes(y = Power_Design1 * 100, color = "Design 1"), size = 2) +
        geom_line(aes(y = Power_Design2 * 100, color = "Design 2"), linewidth = 1) +
        geom_point(aes(y = Power_Design2 * 100, color = "Design 2"), size = 2) +
        scale_color_manual(values = c("Design 1" = "blue", "Design 2" = "red")) +
        geom_hline(yintercept = 80, linetype = "dashed", color = "black") +
        labs(x = "Sample Size per Group",
             y = "Power (%)",
             color = "Design") +
        ylim(0, 100) +
        theme_minimal()
    A plot showing the result of the power analysis (power against sample size) in the two designs. Design 1 is underpowered throughout the range of sample sizes.
    The plot shows the estimated power for both experimental designs as a function of sample size per group. The dashed line indicates an 80% power threshold (a commonly used level).
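    Once the simulation has run, reading the minimum sample size off the power_results data frame is straightforward. The numbers below are made up for the sketch, approximating what the plot shows, since the exact values will vary from run to run:

```r
# Illustrative power_results (values approximate the plot, not real output)
power_results <- data.frame(
    SampleSize    = c(3, 5, 10, 15, 20, 30, 50),
    Power_Design1 = c(0.08, 0.15, 0.45, 0.66, 0.81, 0.93, 0.99),
    Power_Design2 = c(0.30, 0.55, 0.82, 0.93, 0.98, 1.00, 1.00)
)

# Smallest tested sample size reaching the 80% threshold for each design
min(power_results$SampleSize[power_results$Power_Design1 >= 0.8])  # 20
min(power_results$SampleSize[power_results$Power_Design2 >= 0.8])  # 10
```

    Note that this only finds the smallest sample size among those tested; for a finer answer you would rerun the simulation with a denser grid of sample sizes around the threshold.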

    Conclusions and reflections

    We can clearly see that design 2 (the repeated-measures design) achieves higher power for the same sample size compared to design 1. Indeed, to reach 80% power, design 1 requires approximately 20 plants per group, while design 2 achieves the same power with only about 10 plants per group, making it more efficient and less costly. If this experiment involved significant costs or ethical considerations (e.g., animal studies), this difference could be very important.

    This example illustrates the flexibility and power of a simulation-based approach to power analysis. By simulating data according to our proposed design and analysing it, we can estimate the power of our study under various scenarios. It would be straightforward to extend this simulation to include other factors, such as missing data, measurement error, other confounding variables, etc.

    Importantly, we need to always remember that practical constraints often limit our ability to achieve the desired power, or to implement the ideal experimental design. For example, if we were using a destructive measurement method that prevents repeated measurements on the same plants, we would be forced to use design 1, despite its lower power. In such cases, it is crucial to be aware of the limitations of our study and to interpret the results accordingly. In other situations, we might need to balance power with other considerations, such as cost, time, or ethical concerns; for example repeated blood sampling in human or animal studies might not be feasible or ethical.

    Finally, it is important to remember that power estimates are conditional on the assumed effect size and variability; if these are wrong, the power estimate will be wrong as well.

    I hope this post has provided a useful introduction to power analysis using simulation in R. If you have any questions or comments, feel free to reach out!

    CC-BY-SA 4.0 This work is licensed under a CC-BY-SA 4.0 licence.
    DOI: 10.59350/fw8zx-wm472
