
# A gentle introduction to mixed-effects models

Pseudoreplication is a major issue in biomedical sciences. This post will show you how to avoid it and properly analyse your data in R!

In a previous post, I discussed how to use linear regression to analyse whether one or more variables influence an outcome. Linear regression is great, but it is not always the best solution; one of the extremely important assumptions of linear regression is that the observations are independent. However, there are a lot of situations when that does not apply. This post will explore how to use mixed-effects models in R to deal with this!

## Experimental designs giving non-independent measures

Not interested in the theory? Jump directly to the example!

Two of the main types of experimental design where you end up with non-independent observations are repeated measures and nested designs.
In a repeated-measures design, you take measurements from the same experimental unit multiple times; a typical example is taking multiple blood samples from the same patient over the course of several visits. Because the measures always come from the same individual, they are not independent, and they are likely to be more correlated with each other than with measures from a different individual.

In a nested design, you are taking measurements that have some hierarchical or nested structure. For example, you may take biopsies from various patients at different clinics, and then analyse multiple cells in each biopsy. The cells coming from the same biopsy come from the same piece of tissue, thus they are not independent and may show a higher degree of correlation than cells from different biopsies. Furthermore, patients at the same clinic may also show some higher degree of correlation, because they are treated by the same medical staff, using the same procedures, and perhaps because they live in the same area. So we have a nesting of the type clinic > patient/biopsy > cells.

Figure: two examples of non-independent samples. On the left, a repeated-measures design: we have 4 patients and take a blood sample every hour for three hours. Although we have 12 samples, we only have 4 independent experimental units. On the right, a nested design: we have 5 patients at each of 3 clinics and take 3 samples per patient. Again, although we have 45 samples, they come from only 3 independent clinics and 5 independent patients per clinic.

## Why should I bother (replication and pseudoreplication)?

So… what’s the big deal with having non-independent samples? We will see some examples below, but the issue is that we risk creating models that do not correctly represent the real variability in our sample. For instance, think about the situation on the left in the figure above. If we model those as 12 independent samples, we are assuming that there are 11 degrees of freedom in our system, which is not true, as each measurement at time 2 will be somehow related to the respective measurement at time 1.

Replication is an essential part of experimental design. We perform a measurement multiple times, on multiple experimental units, so that the estimates we compute from our sample approximate the population parameters more precisely.

By incorrectly inflating the degrees of freedom in our model, we generate inaccurate statistical models that can misrepresent reality. This process of (maliciously or not) inflating the degrees of freedom is called pseudoreplication, and it is a well-known issue in biological and other sciences (see for example Lazic 2010 or Lazic et al. 2018). By reporting having measured 12 independent samples, instead of 4 independent groups of 3 non-independent samples each, and analysing your data accordingly, you are falling into the pseudoreplication trap… and bad things can happen there!

## A simple example

Let’s say we measure the blood level of substance X, over the course of 3 time points in 10 people. We want to know whether time has an effect on the levels of X. I created some example data to use for this post.
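The original dataset is not included here, but for readers who want to follow along, here is one way to simulate data with the same structure. The patient baseline levels and noise magnitudes are assumptions, chosen to roughly resemble the model outputs shown later in the post:

```r
# Hypothetical simulation of the example data: 10 patients (A-J), each
# measured at 3 time points. The means and SDs below are assumptions.
set.seed(42)
data <- data.frame(
  Patient = factor(rep(LETTERS[1:10], each = 3)),
  Time    = rep(1:3, times = 10)
)

# Each patient has their own baseline level of X (between-patient SD ~5),
# plus independent measurement noise (within-patient SD ~2)
baseline <- rnorm(10, mean = 30, sd = 5)
data$X <- baseline[data$Patient] + rnorm(nrow(data), mean = 0, sd = 2)
```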

Let’s have a quick look at the data

```r
ggplot(data, aes(x = Time, y = X)) +
  geom_line(aes(col = Patient))
```

This plot tells us that time has no clear effect on the level of X, and that there is quite a bit of variability amongst patients in their X levels. This is a good example of repeated measures and one that will allow us to see the issues with pseudoreplication. So, let’s start to analyse these data using R!

We start by creating a simple linear model to analyse our data such as:

$$X = \beta_0 + \beta_1 * Time + \epsilon$$

```r
model <- lm(X ~ Time, data)
summary(model)
```

```
Call:
lm(formula = X ~ Time, data = data)

Residuals:
     Min       1Q   Median       3Q      Max
-10.2700  -2.5912   0.9825   3.7788  10.7300

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.980      2.879  10.413 3.91e-11 ***
Time          -0.205      1.333  -0.154    0.879
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.96 on 28 degrees of freedom
Multiple R-squared:  0.0008443,	Adjusted R-squared:  -0.03484
F-statistic: 0.02366 on 1 and 28 DF,  p-value: 0.8789
```

The summary of the model tells us that Time does not influence the levels of X. However, you can see that the model represents a system with 28 degrees of freedom, which we know is wrong! Indeed, what is happening is that we are treating all the measurements as independent.

We can see this if we plot the residuals by patient.
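The original figure is not reproduced here, but a plot along these lines shows the pattern (a sketch, assuming the model and data objects defined above):

```r
# Residuals of the simple linear model, grouped by patient. If the
# observations were truly independent, each patient's residuals would be
# centred around zero; instead, whole patients sit above or below zero.
data$resid <- residuals(model)
ggplot(data, aes(x = Patient, y = resid, col = Patient)) +
  geom_boxplot() +
  geom_hline(yintercept = 0, linetype = "dashed")
```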

In the definition of our linear model we assumed $\epsilon \sim N(0, \sigma_{\epsilon}^{2})$. While this holds when looking at all the residuals together (the orange bar in the figure), it is clearly not true for the residuals of the single individuals. What is happening is that we are artificially inflating the sample variability by pushing the “patient effect” into the residuals.

Let’s see if there is a better way!

## The useless way

Since the problem lies in the individual patients, we may be tempted to use Patient as a factor in the model.

```r
model2 <- lm(X ~ Time + Patient, data)
summary(model2)
```

Indeed, this solves the problem: the patient effect is no longer in the residuals. However, the model summary raises a different issue.

```
Call:
lm(formula = X ~ Time + Patient, data = data)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4383 -1.0083  0.1025  0.9967  3.3617

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  36.9433     1.6598  22.258 4.52e-15 ***
Time         -0.2050     0.5082  -0.403 0.691176
PatientB     -5.9333     1.8557  -3.197 0.004743 **
PatientC      1.4000     1.8557   0.754 0.459845
PatientD    -16.1000     1.8557  -8.676 4.92e-08 ***
PatientE     -8.4000     1.8557  -4.527 0.000231 ***
PatientF     -6.5333     1.8557  -3.521 0.002286 **
PatientG     -8.8333     1.8557  -4.760 0.000136 ***
PatientH    -15.8667     1.8557  -8.550 6.16e-08 ***
PatientI     -4.2000     1.8557  -2.263 0.035521 *
PatientJ     -5.1667     1.8557  -2.784 0.011824 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.273 on 19 degrees of freedom
Multiple R-squared:  0.9014,	Adjusted R-squared:  0.8495
F-statistic: 17.37 on 10 and 19 DF,  p-value: 1.615e-07
```

Indeed, the residual standard error is about a third of what it was before; however, what is happening here is that we are creating a lot of dummy variables to represent the patients. So our model is something like:

$$X = \beta_0 + \beta_1 * Time + \beta_2 * PB + \beta_3 * PC + \ldots + \epsilon$$

Where

$$PB = \begin{cases} 1 & \text{if Patient} = B\\ 0 & \text{otherwise}\end{cases}$$

$$PC = \begin{cases} 1 & \text{if Patient} = C\\ 0 & \text{otherwise}\end{cases}$$

And so on…

So this model is not very useful: it only models those 10 patients, and cannot extrapolate to a patient 11 unless we estimate a new coefficient for them, which would require measuring X in that patient. Essentially, we have created a model that tells us what we already know and cannot do anything else.

## The proper way, using mixed-effects models!

So, we finally get to do things the proper way, using mixed-effects models!

The idea is that we don’t use the patient as a predictor (in mixed-effects lingo, as a fixed effect), but rather as a random effect: a random variable, independent of the residuals, which influences our outcome but whose individual values we are not interested in estimating. In other terms:

$$X_{ij} = \beta_0 + \beta_1 * Time_j + b_i + \epsilon_{ij}$$

Where we have 10 random effects $b_i \sim N(0, \sigma_b^{2})$, one per patient, each shared by the 3 measurements from that patient, plus 30 residuals $\epsilon_{ij} \sim N(0, \sigma_{\epsilon}^{2})$, one per observation.

Unfortunately, lm cannot handle mixed-effects models, so we need to use the lme function from the nlme package.

```r
library(nlme)

model3 <- lme(X ~ Time, random = ~ 1 | Patient, data)
```

The syntax of lme is essentially the same as lm, plus a random term, which I will explain below.

```r
summary(model3)
```

```
Linear mixed-effects model fit by REML
 Data: data
       AIC      BIC    logLik
  166.4679 171.7967 -79.23394

Random effects:
 Formula: ~1 | Patient
        (Intercept) Residual
StdDev:    5.611054 2.272793

Fixed effects: X ~ Time
             Value Std.Error DF   t-value p-value
(Intercept) 29.980  2.086551 19 14.368206  0.0000
Time        -0.205  0.508212 19 -0.403375  0.6912
 Correlation:
     (Intr)
Time -0.487

Standardized Within-Group Residuals:
        Min          Q1         Med          Q3         Max
-1.47025716 -0.48710670 -0.02399607  0.31374056  1.54213668

Number of Observations: 30
Number of Groups: 10
```

R has correctly identified that we have 30 observations but only 10 experimental units! The rest of the summary is pretty much the same as for lm; however, instead of getting $R^2$ as a goodness-of-fit measure, we get three measures: AIC, BIC and the log-likelihood. Briefly, lower values of AIC or BIC, or a higher log-likelihood, indicate a better fit.

## Random intercepts and random slopes

So, what exactly is happening here? Let’s look at the estimated coefficients

```r
coef(model3)
```

```
  (Intercept)   Time
A    36.58225 -0.205
B    30.95659 -0.205
C    37.90966 -0.205
D    21.31711 -0.205
E    28.61783 -0.205
F    30.38770 -0.205
G    28.20697 -0.205
H    21.53834 -0.205
I    32.60004 -0.205
J    31.68350 -0.205
```

R is estimating a different intercept for each patient, whilst keeping the same slope for Time. We have created a random-intercept mixed-effects model.

Why does it only change the intercept? That is because we defined the random effects as ~1 | Patient. This tells R we want to estimate a different intercept (indicated by 1) for each patient. We can also estimate different slopes for the different fixed effects in our model, by specifying them in the random effects formula. For instance

```r
model4 <- lme(X ~ Time, random = ~ Time | Patient, data)
coef(model4)
```

```
  (Intercept)        Time
A    34.98618  0.66094525
B    32.97207 -1.20892331
C    35.51697  1.07422047
D    21.09373 -0.17600178
E    26.48515  0.85413897
F    33.41293 -1.72205606
G    28.04105 -0.13863582
H    20.51766  0.22698304
I    35.62665 -1.70147529
J    31.14762  0.08080454
```

Similarly, if we wanted a random slope model, where we change only the slope and not the intercept we would use ~Time-1 | Patient.
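As a quick sketch, such a random-slope-only model would look like this (model5 is a hypothetical name, not used elsewhere in the post):

```r
# Random slope only: all patients share the common intercept, but each
# gets their own slope for Time (the -1 drops the per-patient intercept)
model5 <- lme(X ~ Time, random = ~ Time - 1 | Patient, data)
coef(model5)
```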

Finally, let’s plot the model’s predictions in the three situations.

```r
ggplot(data, aes(Time, X)) +
  geom_point(aes(col = Patient)) +
  geom_line(aes(x = Time, y = predict(model3), col = Patient))
```

Figure captions: Random intercept model: each patient has their own intercept; the slope for time is always the same. Random slope model (not incredibly appropriate here…): all patients share the same intercept; the slope for time varies.

So, which one should I choose? Random intercept, random slope, or both? It depends on your particular problem, and this is where domain-specific knowledge comes into play. Which model makes more sense in biological terms?

Mixed-effects models are a much bigger topic than this, and there is much more you can do with them. Two good reference books are “Mixed-Effects Models in S and S-PLUS” by José Pinheiro and Douglas Bates, and “Extending the Linear Model with R” by Julian Faraway.

And that is all for today, hope you enjoyed this post!
