# Create awesome plots with ggplot!

This post will introduce you to ggplot, an R package which makes it super-easy to create visually pleasing plots with just a few lines of code!

## Introduction to ggplot

This post will give you a quick introduction to a great way of plotting using R: the ggplot2 package (which for simplicity I will call ggplot from now on). Behind it lies a complex philosophy of visualisation, created by Hadley Wickham in 2005 and based on the theory developed by the statistician and computer scientist Leland Wilkinson in his 1999 book “The Grammar of Graphics”.

If you have never used ggplot, you will first have to install it. This needs to be done only once, using

install.packages("ggplot2")

Once that is done (it may take a little while), every time you want to use it, you can load the package using

library(ggplot2)

## Aesthetics and geometries

The philosophy behind ggplot is that each plot is made out of layers that you can manipulate individually. The main function you are going to use for generating plots is ggplot.
But, first of all, let’s load up some data! I am going to use a hypothetical dataset containing the blood levels of a metabolite in a group of patients.

metab <- read.csv("metab.csv", stringsAsFactors = TRUE) # since R 4.0.0 strings are no longer converted to factors by default
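If you do not have metab.csv to hand, you can still follow along with a simulated stand-in. This is only a sketch: the column names and group sizes mirror the summary printed in this post, but every value is invented, and the relationship between Age and Concentration is an assumption made purely so that the plots have something to show.

```r
# Simulated stand-in for metab.csv: the values are invented, not the real data
set.seed(42)
n <- 429

metab <- data.frame(
  Sex       = factor(sample(rep(c("F", "M"), times = c(215, 214)))),
  Age       = sample(35:80, n, replace = TRUE),
  Treatment = factor(sample(rep(c("A", "B", "CTRL"), times = c(150, 80, 199))))
)

# Assumed relationship, chosen only to give plausible-looking numbers
metab$Concentration <- round(100 + 5 * metab$Age + rnorm(n, sd = 60), 1)

# Put the columns in the same order as the original file
metab <- metab[, c("Concentration", "Sex", "Age", "Treatment")]
```

Plots made from this stand-in will not match the figures in the post, but all the code that follows should run on it unchanged.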

Let’s start by having a quick look at the data, using the head and summary functions.

head(metab)
 Concentration Sex Age Treatment
1         550.7   F  79      CTRL
2         260.7   M  41         A
3         450.8   F  64      CTRL
4         324.4   F  52      CTRL
5         228.7   M  43         B
6         325.1   M  53         A
summary(metab)
 Concentration   Sex          Age        Treatment
Min.   :118.1   F:215   Min.   :35.00   A   :150
1st Qu.:287.3   M:214   1st Qu.:45.00   B   : 80
Median :373.8           Median :58.00   CTRL:199
Mean   :386.8           Mean   :56.92
3rd Qu.:466.8           3rd Qu.:68.00
Max.   :862.5           Max.   :80.00  

We can now pass the dataset, and define the aesthetics that map the data to visual aspects of the plot.

ggplot(data = metab, aes(x = Treatment,
y = Concentration))

But... wait a moment, there is nothing on the plot!
That is because we did not specify what type of plot we want. Let’s try again, this time asking for a boxplot.

This is done by using geometries, which are generated through the geom_... functions. In our case, we are going to use geom_boxplot. Because we want to add a new layer to our plot, we use + to add the boxplot. Easy, isn’t it?

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot()
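Because layers are added with +, a plot can also be stored in a variable and built up step by step. The sketch below illustrates this on a small invented data frame (demo is a hypothetical stand-in for metab); inspecting the layers element is just a convenient way to see what has been added so far.

```r
library(ggplot2)

# Small invented data frame standing in for metab
set.seed(1)
demo <- data.frame(
  Treatment     = rep(c("A", "B", "CTRL"), each = 20),
  Concentration = rnorm(60, mean = 380, sd = 100)
)

# Store the empty canvas: data and aesthetics, but no geometry yet
p <- ggplot(data = demo, aes(x = Treatment, y = Concentration))

# Add a layer later; p itself is left unchanged
p_box <- p + geom_boxplot()

length(p$layers)      # number of layers in the empty plot
length(p_box$layers)  # number of layers after adding the boxplot
```

Typing p_box at the console (or calling print(p_box)) draws the plot.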

If you wanted to plot a histogram instead, you could use geom_histogram:

ggplot(data = metab, aes(x = Concentration)) +
geom_histogram(binwidth = 20) # alternatively, use bins to set the number of bins
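The two ways of controlling the bins can be sketched side by side. The data here are invented, and the choice of 15 bins is arbitrary.

```r
library(ggplot2)

# Invented concentrations standing in for metab$Concentration
set.seed(1)
demo <- data.frame(Concentration = rnorm(200, mean = 380, sd = 100))

# Fix the width of each bin, in the units of the x axis
h_width <- ggplot(demo, aes(x = Concentration)) +
  geom_histogram(binwidth = 20)

# Or fix the number of bins and let ggplot work out the width
h_bins <- ggplot(demo, aes(x = Concentration)) +
  geom_histogram(bins = 15)
```

Setting binwidth is usually easier to interpret, because the bins then correspond to a fixed range of the measured variable.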

But let’s say we want something more complicated, for instance adding some points over the boxplot. How do we go about it?
Very simple: we just add another layer using geom_point.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
# Avoid plotting outliers on the boxplot,
# since we are adding points on top
geom_boxplot(outlier.shape = NA) +
geom_point()

The plot above is not very readable, because points overlap too much. We can use geom_jitter to jitter the points and improve it. We also make the dots smaller using the size argument.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = .1, size = 0.5)
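One thing to keep in mind is that jittering is random: by default geom_jitter also adds a small amount of vertical noise, and the points land in slightly different places each time the plot is drawn. A possible way to keep the measured values untouched and the plot reproducible is sketched below, using position_jitter with height = 0 and a fixed seed. The data frame and the seed value are made up for illustration.

```r
library(ggplot2)

# Small invented data frame standing in for metab
set.seed(1)
demo <- data.frame(
  Treatment     = rep(c("A", "B", "CTRL"), each = 20),
  Concentration = rnorm(60, mean = 380, sd = 100)
)

p_jit <- ggplot(demo, aes(x = Treatment, y = Concentration)) +
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    # Jitter horizontally only, and always in the same way
    position = position_jitter(width = 0.1, height = 0, seed = 123),
    size = 0.5
  )
```

With height = 0 the points only move sideways, so their vertical position still shows the exact concentration.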

## Mapping variables to aesthetics

This graphic is much improved; however, it does not tell us which data points come from men and which from women. We can add a further aesthetic in geom_jitter to map Sex to the colour of the points.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, aes(col = Sex), size = 0.5)

Note that ggplot automatically treats Sex as a discrete variable (a factor) and colours it using a discrete scale. If we were to colour the points by a continuous variable, a continuous scale would be used. For example, we can try that with Age:

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, aes(col = Age), size = 0.5)
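It is worth spelling out the difference between mapping and setting an aesthetic. Inside aes(), col refers to a column of the data and gets a scale and a legend; outside aes(), col is a fixed value applied to every point. The sketch below shows both on an invented data frame (demo is a hypothetical stand-in for metab).

```r
library(ggplot2)

# Small invented data frame standing in for metab
set.seed(1)
demo <- data.frame(
  Treatment     = rep(c("A", "B", "CTRL"), each = 20),
  Sex           = sample(c("F", "M"), 60, replace = TRUE),
  Concentration = rnorm(60, mean = 380, sd = 100)
)

base <- ggplot(demo, aes(x = Treatment, y = Concentration))

# Mapping: colour depends on the Sex column, and a legend is drawn
p_mapped <- base + geom_jitter(aes(col = Sex), width = 0.1, height = 0)

# Setting: every point is navy, and no legend is needed
p_fixed <- base + geom_jitter(col = "navy", width = 0.1, height = 0)
```

This is also why size = 0.5 sits outside aes() in the examples above: it is a constant, not a variable.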

## Custom scales

The colour scale can be easily changed, should you not like the default dark-to-light blue gradient. The scale_color_xxx functions will help you with that. For example, with scale_color_gradient2 we can easily specify a diverging “low-mid-high” gradient centred on a chosen midpoint; we can specify colours using their name, or their RGB components either in hexadecimal (as in this example) or in decimal using the rgb function.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, aes(col = Age), size = 0.5) +
scale_color_gradient2(low = "#9d3b4a", mid = "beige",
high = "#00426e", midpoint = 60)
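For the decimal route, here is a short sketch of how the two hexadecimal colours used above could be written with rgb. The only thing to remember is maxColorValue = 255, since rgb expects values between 0 and 1 by default.

```r
# The same two colours, written as decimal red/green/blue components
low_col  <- rgb(157, 59, 74, maxColorValue = 255)
high_col <- rgb(0, 66, 110, maxColorValue = 255)

low_col   # "#9D3B4A"
high_col  # "#00426E"
```

The resulting strings can be passed straight to the low and high arguments of scale_color_gradient2.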

We can even go further and map Sex to the colour of the dots and Age to their size. I am using a custom discrete scale this time.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, aes(col = Sex, size = Age)) +
scale_color_manual(values = c("navy", "orange"))
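With an unnamed vector, the colours are assigned in the order of the levels (here F, then M). If you prefer to make the pairing explicit, scale_color_manual also accepts a named vector. A minimal sketch on invented data, where demo and sex_cols are names made up for this example:

```r
library(ggplot2)

# Small invented data frame standing in for metab
set.seed(1)
demo <- data.frame(
  Treatment     = rep(c("A", "B", "CTRL"), each = 20),
  Sex           = sample(c("F", "M"), 60, replace = TRUE),
  Concentration = rnorm(60, mean = 380, sd = 100)
)

# Names match the values of Sex, so the pairing does not depend on order
sex_cols <- c("F" = "navy", "M" = "orange")

p_named <- ggplot(demo, aes(x = Treatment, y = Concentration)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(aes(col = Sex), width = 0.1, height = 0) +
  scale_color_manual(values = sex_cols)
```

This keeps the colours stable even if the levels are later reordered.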

Similarly to what we did before, we can customise how big the dots should be by using scale_size_continuous.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, aes(col = Sex, size = Age)) +
scale_size_continuous(range = c(0.5, 2)) +
scale_color_manual(values = c("navy", "orange"))

## Faceting

The plot above is pretty, but it is still quite hard to see the differences between M and F clearly. One way we could go about this is faceting, that is, splitting the plot into subplots.

ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, size = 0.5) +
facet_grid(~Sex)

The ~Sex notation tells ggplot to split the plot by Sex.
This can also be used on multiple variables. Let’s create a new one, by dividing people into Young (<=60 years old) and Old (>60 years old).

metab$AgeCateg <- ifelse(metab$Age <= 60,
"Young", "Old")
ggplot(data = metab, aes(x = Treatment,
y = Concentration)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(width = 0.1, size = 0.5) +
facet_grid(AgeCateg~Sex)
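Note that ifelse returns plain text, and ggplot orders text values alphabetically, so the Old row is drawn above the Young one. If you would rather control the order, one option is to build the category as a factor with explicit levels, for instance with cut. The sketch below uses a few made-up ages to show the idea.

```r
# A few made-up ages, including the boundary value 60
age <- c(35, 60, 61, 80)

# (-Inf, 60] becomes "Young" and (60, Inf) becomes "Old";
# the order of the labels sets the order of the levels
age_categ <- cut(age, breaks = c(-Inf, 60, Inf), labels = c("Young", "Old"))

as.character(age_categ)  # "Young" "Young" "Old" "Old"
levels(age_categ)        # "Young" "Old"
```

Applied to the real data, this would be metab$AgeCateg <- cut(metab$Age, breaks = c(-Inf, 60, Inf), labels = c("Young", "Old")).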

The faceting notation defines how the plot is split: with var1 ~ var2 we split by var1 in rows and by var2 in columns. To split by one variable into columns only we can use ~ var1, while to split into rows only we use var1 ~ . (the dot means “no faceting on this side”).
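If the formula notation feels cryptic, facet_grid also accepts the rows and cols arguments together with vars(), which some people find easier to read. The sketch below shows the two spellings on an invented data frame (demo is a hypothetical stand-in for metab).

```r
library(ggplot2)

# Small invented data frame standing in for metab
set.seed(1)
demo <- data.frame(
  Treatment     = rep(c("A", "B", "CTRL"), each = 40),
  Sex           = sample(c("F", "M"), 120, replace = TRUE),
  AgeCateg      = sample(c("Young", "Old"), 120, replace = TRUE),
  Concentration = rnorm(120, mean = 380, sd = 100)
)

base <- ggplot(demo, aes(x = Treatment, y = Concentration)) +
  geom_boxplot(outlier.shape = NA)

# Formula notation: rows ~ columns
p_formula <- base + facet_grid(AgeCateg ~ Sex)

# The same split, spelled out with rows and cols
p_vars <- base + facet_grid(rows = vars(AgeCateg), cols = vars(Sex))
```

Both plots should show the same grid of panels; which notation to use is mostly a matter of taste.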

I will finish this post with a more complex graph, where we add a smoother to the data, for example a linear regression line.

ggplot(data = metab, aes(x = Age, y = Concentration)) +
geom_point(aes(col = Sex), size = 0.8) +
geom_smooth(method = "lm")

And, by just adding another aesthetic, we can easily have one regression for men and one for women, and map Age to size.

ggplot(data = metab, aes(x = Age, y = Concentration)) +
geom_point(aes(col = Sex, size = Age)) +
geom_smooth(method = "lm", aes(col = Sex)) +
scale_size_continuous(range = c(0.5, 2))

In a few lines of code, we have created a complex visualisation using ggplot. From this we can clearly see that Age has a different effect on men and women. That is, there is an interaction between Age and Sex (don’t know what that is? Look at my post on interactions in linear models!)
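The plot suggests an interaction, but to check it formally you would fit the corresponding linear model. Here is a minimal sketch on simulated data, where the different slopes for men and women are built in by hand (an assumption made for illustration, not a result from the metab dataset).

```r
# Simulated data with a deliberately different Age slope for each sex
set.seed(1)
demo <- data.frame(
  Sex = rep(c("F", "M"), each = 50),
  Age = sample(35:80, 100, replace = TRUE)
)
slope <- ifelse(demo$Sex == "F", 8, 3)  # invented effect sizes
demo$Concentration <- 100 + slope * demo$Age + rnorm(100, sd = 40)

# Age * Sex expands to Age + Sex + Age:Sex
fit <- lm(Concentration ~ Age * Sex, data = demo)

# The Age:SexM row estimates how much the slope differs for men
summary(fit)$coefficients
```

On your own data, the same formula with data = metab fits the model behind the two regression lines drawn by geom_smooth.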

Hopefully, this has given you a quick taste of what you can do with ggplot. This is just the tip of the iceberg, though, and there is a lot more you can do!