Boxplots with data points are a great way to visualize summary information across distributions while still showing the actual data points. When the data points are paired, it is often useful to connect each pair with a line. Lines between points in two groups or time points immediately reveal the direction and size of the change for each pair.
In this post, we will see an example of how to connect paired data points on a boxplot with lines using ggplot2 in R. We will start by making a simple boxplot between two time points and work towards adding lines that connect the same samples at the two time points. Then we will learn how to customize the connected boxplots with offset data points and colors matching the boxplots.
Loading Packages and Data for connecting boxplots with lines
Let us load the tidyverse and gapminder packages. We will work with the gapminder dataset to make boxplots connected by lines.
library(tidyverse)
library(gapminder)
theme_set(theme_bw(16))
Let us first simplify the gapminder dataframe. We first filter the data for just two time points and for just one continent. Then we create a new variable that specifies which data points are paired.
In this example, the lifeExp values for the same country at the two time points are the paired data points. Pairing them will help us see how lifeExp has changed between the two time points.
library(gapminder)
df = gapminder %>%
  filter(year %in% c(1952,2007)) %>%
  filter(continent %in% c("Americas")) %>%
  select(country,year,lifeExp) %>%
  mutate(paired = rep(1:(n()/2),each=2),
         year=factor(year))
Now we have our dataframe ready for making a boxplot with points connected by lines. Notice that the value of the “paired” variable is the same for both rows of a country. For example, the first two rows, corresponding to Argentina, have the value “1”, and the 3rd and 4th rows, corresponding to Bolivia, have the value “2”.
df %>% head()
## # A tibble: 6 x 4
##   country   year  lifeExp paired
##   <fct>     <fct>   <dbl>  <int>
## 1 Argentina 1952     62.5      1
## 2 Argentina 2007     75.3      1
## 3 Bolivia   1952     40.4      2
## 4 Bolivia   2007     65.6      2
## 5 Brazil    1952     50.9      3
## 6 Brazil    2007     72.4      3
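The rep() call above relies on the rows being sorted so that each country's two years sit next to each other. As a minimal sketch of an alternative, the pair index can be derived from country itself, so it stays correct even if the rows are reordered. The name df_alt is only an illustrative choice, not part of the original post.

```r
library(dplyr)
library(gapminder)

# Hypothetical alternative: build the pair id from the country column
df_alt <- gapminder %>%
  filter(year %in% c(1952, 2007), continent == "Americas") %>%
  select(country, year, lifeExp) %>%
  mutate(
    # Position of each country's first appearance: both rows of a country
    # get the same id, regardless of row order
    paired = match(country, unique(country)),
    year = factor(year)
  )

# Sanity check: every pair id should appear exactly twice
stopifnot(all(table(df_alt$paired) == 2))
```

With the gapminder rows in their default order this should give the same ids as the rep() approach.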
Simple Boxplots with ggplot2
To start with, we will see an example of how to make a simple boxplot using ggplot2 in R. We put year on the x-axis and lifeExp on the y-axis, and fill the boxplots by year. We use geom_boxplot() to make the boxplot with ggplot2.
df %>%
  ggplot(aes(year,lifeExp, fill=year)) +
  geom_boxplot() +
  theme(legend.position = "none")
First attempt at Connecting Paired Points on Boxplots with ggplot2
Let us first add data points to the boxplot using the geom_point() function in ggplot2. To connect the data points between the two time points with lines, we use the geom_line() function and pass the variable “paired” to the group aesthetic, which specifies which data points to connect.
df %>%
  ggplot(aes(year,lifeExp, fill=year)) +
  geom_boxplot() +
  geom_point() +
  geom_line(aes(group=paired)) +
  theme(legend.position = "none")
Our first effort to make a boxplot with data points connected by lines is successful. Each data point at one time point is connected by a line to its paired point at the other time point.
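The slope of each line corresponds to the change in lifeExp for one country. As a numeric companion to the plot, here is a small sketch that computes that change per country with tidyr's pivot_wider(). The object name life_change and the column names y1952, y2007 and change are illustrative choices, not part of the original post.

```r
library(dplyr)
library(tidyr)
library(gapminder)

# Same subset as in the post: two years, one continent
life_change <- gapminder %>%
  filter(year %in% c(1952, 2007), continent == "Americas") %>%
  select(country, year, lifeExp) %>%
  # One row per country, one column per year (y1952, y2007)
  pivot_wider(names_from = year, values_from = lifeExp, names_prefix = "y") %>%
  # Change in life expectancy between the two time points
  mutate(change = y2007 - y1952) %>%
  arrange(change)

head(life_change)
```

Sorting by change makes it easy to match the flattest and steepest lines in the plot to specific countries.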
Connecting Paired Points with jitter on Boxplots with ggplot2
Although our first try at connecting paired points with lines is successful, many of the data points overlap, which causes over-plotting. A better solution is to have jittered data points on the boxplot and have the lines connect the jittered data points.
Let us try changing geom_point() function to geom_jitter().
df %>%
  ggplot(aes(year,lifeExp, fill=year)) +
  geom_boxplot() +
  geom_line(aes(group=paired)) +
  geom_jitter(aes(fill=year,group=paired), width=0.15) +
  theme(legend.position = "none")
Here our attempt to connect jittered points between the two groups did not succeed. geom_jitter() moves each point horizontally by a random amount, but geom_line() still draws from the original, unjittered positions, so the lines do not start and end on the data points.
How to Connect Paired Points with lines on Boxplots with ggplot2?
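The problem was that the lines did not use the shifted positions of the points. A solution is to specify the same position adjustment for both the data points and the lines.
Here we use the position argument in both the geom_line() and geom_point() functions. Passing the same “position = position_dodge(0.2)” to both layers, together with the same group=paired mapping, shifts each point and the matching end of its line sideways by the same amount, so the lines meet the points. Note that position_dodge() spreads the points deterministically by group rather than randomly, so the result resembles jitter without being random.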
df %>%
  ggplot(aes(year,lifeExp, fill=year)) +
  geom_boxplot() +
  geom_line(aes(group=paired), position = position_dodge(0.2)) +
  geom_point(aes(fill=year,group=paired), position = position_dodge(0.2)) +
  theme(legend.position = "none")
Voila, every line now starts and ends on its data points, and our boxplot with lines looks great.
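If truly random jitter is preferred over dodging, one possible approach is to share a single position_jitter() object with a fixed seed between the two layers. This is a sketch, not the method used in this post: it assumes both layers see the rows in the same order (true here, since df is sorted by pair and then year), and the names pj and p_jitter are illustrative.

```r
library(tidyverse)
library(gapminder)

# Rebuild the data frame from the post so the example is self-contained
df <- gapminder %>%
  filter(year %in% c(1952, 2007), continent == "Americas") %>%
  select(country, year, lifeExp) %>%
  mutate(paired = rep(1:(n() / 2), each = 2), year = factor(year))

# One shared jitter specification: the fixed seed makes both layers draw
# the same random offsets; height = 0 leaves the lifeExp values untouched
pj <- position_jitter(width = 0.15, height = 0, seed = 1)

p_jitter <- df %>%
  ggplot(aes(year, lifeExp, fill = year)) +
  geom_boxplot() +
  geom_line(aes(group = paired), position = pj) +
  geom_point(aes(group = paired), position = pj) +
  theme(legend.position = "none")

p_jitter
```

Because the offsets depend on row order, it is worth checking the plot visually if the data are sorted differently.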
Customizing a ggplot with lines connecting Paired Points
The data points connected by lines are black in the above example. We can further customize the plot by giving the data points the same color as their boxplots.
df %>%
  ggplot(aes(year,lifeExp, fill=year)) +
  geom_boxplot() +
  geom_line(aes(group=paired), position = position_dodge(0.2)) +
  geom_point(aes(fill=year,group=paired), size=2, shape=21, position = position_dodge(0.2)) +
  theme(legend.position = "none")
We set the size and shape arguments inside the geom_point() function; shape=21 is a filled circle, a point shape that accepts a fill color. Now the colors of the data points match the boxplot colors.