Multiple Density Plots and Coloring by Variable with ggplot2

density_plot_add_color_to_density_line_ggplot2_R

In this tutorial, we will learn how to make multiple density plots in R using ggplot2. Multiple density plots are useful when you have a quantitative variable and a categorical variable with multiple levels. We will start by making multiple overlapping density plots and then see four ways to customize them and make them look better.

Load Packages and Datasets

Let us load tidyverse and set the default theme to theme_bw() with a base font size of 16, which makes the axis labels and other text easier to read.

library(tidyverse)
theme_set(theme_bw(base_size=16))

We will make density plots using the 2019 Stack Overflow survey data. The results from the 2019 survey have already been processed and are available at datavizpyr.com's GitHub page.

stackoverflow_file <- "https://raw.githubusercontent.com/datavizpyr/data/master/SO_data_2019/StackOverflow_survey_filtered_subsampled_2019.csv"
# read file
survey_results <- read_csv(stackoverflow_file)

To make multiple density plots with ggplot2, we will mainly use the distribution of salary (the CompTotal column) and the Manager column, a categorical variable with two levels: individual contributors and managers in the US.


## # A tibble: 5 x 4
##   CompTotal Gender Manager YearsCode
##       <dbl> <chr>  <chr>   <chr>    
## 1    180000 Man    IC      25       
## 2     55000 Man    IC      5        
## 3     77000 Man    IC      6        
## 4     67017 Man    IC      4        
## 5     90000 Man    IC      6
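Before plotting, it can help to check how many rows fall into each level of the categorical variable. The sketch below retypes the five rows printed above into a small tibble named preview (a hypothetical name, and not the full survey data) and counts rows per Manager level. With the real data, the same calls would be run on survey_results.

```r
library(dplyr)
library(tibble)

# The five rows shown in the printed output above, retyped by hand.
# This is only a preview, not the full survey data.
preview <- tibble(
  CompTotal = c(180000, 55000, 77000, 67017, 90000),
  Gender    = rep("Man", 5),
  Manager   = rep("IC", 5),
  YearsCode = c("25", "5", "6", "4", "6")
)

# Number of rows per level of the categorical variable
manager_counts <- preview %>% count(Manager)
print(manager_counts)

# Quick look at the spread of the quantitative variable
print(summary(preview$CompTotal))
```

Because the preview only contains rows labeled IC, the count shows a single level here; the full data should show both levels.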

How to Make Multiple Density Plots with ggplot2

Let us first make a simple multiple density plot in R with ggplot2. We learned earlier that we can make density plots in ggplot2 using the geom_density() function. To make multiple density plots, we need to specify the categorical variable as a second variable. In this example, we specify the categorical variable with the “fill” argument within the aes() function inside ggplot(), and then add geom_density() as before.

survey_results%>%
  ggplot(aes(x=CompTotal, fill=Manager)) +
  geom_density()+ 
  labs(x= "Salary",
       subtitle="Manager and Ind. Contributor\nSalary Distribution in US",
       caption="Data Source: StackOverflow Survey Results 2019")

ggsave("simple_density_plot_with_ggplot2_R.jpg")

We get a multiple density plot filled with two colors, one for each level of the categorical variable. If our categorical variable had five levels, ggplot2 would draw five densities.
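To illustrate that the number of densities follows the number of levels, here is a minimal sketch using simulated data rather than the survey data. The data frame toy and its columns value and group are made up for this example. layer_data() returns the values ggplot2 computed for the layer, including one group id per density curve.

```r
library(ggplot2)

set.seed(1)
# Simulated data: five hypothetical groups with 200 values each
toy <- data.frame(
  value = rnorm(1000, mean = rep(1:5, each = 200)),
  group = rep(paste0("level_", 1:5), each = 200)
)

p <- ggplot(toy, aes(x = value, fill = group)) +
  geom_density()

# Count the distinct density curves computed for the layer
n_curves <- length(unique(layer_data(p)$group))
print(n_curves)
```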

simple_density_plot_with_ggplot2_R

Multiple Density Plots with log scale

We can see that our density plot is skewed due to individuals with higher salaries. We can reduce that skew by making the plot on a log scale. In ggplot2, we can transform the x-axis values to a log scale using the scale_x_log10() function.

survey_results%>%
  ggplot(aes(x=CompTotal, fill=Manager)) +
  geom_density()+ 
  scale_x_log10()+
  labs(x= "Salary",
       subtitle="Manager and Ind. Contributor\nSalary Distribution in US",
       caption="Data Source: StackOverflow Survey Results 2019")
#ggsave("density_plot_scale_x_log10_with_ggplot2_R.jpg")

Now our multiple density plot looks much better with a log scale on the x-axis.
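One detail worth keeping in mind is that scale_x_log10() transforms the values before the density is estimated, so the curve is a density of log10(salary) rather than of the raw salary. Zero or negative values have no finite log and are dropped with a warning. The sketch below uses simulated log-normal data (the toy data frame and its salary and role columns are assumptions for illustration, not the survey data) and inspects the computed x positions.

```r
library(ggplot2)

set.seed(42)
# Simulated right-skewed "salaries" (log-normal), not the survey data
toy <- data.frame(
  salary = c(rlnorm(500, meanlog = log(90000), sdlog = 0.5),
             rlnorm(500, meanlog = log(120000), sdlog = 0.5)),
  role = rep(c("IC", "Manager"), each = 500)
)

p_log <- ggplot(toy, aes(x = salary, fill = role)) +
  geom_density() +
  scale_x_log10()

# The computed x positions are on the log10 scale, not in raw units
built_x <- layer_data(p_log)$x
print(range(built_x))
print(range(log10(toy$salary)))
```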

density_plot_log_scale_with_ggplot2_R

Multiple Density Plots with transparency

Another problem with our density plot is that the fill color makes it difficult to see both distributions where they overlap. We can solve this issue by adding transparency to the density plots using the alpha argument.

survey_results%>%
  ggplot(aes(x=CompTotal, fill=Manager)) +
  geom_density(alpha=0.3)+ 
  scale_x_log10()+
  labs(x= "Salary",
       subtitle="Manager and Ind. Contributor\nSalary Distribution in US",
       caption="Data Source: StackOverflow Survey Results 2019")
#ggsave("density_plot_with_transparency_ggplot2_R.jpg")

In this example, we set the transparency level with alpha=0.3 inside the geom_density() function. Now we can see the distribution of salaries for both groups.
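Because alpha is given outside aes(), it is a fixed setting applied to every curve rather than something mapped to a variable. geom_density() also uses position "identity" by default, so the curves overlap instead of stacking, which is why transparency helps. A minimal sketch with simulated data (toy, salary and role are made-up names, not the survey data) confirms that the same alpha value is used for all curves:

```r
library(ggplot2)

set.seed(42)
# Simulated right-skewed "salaries" (log-normal), not the survey data
toy <- data.frame(
  salary = c(rlnorm(500, meanlog = log(90000), sdlog = 0.5),
             rlnorm(500, meanlog = log(120000), sdlog = 0.5)),
  role = rep(c("IC", "Manager"), each = 500)
)

p_alpha <- ggplot(toy, aes(x = salary, fill = role)) +
  geom_density(alpha = 0.3) +
  scale_x_log10()

# alpha is set outside aes(), so it is constant across both curves
alpha_values <- unique(layer_data(p_alpha)$alpha)
print(alpha_values)
```

Values closer to 0 make the fill more transparent and values closer to 1 make it more opaque.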

density_plot_with_transparency_ggplot2_R

Color Density line in Multiple Density Plots by a Variable

Note that the outline around the density plot is black in color. We can color the outline of density plot with the same colors as the fill argument, using another argument “color” inside aes() function as shown below. Here we color the line by a variable in the data frame.

survey_results%>%
  ggplot(aes(x=CompTotal, color=Manager, fill=Manager)) +
  geom_density(alpha=0.3,size=1)+ 
  scale_x_log10()+
  labs(x= "Salary",
       subtitle="Manager and Ind. Contributor\nSalary Distribution in US",
       caption="Data Source: StackOverflow Survey Results 2019")

We have also increased the thickness of the outline using the size argument to geom_density().
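In ggplot2 3.4.0 and later, line thickness is controlled by the linewidth argument, and using size for lines produces a deprecation warning. The sketch below shows the same idea with linewidth. It assumes ggplot2 3.4.0 or newer and uses simulated data (toy, salary and role are made-up names, not the survey data).

```r
library(ggplot2)

# Assumption: ggplot2 >= 3.4.0, where `linewidth` replaced `size` for lines
stopifnot(packageVersion("ggplot2") >= "3.4.0")

set.seed(42)
# Simulated right-skewed "salaries" (log-normal), not the survey data
toy <- data.frame(
  salary = c(rlnorm(500, meanlog = log(90000), sdlog = 0.5),
             rlnorm(500, meanlog = log(120000), sdlog = 0.5)),
  role = rep(c("IC", "Manager"), each = 500)
)

p_outline <- ggplot(toy, aes(x = salary, color = role, fill = role)) +
  geom_density(alpha = 0.3, linewidth = 1) +
  scale_x_log10()

# One outline colour per level of the grouping variable
outline_colours <- unique(layer_data(p_outline)$colour)
print(outline_colours)
```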

density_plot_add_color_to_density_line_ggplot2_R

Using both the fill and color arguments colors the outline and fills the density plot. If we don’t want to fill the density plot, we can simply leave out the fill argument.

In the example below, we color the density plot outline but do not fill it with color.

survey_results%>%
  ggplot(aes(x=CompTotal, color=Manager)) +
  geom_density(alpha=0.3,size=1)+ 
  scale_x_log10()+
  labs(x= "Salary",
       subtitle="Manager and Ind. Contributor\nSalary Distribution in US",
       caption="Data Source: StackOverflow Survey Results 2019")
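As a quick check of what the outline-only version does, the sketch below builds the plot on simulated data (toy, salary and role are made-up names, not the survey data) and inspects the computed layer. The fill stays empty while the outline color varies by group. Since alpha applies to the fill in geom_density(), it should have no visible effect when nothing is filled, so it is left out here.

```r
library(ggplot2)

set.seed(42)
# Simulated right-skewed "salaries" (log-normal), not the survey data
toy <- data.frame(
  salary = c(rlnorm(500, meanlog = log(90000), sdlog = 0.5),
             rlnorm(500, meanlog = log(120000), sdlog = 0.5)),
  role = rep(c("IC", "Manager"), each = 500)
)

p_lines <- ggplot(toy, aes(x = salary, color = role)) +
  geom_density() +
  scale_x_log10()

built <- layer_data(p_lines)

# No fill is mapped, so the default (NA, i.e. no fill) is used
fill_is_empty <- all(is.na(built$fill))
# The outline colour differs between the two groups
n_colours <- length(unique(built$colour))
print(fill_is_empty)
print(n_colours)
```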
Multiple Density Plot Coloring by Variable