Visualizing Correlation with tidymodels’ corrr package

In this tutorial, we will compute correlations among all the numerical variables in a data frame and visualize them in several ways. We will use the corrr package from tidymodels both to compute the correlations and to plot them.

corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse.

Let us get started by loading the packages needed.

# install corrr if needed
install.packages("corrr")

library(tidyverse)
library(palmerpenguins)
library(corrr)

We will use the Palmer Penguins dataset in this tutorial. Since we are mainly interested in computing correlations among the numerical variables, let us select the numerical columns for further analysis. We drop the year column, which records the study year rather than a body measurement, and we also remove any rows with missing values for the sake of simplicity.

penguins <- penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

Our data for computing correlations looks like this.

penguins %>%
  head()

## # A tibble: 6 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <int>       <int>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           36.7          19.3               193        3450
## 5           39.3          20.6               190        3650
## 6           38.9          17.8               181        3625

Computing correlation with correlate

The correlate() function is one of the key functions in the corrr package. It computes correlations on a data frame and uses Pearson correlation by default.

penguins_cor <- penguins %>% 
  correlate()

## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
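Pearson correlation is only the default. As a minimal sketch, the method argument of correlate() can be set to "spearman" for a rank-based correlation, and quiet = TRUE suppresses the message shown above. The object names penguins_num and penguins_cor_spearman are chosen for this sketch only, and the data preparation repeats the steps above so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as above, stored under a new (illustrative) name
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

# Rank-based Spearman correlation; quiet = TRUE hides the method message
penguins_cor_spearman <- penguins_num %>%
  correlate(method = "spearman", quiet = TRUE)

penguins_cor_spearman
```

Spearman correlation can be a reasonable choice when the relationship between two variables is monotonic but not linear.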

correlate() computes the correlation of every variable in the data frame with every other variable. The result is a symmetric tibble of correlation values, with NA on the diagonal.

penguins_cor 

## # A tibble: 4 x 5
##   term              bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>                      <dbl>         <dbl>             <dbl>       <dbl>
## 1 bill_length_mm            NA            -0.229             0.653       0.589
## 2 bill_depth_mm             -0.229        NA                -0.578      -0.472
## 3 flipper_length_mm          0.653        -0.578            NA           0.873
## 4 body_mass_g                0.589        -0.472             0.873      NA
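Because the result is a tibble rather than a matrix, it can be explored with ordinary tidyverse verbs. The sketch below shows two possibilities: stretch() turns the correlations into a long format with one row per pair of variables, and focus() keeps only the correlations with selected variables. The object names are chosen for this sketch, and the data preparation is repeated so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as above, stored under a new (illustrative) name
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

penguins_cor_tbl <- penguins_num %>%
  correlate(quiet = TRUE)

# Long format: one row per unique pair of variables (columns x, y, r),
# sorted by the absolute strength of the correlation
cor_long <- penguins_cor_tbl %>%
  stretch(na.rm = TRUE, remove.dups = TRUE) %>%
  arrange(desc(abs(r)))

cor_long

# Correlations of every other variable with body mass only
cor_body_mass <- penguins_cor_tbl %>%
  focus(body_mass_g)

cor_body_mass
```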

We can also reorder the correlation data frame with the rearrange() function. It groups highly correlated variables closer together, so the tibble no longer follows the original variable order.

penguins_cor %>% 
  rearrange()

## # A tibble: 4 × 5
##   term              flipper_length_mm body_mass_g bill_length_mm bill_depth_mm
##   <chr>                         <dbl>       <dbl>          <dbl>         <dbl>
## 1 flipper_length_mm            NA           0.873          0.653        -0.578
## 2 body_mass_g                   0.873      NA              0.589        -0.472
## 3 bill_length_mm                0.653       0.589         NA            -0.229
## 4 bill_depth_mm                -0.578      -0.472         -0.229        NA
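By default, rearrange() works on the absolute values of the correlations, so strongly negative and strongly positive correlations are treated alike. A short sketch of the alternative, setting absolute = FALSE so that the sign is taken into account, is shown below. The object names are chosen for this sketch, and the data preparation is repeated so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as above, stored under a new (illustrative) name
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

penguins_cor_tbl <- penguins_num %>%
  correlate(quiet = TRUE)

# Default: order variables using the absolute correlations
by_absolute <- penguins_cor_tbl %>%
  rearrange()

# Alternative: take the sign of the correlations into account
by_signed <- penguins_cor_tbl %>%
  rearrange(absolute = FALSE)

# Compare the resulting variable orders
by_absolute$term
by_signed$term
```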

Visualizing the correlation with dotplot

With the rplot() function we can visualize the correlations as a dot plot, using a default color palette for correlations ranging from -1 to +1.

Here we plot the correlation after rearranging by its strength as shown above.

penguins %>% 
  correlate() %>%
  rearrange() %>%
  rplot()

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

Instead of plotting the full symmetric correlation matrix as a dot plot, we can plot the lower triangle alone. To do that, we shave off the upper triangle with the shave() function and then use rplot() to plot the result. In this example, we also ask rplot() to print the correlation values on top of the dots.

penguins %>% 
  drop_na() %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange() %>%
  shave()  %>%
  rplot(print_cor=TRUE)

Lower triangular correlation plot with corrr
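The shaved correlations can also be inspected as a table rather than a plot. As a minimal sketch, shave() sets the upper triangle to NA, and fashion() formats the result for printing by rounding the values and showing missing entries as blanks. The object names are chosen for this sketch, and the data preparation is repeated so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as above, stored under a new (illustrative) name
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

# Lower triangle only: the upper triangle and the diagonal are NA
lower_triangle <- penguins_num %>%
  correlate(quiet = TRUE) %>%
  rearrange() %>%
  shave()

# Formatted for reading: rounded values, NA printed as blank
lower_triangle_print <- lower_triangle %>%
  fashion(decimals = 2)

lower_triangle_print
```

Using shave(upper = FALSE) instead would keep the upper triangle and blank out the lower one.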

The corrr package also has a nice function to visualize the correlations as a network. With network_plot() we can draw a graph in which the nodes are the variables and the edges represent the correlations between them.

penguins %>% 
  drop_na() %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  network_plot(min_cor=0.1)
ggsave("correlation_network_plot_in_corrr.png")

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
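The min_cor argument sets the smallest absolute correlation that is drawn as an edge, so raising it hides weaker relationships. The sketch below uses a threshold of 0.5 and stores the plot in a variable so that it can be passed to ggsave() explicitly. The object names and the output file name are chosen for this sketch, and the data preparation is repeated so that the block runs on its own.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as above, stored under a new (illustrative) name
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

# Only draw edges whose absolute correlation is at least 0.5
network_strong <- penguins_num %>%
  correlate(quiet = TRUE) %>%
  network_plot(min_cor = 0.5)

network_strong

# Save the stored plot explicitly instead of relying on the last plot shown;
# the file name is an arbitrary example
ggsave("correlation_network_min_cor_0.5.png", plot = network_strong,
       width = 6, height = 5)
```

Passing the plot object to ggsave() avoids accidentally saving a different figure if another plot was displayed in between.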
