In this tutorial, we will work through examples of computing correlations among all the numerical variables in a dataframe and visualizing them in multiple ways. We will use the corrr package from tidymodels to compute and visualize the correlations.
corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse.
Let us get started by loading the packages needed.
# install corrr if needed
install.packages("corrr")
library(tidyverse)
library(palmerpenguins)
library(corrr)
We will use the Palmer penguins dataset in this tutorial. Since we are mainly interested in computing correlations among the numerical variables, let us select the numerical columns for further analysis, leaving out the year column. We will also remove any rows with missing values for the sake of simplicity.
penguins <- penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))
Our data for computing correlations looks like this.
penguins %>%
  head()
## # A tibble: 6 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <int>       <int>
## 1           39.1          18.7               181        3750
## 2           39.5          17.4               186        3800
## 3           40.3          18                 195        3250
## 4           36.7          19.3               193        3450
## 5           39.3          20.6               190        3650
## 6           38.9          17.8               181        3625
Computing correlation with correlate
The correlate() function is one of the key functions in the corrr package. It computes correlations on a dataframe and uses Pearson correlation by default.
penguins_cor <- penguins %>%
  correlate()
##
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
correlate() computes the correlation of every variable in the dataframe with every other variable. The result is a symmetric tibble of correlation values, with NA on the diagonal.
penguins_cor
## # A tibble: 4 x 5
##   term              bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>                      <dbl>         <dbl>             <dbl>       <dbl>
## 1 bill_length_mm            NA            -0.229             0.653       0.589
## 2 bill_depth_mm             -0.229        NA                -0.578      -0.472
## 3 flipper_length_mm          0.653        -0.578            NA           0.873
## 4 body_mass_g                0.589        -0.472             0.873      NA
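As a quick sanity check, the sketch below compares one value from the correlation data frame with base R's cor(), and also computes Spearman correlations through the method argument of correlate(). The name penguins_num is used only in this sketch (it avoids overwriting penguins), and the code assumes a corrr version in which the first column is called term, as in the output above.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as in the tutorial; `penguins_num` is a name used only here.
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

# Pearson (the default) and Spearman rank correlation.
# quiet = TRUE suppresses the informational message.
cor_pearson  <- correlate(penguins_num, quiet = TRUE)
cor_spearman <- correlate(penguins_num, method = "spearman", quiet = TRUE)

# Pull one pair out of the correlation data frame ...
r_corrr <- cor_pearson %>%
  filter(term == "flipper_length_mm") %>%
  pull(body_mass_g)

# ... and compare it with base R's cor() on the same two columns.
r_base <- cor(penguins_num$flipper_length_mm, penguins_num$body_mass_g)

all.equal(r_corrr, r_base)
```

Because the rows with missing values were dropped beforehand, the pairwise handling of missing values in correlate() should make no difference here, so the two numbers are expected to agree.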
We can also rearrange the correlation dataframe using the rearrange() function. It reorders the tibble so that highly correlated variables are grouped closer together, instead of keeping the original variable order.
penguins_cor %>%
  rearrange()
## # A tibble: 4 × 5
##   term              flipper_length_mm body_mass_g bill_length_mm bill_depth_mm
##   <chr>                         <dbl>       <dbl>          <dbl>         <dbl>
## 1 flipper_length_mm            NA           0.873          0.653        -0.578
## 2 body_mass_g                   0.873      NA              0.589        -0.472
## 3 bill_length_mm                0.653       0.589         NA            -0.229
## 4 bill_depth_mm                -0.578      -0.472         -0.229        NA
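Since the result is an ordinary tibble, it can also be reshaped with corrr's stretch() and sorted with dplyr to list variable pairs by strength. This is a minimal sketch rather than part of the original walkthrough; penguins_num and cor_pairs are names introduced only for illustration.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as in the tutorial; `penguins_num` is a name used only here.
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

cor_pairs <- penguins_num %>%
  correlate(quiet = TRUE) %>%
  shave() %>%                 # set the upper triangle to NA so each pair appears once
  stretch(na.rm = TRUE) %>%   # long format with columns x, y and r; NA entries dropped
  arrange(desc(abs(r)))       # strongest correlations (positive or negative) first

cor_pairs
```

With four variables this gives six rows, one per pair, and the flipper length and body mass pair should come first.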
Visualizing the correlation with dotplot
With the rplot() function we can visualize the correlations as a dot plot, with a default color palette for correlation values ranging from -1 to +1.
Here we plot the correlation after rearranging by its strength as shown above.
penguins %>%
  correlate() %>%
  rearrange() %>%
  rplot()
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
Instead of plotting the full symmetric correlation matrix as a dot plot, we can plot the lower triangle alone. To do that, we shave off the upper triangle using the shave() function and then use rplot() to plot the result. In this example, we also set print_cor=TRUE to print the correlation values on top of the dots.
penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange() %>%
  shave() %>%
  rplot(print_cor=TRUE)
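rplot() returns a ggplot object, so the dot plot can be adjusted with the usual ggplot2 tools. The sketch below changes the palette through the colours argument of rplot() and rotates the x-axis labels; the specific colours, the rotation angle, and the names penguins_num and p are illustrative choices, not something the original plot uses.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as in the tutorial; `penguins_num` is a name used only here.
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

p <- penguins_num %>%
  correlate(quiet = TRUE) %>%
  rearrange() %>%
  shave() %>%
  rplot(
    print_cor = TRUE,
    colours = c("firebrick", "white", "steelblue")  # low, middle, high
  ) +
  # Rotate the variable names so that long labels do not overlap.
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p
```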
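The corrr package also has a nice function to visualize the correlations as a network. With network_plot() we can make a graph/network with the variables as nodes and the correlation values between variables as edges. The min_cor argument sets the minimum absolute correlation a pair needs in order to be drawn as an edge.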
penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  network_plot(min_cor=0.1)
ggsave("correlation_network_plot_in_corrr.png")
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
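ggsave() saves the last plot displayed unless told otherwise, so it can be safer to store the network plot in a variable and pass it explicitly. The sketch below does that with a stricter threshold of min_cor = 0.3; the threshold, the figure size, the temporary output folder, and the names penguins_num, net and out_file are assumptions made for illustration.

```r
library(tidyverse)
library(palmerpenguins)
library(corrr)

# Same preparation as in the tutorial; `penguins_num` is a name used only here.
penguins_num <- palmerpenguins::penguins %>%
  drop_na() %>%
  select(-year) %>%
  select(where(is.numeric))

# Keep only edges whose absolute correlation is at least 0.3.
net <- penguins_num %>%
  correlate(quiet = TRUE) %>%
  network_plot(min_cor = 0.3)

# Write to a temporary folder here; replace with any path you like.
out_file <- file.path(tempdir(), "correlation_network_plot_in_corrr.png")
ggsave(out_file, plot = net, width = 6, height = 5)
```

With this threshold, the weakest pair in the table above (bill length and bill depth, -0.229) should no longer be connected by an edge.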