How To Make a Scree Plot in R with ggplot2

Scree plot: barplot with geom_col()

PCA, or Principal Component Analysis, is one of the most commonly used unsupervised learning techniques in machine learning. Running PCA on high-dimensional data can reveal patterns or structure in the data. The scree plot is one of the diagnostic tools associated with PCA, and it helps us understand the data better.

A scree plot visualizes the variance explained, i.e. the proportion of the total variation, by each principal component from PCA. A dataset with many similar (correlated) features will have just a few principal components explaining most of the variation in the data.

In this tutorial, we will learn how to make a scree plot using ggplot2 in R. We will use the Palmer Penguins dataset to do PCA and show two ways to create a scree plot. First, we will make a scree plot as a line plot, with principal components on the x-axis and the variance explained by each PC shown as points connected by a line. Then we will make a scree plot as a bar plot, with principal components on the x-axis and the height of each bar representing the variance explained.

Principal Component Analysis (PCA) in R

Let us load the packages needed to do PCA in R and make a scree plot.

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(24))

Our penguins data has three categorical variables and five numerical variables.

head(penguins)
## # A tibble: 6 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

We need only the numerical variables for PCA, so let us select them from the penguins data frame.

penguins_data <- penguins %>%
  select(bill_length_mm:year, -sex)
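
As a side note, the same columns can also be picked by type instead of by position. The sketch below is one possible alternative, assuming a dplyr version that provides where() (1.0.0 or later); the names by_range and by_type are only illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Approach used in this tutorial: take a column range and drop the factor inside it
by_range <- penguins %>%
  select(bill_length_mm:year, -sex)

# Alternative (assumes dplyr >= 1.0.0): keep every numeric column
by_type <- penguins %>%
  select(where(is.numeric))

# Check that both approaches select the same columns in the same order
identical(names(by_range), names(by_type))
```

Selecting by type avoids having to remember which non-numeric columns sit inside the range.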

Now we are ready to do PCA. We use the prcomp() function in R to do Principal Component Analysis. We filter out any rows with missing values and scale the variables to unit variance before doing PCA with prcomp().

pca_obj <- prcomp(drop_na(penguins_data), scale. = TRUE)
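
To get a feel for why we scale, here is a minimal sketch that runs PCA with and without scale. = TRUE and compares the proportion of variance explained. It uses base R's na.omit() in place of drop_na(), and prop_var() is a hypothetical helper defined only for this illustration. Without scaling, we would expect variables recorded on larger numeric scales, such as body_mass_g in grams, to dominate the first PC.

```r
library(palmerpenguins)

# The five numeric columns used in this tutorial, with incomplete rows removed
num_cols <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm",
              "body_mass_g", "year")
penguins_complete <- na.omit(as.data.frame(penguins[, num_cols]))

# PCA with and without scaling each variable to unit variance
pca_scaled   <- prcomp(penguins_complete, scale. = TRUE)
pca_unscaled <- prcomp(penguins_complete, scale. = FALSE)

# Hypothetical helper: proportion of variance explained by each PC
prop_var <- function(pca) pca$sdev^2 / sum(pca$sdev^2)

# One row per PCA run, one column per principal component
round(rbind(scaled   = prop_var(pca_scaled),
            unscaled = prop_var(pca_unscaled)), 3)
```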

Compute Variance Explained by Each PC

Let us look at the contribution, or importance, of each principal component using the summary() function.

summary(pca_obj)
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5
## Standard deviation     1.664 0.9967 0.8786 0.60433 0.31512
## Proportion of Variance 0.554 0.1987 0.1544 0.07304 0.01986
## Cumulative Proportion  0.554 0.7527 0.9071 0.98014 1.00000

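We will create a data frame containing the PCs and the variance explained by each PC. We can compute the proportion of variation for each PC from sdev, the standard deviations stored in the PCA result: squaring each standard deviation gives the variance of that PC, and dividing by the sum of all the variances gives the proportion explained.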

var_explained_df <- data.frame(PC= paste0("PC",1:5),
                               var_explained=(pca_obj$sdev)^2/sum((pca_obj$sdev)^2))

head(var_explained_df)
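
As a sanity check, the hand-computed proportions should agree with the "Proportion of Variance" row that summary() reports. The sketch below rebuilds the data frame without hard-coding the number of PCs and stores PC as a factor, so that the x-axis order stays PC1, PC2, ... even if there were ten or more components. The names pc_names and reported are illustrative only, and the tolerance is an assumption to allow for the rounding that summary() applies.

```r
library(palmerpenguins)
library(dplyr)
library(tidyr)

# Same PCA as in the tutorial: numeric columns, complete rows, scaled
pca_obj <- penguins %>%
  select(bill_length_mm:year, -sex) %>%
  drop_na() %>%
  prcomp(scale. = TRUE)

pc_names <- paste0("PC", seq_along(pca_obj$sdev))
var_explained_df <- data.frame(
  # Explicit factor levels keep the PCs in numeric order when plotted
  PC = factor(pc_names, levels = pc_names),
  var_explained = pca_obj$sdev^2 / sum(pca_obj$sdev^2)
)

# Compare with the values reported by summary(), which are rounded
reported <- summary(pca_obj)$importance["Proportion of Variance", ]
all.equal(unname(reported), var_explained_df$var_explained, tolerance = 1e-4)
```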

Scree plot with line plot in R

Now, we have the data to make a scree plot. First, we make a scree plot as a line plot using geom_point() and geom_line() as shown below.

var_explained_df %>%
  ggplot(aes(x=PC,y=var_explained, group=1))+
  geom_point(size=4)+
  geom_line()+
  labs(title="Scree plot: PCA on scaled data")

We can see that the first PC explains over 55% of the variation and the second PC explains close to 20% of the variation in the data.

Scree plot with line plot using ggplot2 in R

Scree plot with bar plot in R

We can also make a scree plot as a bar plot, with PCs on the x-axis and the variance explained as the height of each bar.

var_explained_df %>%
  ggplot(aes(x=PC,y=var_explained))+
  geom_col()+
  labs(title="Scree plot: PCA on scaled data")

And we get the scree plot as a bar plot. We can see that the first three PCs together explain just over 90% of the variation in the data.

Scree plot: barplot with geom_col()
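
Statements like "the first three PCs explain about 90% of the variation" are about cumulative variance, which neither plot shows directly. A minimal sketch of one way to compute and plot it with cumsum() follows; the column name cum_var_explained and the object names cum_df and cum_plot are our own choices for this illustration.

```r
library(tidyverse)
library(palmerpenguins)

# Same PCA as in the tutorial
pca_obj <- penguins %>%
  select(bill_length_mm:year, -sex) %>%
  drop_na() %>%
  prcomp(scale. = TRUE)

# Per-PC proportion of variance plus its running total
cum_df <- data.frame(
  PC = paste0("PC", seq_along(pca_obj$sdev)),
  var_explained = pca_obj$sdev^2 / sum(pca_obj$sdev^2)
) %>%
  mutate(cum_var_explained = cumsum(var_explained))

# Line plot of the cumulative proportion of variance explained
cum_plot <- cum_df %>%
  ggplot(aes(x = PC, y = cum_var_explained, group = 1)) +
  geom_point(size = 4) +
  geom_line() +
  labs(title = "Cumulative variance explained: PCA on scaled data")

cum_plot
```

Reading the curve where it flattens out gives a rough guide to how many PCs are worth keeping.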