One of the common data exploring activities is to quickly have an understanding of the exten of missing data in a data set. There are a few quick ways to make a visualization of missing data. In this tutorial we will learn how to make stacked barplot with counts and proportion of missing data in each column of the data.
Let us load tidyvere packages.
library(tidyverse) theme_set(theme_bw(base_size=16))
We will use Hawks dataset from Rdatasets package.
url = "https://vincentarelbundock.github.io/Rdatasets/csv/Stat2Data/Hawks.csv" hawks <- read_csv(url) %>% select(-X1)
A quick glance at the data shows that it has columns with missing data.
hawks %>% head() ## # A tibble: 6 x 19 ## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex Wing ## <dbl> <dbl> <dbl> <chr> <time> <chr> <chr> <chr> <chr> <dbl> ## 1 9 19 1992 13:30 NA 877-76317 RT I <NA> 385 ## 2 9 22 1992 10:30 NA 877-76318 RT I <NA> 376 ## 3 9 23 1992 12:45 NA 877-76319 RT I <NA> 381 ## 4 9 23 1992 10:50 NA 745-49508 CH I F 265 ## 5 9 27 1992 11:15 NA 1253-98801 SS I F 205 ## 6 9 28 1992 11:25 NA 1207-55910 RT I <NA> 412 ## # … with 9 more variables: Weight <dbl>, Culmen <dbl>, Hallux <dbl>, ## # Tail <dbl>, StandardTail <dbl>, Tarsus <dbl>, WingPitFat <dbl>, ## # KeelFat <dbl>, Crop <dbl>
Now let us make a vertically stacked barplot for the numbers of missing data in each column.
Convert Dataframe to a Boolean Dataframe based on NAs
First, we need to find if each element of every column is missing data or not. We can use is.na() function and summarise_all() function to convert our dataframe into boolean dataframe. Note the use of list to apply a lambda function, here is.na() function to find the missing data.
hawks %>% summarise_all(list(~is.na(.)))
Here an element is True when the original dataframe has missing data and an element is False if it is not a missing data.
## # A tibble: 908 x 19 ## Month Day Year CaptureTime ReleaseTime BandNumber Species Age Sex ## <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> <lgl> ## 1 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE ## 2 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE ## 3 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE ## 4 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE ## 5 FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
Convert Wide Boolean Dataframe to a Long Dataframe with pivot_wider()
Our boolean data is still in wide form, we can use pivot_wider() from tidyr to reshape the dataframe to convert to long form.
hawks %>% summarise_all(list(~is.na(.)))%>% pivot_longer(everything(), names_to = "variables", values_to="missing")
## # A tibble: 17,252 x 2 ## variables missing ## <chr> <lgl> ## 1 Month FALSE ## 2 Day FALSE ## 3 Year FALSE ## 4 CaptureTime FALSE ## 5 ReleaseTime TRUE ## 6 BandNumber FALSE ## 7 Species FALSE ## 8 Age FALSE ## 9 Sex TRUE ## 10 Wing FALSE ## # … with 17,242 more rows
Note that we can first convert to tidy form and then use is.na() to find if the element is NA or not when our dataframe contains columns/data of the same type.
Count the number of missing values for each column variable
We are ready do compute the number of missing values and non-missing values for each column using count().
hawks %>% summarise_all(list(~is.na(.)))%>% pivot_longer(everything(), names_to = "variables", values_to="missing") %>% count(variables, missing)
## # A tibble: 32 x 3 ## variables missing n ## <chr> <lgl> <int> ## 1 Age FALSE 908 ## 2 BandNumber FALSE 906 ## 3 BandNumber TRUE 2 ## 4 CaptureTime FALSE 907 ## 5 CaptureTime TRUE 1 ## 6 Crop FALSE 565 ## 7 Crop TRUE 343 ## 8 Culmen FALSE 901 ## 9 Culmen TRUE 7 ## 10 Day FALSE 908 ## # … with 22 more rows
Stacked Barplot with missing value counts
Finally we are ready to make the stacked barplot with the missing value counts for each column in the data.
hawks %>% summarise_all(list(~is.na(.)))%>% pivot_longer(everything(), names_to = "variables", values_to="missing") %>% count(variables, missing) %>% ggplot(aes(y=variables,x=n,fill=missing))+ geom_col()
The stacked barplot clearly shows the extent of missing values in each of the columns.
Customizing Stacked Barplot with missing value counts
Let us customize the vertically stacked barplot of missing value counts.
hawks %>% summarise_all(list(~is.na(.)))%>% pivot_longer(everything(), names_to = "variables", values_to="missing") %>% count(variables, missing) %>% ggplot(aes(y=variables,x=n,fill=missing))+ geom_col()+ scale_fill_manual(values=c("skyblue3","gold"))+ theme(axis.title.y=element_blank())
Stacked Barplot with missing value proportions
Another way to visualize missing data using barplots is to plot the proportion instead of counts. To make barplots with proportion, we add the argument position=”fill” to geom_col() function.
hawks %>% summarise_all(list(~is.na(.)))%>% pivot_longer(everything(), names_to = "variables", values_to="missing") %>% count(variables, missing) %>% ggplot(aes(y=variables,x=n,fill=missing))+ geom_col(position = "fill")+ labs(x="Proportion")+ scale_fill_manual(values=c("skyblue3","gold"))+ theme(axis.title.y=element_blank())
And we get the stacked barplot with proportion.
A small note, this example is one of the rare examples where I prefer vertically stacked over barplots with bars side by side.
Visualizing missing data counts in rows
So far we saw, the extent of missing data in each column of a dataframe. Another useful data exploration strategy is to quickly visualize the number of missing data in each row.
Here we use the same hawks dataset to compute the number missiung values in each row. The basic idea is the same, but we need to add unique row ID for each row. Here we manually add row number and to compute the number of missing values per row.
hawks %>% summarise_all(list(~is.na(.))) %>% mutate(row_id=row_number()) %>% pivot_longer(-row_id, names_to = "variables", values_to="missing") %>% count(row_id, missing) %>% ggplot(aes(y=as.factor(row_id),x=n,fill=missing))+ geom_col()+ scale_fill_manual(values=c("skyblue3","gold"))+ theme(axis.text.y=element_blank())+ labs(y="row_id")