Density Plots with Pandas in Python

Density Plot on log-scale with Pandas

Pandas’ plot function makes it quick to draw a variety of plots, including density plots and boxplots. In this post, we will see examples of making simple density plots using Pandas’ plot.density() function in Python.

Let us first load the packages needed.

# import pandas
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt

We will use data from the 2019 Stack Overflow developer survey. The survey data has been processed and is available from datavizpyr.com’s GitHub page.

stackoverflow_file = "https://raw.githubusercontent.com/datavizpyr/data/master/SO_data_2019/StackOverflow_survey_filtered_subsampled_2019.csv"
# load Stack Overflow survey data
survey = pd.read_csv(stackoverflow_file)
survey.head()

  CompTotal	Gender	Manager	YearsCode	Age1stCode	YearsCodePro	Education
0	180000.0	Man	IC	25	17	20	Master's
1	55000.0	Man	IC	5	18	3	Bachelor's
2	77000.0	Man	IC	6	19	2	Bachelor's
3	67017.0	Man	IC	4	20	1	Bachelor's
4	90000.0	Man	IC	6	26	4	Less than bachelor's

Let us subset the data to contain education and annual salary information. We will also filter out rows with a low annual salary (25,000 or less).

salary = survey[['CompTotal','Education']].dropna()
salary = salary.query("CompTotal > 25000")
# save the data to a file
salary.to_csv("2019_Stack_Overflow_Survey_Education_Salary_US.tsv", sep="\t", index=False)
salary.head()


CompTotal	Education
0	180000.0	Master's
1	55000.0	Bachelor's
2	77000.0	Bachelor's
3	67017.0	Bachelor's
4	90000.0	Less than bachelor's
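
The two-step subsetting above can be tried out on a tiny frame first. The sketch below uses made-up rows (not the survey data) and shows that a boolean mask gives the same result as query(); the names toy, subset and filtered are only for illustration.

```python
import pandas as pd

# Hypothetical toy rows standing in for the survey data
toy = pd.DataFrame({
    "CompTotal": [180000.0, 12000.0, None, 67017.0],
    "Education": ["Master's", "Bachelor's", "Bachelor's", None],
})

# Step 1: keep the two columns and drop rows with a missing value in either
subset = toy[["CompTotal", "Education"]].dropna()

# Step 2: keep rows above the salary threshold with a boolean mask;
# this selects the same rows as subset.query("CompTotal > 25000")
filtered = subset[subset["CompTotal"] > 25000]

print(filtered)
```

Only the first toy row survives here: two rows are dropped for missing values and one for falling below the threshold.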

Basic Density Plot with Pandas Using plot.density()

Let us first make a simple density plot to see the distribution of developer salaries in the US using Pandas. We will call the density() method on the Series’ plot accessor to make the density plot.

salary.CompTotal.plot.density(figsize=(8,6),
                              fontsize=14,
                              xlim=(10000,1e6),
                              linewidth=4)
plt.xlabel("Salary in US",size=16)
plt.savefig("Simple_density_plot_with_Pandas_Python.jpg")

In this simple density plot, we specify the thickness of the density line, the x-axis limits, and the font size. Because of a few outliers with very large annual salaries, the distribution is right-skewed: most of the density sits at the left of the plot, with a long tail stretching to the right.

Simple Density Plot with Pandas
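
A long right tail can also be checked numerically. The following is a minimal sketch on a synthetic log-normal sample (an assumption chosen only because it is right-skewed, not the survey data); with the real data one could call salary.CompTotal.skew() in the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample (log-normal), NOT the survey data
rng = np.random.default_rng(0)
toy_salary = pd.Series(rng.lognormal(mean=11.0, sigma=0.6, size=5000))

# Sample skewness: positive values indicate a longer right tail
skewness = toy_salary.skew()

# In a right-skewed sample the mean is pulled above the median
mean_salary = toy_salary.mean()
median_salary = toy_salary.median()

print(f"skewness: {skewness:.2f}")
print(f"mean: {mean_salary:.0f}, median: {median_salary:.0f}")
```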

Density Plot on log-scale with Pandas

A better way to make the density plot is to put the x-axis on a log scale. On a log scale the long right tail we see here is compressed, so the bulk of the distribution is easier to see.

We can switch the x-axis to a log scale by passing logx=True as an argument to the plot.density() function.

salary.CompTotal.plot.density(figsize=(8,6),
                              logx=True,
                              fontsize=14,
                              xlim=(10000,1e6),
                              linewidth=4)
plt.xlabel("Salary in US",size=16)
plt.savefig("density_plot_with_log_scale_Pandas_Python.jpg")

Now our density plot on log scale is much easier to read than the density plot on the original scale.

Density Plot on log-scale with Pandas
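
To see why a log axis compresses the long tail, it may help to look at where a few salary values land on each scale. This small sketch uses the x-axis limits from the plots above (10000 and 1e6) plus the midpoint 1e5; the variable names are only illustrative.

```python
import numpy as np

# The x-axis limits used above, plus one value in between
salaries = np.array([1e4, 1e5, 1e6])

# Gaps between neighbours on the original (linear) scale
linear_gaps = np.diff(salaries)

# Gaps between neighbours on a log10 scale
log_gaps = np.diff(np.log10(salaries))

print(linear_gaps)
print(log_gaps)
```

On the linear scale the second gap is ten times the first, while on the log scale each tenfold increase takes up the same distance, which is why the high-salary tail no longer dominates the plot.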

Density Plot with Pandas Using plot.kde()

In addition to the plot.density() function, Pandas also has the plot.kde() function for making density plots; plot.density() is an alias for plot.kde(), so the two behave identically. KDE stands for kernel density estimation, a non-parametric technique for estimating the probability density function of a variable.
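
To make the idea of kernel density estimation concrete, here is a minimal NumPy sketch of a Gaussian KDE: the estimate at each grid point is the average of Gaussian bumps centered on the samples. The helper gaussian_kde_sketch, the toy samples and the fixed bandwidth of 0.5 are all assumptions for illustration. Pandas itself relies on SciPy for the estimate and chooses the bandwidth automatically unless bw_method is given.

```python
import numpy as np

def gaussian_kde_sketch(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (illustrative helper)."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardized distance from every grid point to every sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at those distances
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average over samples and rescale by the bandwidth
    return kernels.mean(axis=1) / bandwidth

# Hypothetical toy samples and an arbitrary fixed bandwidth
samples = np.array([1.0, 2.0, 2.5, 4.0])
grid = np.linspace(-5, 11, 801)
density = gaussian_kde_sketch(samples, grid, bandwidth=0.5)

# A density should integrate to roughly 1 (simple Riemann sum)
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```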

We can use the plot.kde() function to make the same density plot on log scale that we made with the plot.density() function.

salary.CompTotal.plot.kde(figsize=(8,6),
                          logx=True,
                          fontsize=14,
                          xlim=(10000,1e6),
                          linewidth=4)
plt.xlabel("Salary in US",size=16)
plt.savefig("density_plot_with_log_scale_using_kde_Pandas_Python.jpg")

Density Plot with Pandas kde() function
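
The kde() and density() plots look the same because, in Pandas, density is defined as an alias of kde on the plot accessor. A quick way to confirm this for an installed version is sketched below; it only inspects the accessor and draws nothing.

```python
import pandas as pd

# Accessing .plot on the Series class returns the plot accessor class itself
accessor = pd.Series.plot

# Assumption: density is bound to the very same function object as kde;
# worth re-checking against the pandas version you have installed
same_method = accessor.density is accessor.kde

print(same_method)
```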