In this tutorial, we will learn how to connect paired data points with lines in a scatter plot using Matplotlib in Python. Adding lines between paired data points can be very helpful in understanding how the relationship between two variables changes with respect to a third variable. We will mainly use Matplotlib’s plot() and scatter() functions to make the scatter plot and add lines between the paired data points.
Let us load the packages needed.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
To make a scatter plot with lines connecting paired data points, we will use the gapminder data set, which we will load directly from datavizpyr.com’s GitHub page.
p2data = "https://raw.githubusercontent.com/datavizpyr/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)
gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
Here, we will focus on a subset of the data to create paired data. A typical example of paired data is the same measurement taken at two time points. We keep the Asian countries in the years 1952 and 2007.
df = gapminder.query('year in [1952,2007] & continent =="Asia"')
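Before treating the subset as paired data, it can be worth checking that every country really appears in both years. The following is a minimal sketch of such a check on a small made-up data frame (the name toy and its values are assumptions for illustration, not part of the gapminder data); the same groupby() call should work on df.

```python
import pandas as pd

# Toy stand-in for the gapminder subset (values are made up for illustration)
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [1952, 2007, 1952, 2007, 1952],
    "lifeExp": [30.0, 45.0, 50.0, 75.0, 40.0],
})

# Count how many rows each country contributes
rows_per_country = toy.groupby("country").size()

# A country without exactly two observations cannot form a pair
unpaired = rows_per_country[rows_per_country != 2].index.tolist()
print(unpaired)
```

An empty list means every country has one observation for each of the two years.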
Simple Line Plot with Data points in Matplotlib
We have lifeExp and gdpPercap values for Asian countries in two different years. Note that a simple line plot connecting all the data points in the scatter plot is not what we are after. Such a plot is easy to make with Matplotlib’s plot() and scatter() functions, as shown below.
plt.plot(df.gdpPercap, df.lifeExp, color='gray', zorder=-1)
plt.scatter(df.gdpPercap, df.lifeExp, s=120)
plt.xscale("log")
Because plot() joins the points in the order in which they appear in the data frame, our scatter plot with connected lines looks like “spaghetti”.
Scatter plot in Matplotlib
What we are interested in is understanding how the relationship between the two quantitative variables on the x- and y-axes of the scatter plot changes over time. To do that, we aim to connect each data point at one time point with the corresponding data point at the next time point. The key idea here is that we have paired data points.
Let us start with a scatter plot and color the points by the third variable, year.
colors = {1952:'green', 2007:'orange'}
plt.scatter(df.gdpPercap, df.lifeExp, s=150, c=df.year.map(colors))
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("scatterplot_point_colored_by_variable_matplotlib_Python.png", format='png', dpi=150)
Plot connecting two coordinates with lines in Matplotlib
One of the key features of the plot is the line connecting each pair of data points. First, let us make a plot without points that only connects the locations of the paired data points with lines. For example, if we have (x1,y1) and (x2,y2) from the same country for two years, we need to add a line between them.
Let us create NumPy arrays with the X and Y coordinates of these paired data points.
X_coords = np.array([df.query('year==1952').gdpPercap,
                     df.query('year==2007').gdpPercap])
Y_coords = np.array([df.query('year==1952').lifeExp,
                     df.query('year==2007').lifeExp])
Our arrays are two-dimensional, with one row per year and one column per country, and they look like this:
X_coords[0:2,0:3]
array([[  779.4453145,  9867.084765 ,   684.2441716],
       [  974.5803384, 29796.04834  ,  1391.253792 ]])
Y_coords[0:2,0:3]
array([[28.801, 50.939, 37.484],
       [43.828, 75.635, 64.062]])
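The arrays above pair the two years by row position, so they rely on the countries appearing in the same order in both years. As a hedged alternative, the pairing can be made explicit by reshaping the data with pandas pivot() so that each country becomes one row. The sketch below uses a made-up data frame (toy and its values are assumptions for illustration) with the rows deliberately out of order.

```python
import numpy as np
import pandas as pd

# Toy data with countries deliberately out of order (values are made up)
toy = pd.DataFrame({
    "country": ["B", "A", "A", "B"],
    "year": [1952, 1952, 2007, 2007],
    "gdpPercap": [500.0, 800.0, 1000.0, 3000.0],
    "lifeExp": [50.0, 30.0, 45.0, 75.0],
})

# One row per country, one column per year
wide_x = toy.pivot(index="country", columns="year", values="gdpPercap")
wide_y = toy.pivot(index="country", columns="year", values="lifeExp")

# Transpose so that row 0 is 1952, row 1 is 2007, and each column is a country
X_coords = wide_x[[1952, 2007]].to_numpy().T
Y_coords = wide_y[[1952, 2007]].to_numpy().T
print(X_coords)
print(Y_coords)
```

With this layout, each column holds the two observations of a single country regardless of how the rows were ordered in the original data frame.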
One of the lesser known features (to me :-)) of Matplotlib’s plot() function is that it can take two-dimensional arrays as input and draw a separate line for each column. Here we specify the color of the lines to be gray.
plt.figure(figsize=(8,6))
plt.plot(X_coords, Y_coords, color='gray')
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("paired_lineplot_matplotlib_Python.png", format='png', dpi=150)
We get a line plot connecting the paired data points.
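To see in isolation how plot() treats two-dimensional input, here is a small sketch with made-up arrays (X, Y and their values are assumptions for illustration). Each column of the arrays becomes its own line, which is why a 2 x n array gives n two-point segments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Two rows (start and end of each segment), three columns (three segments)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
Y = np.array([[10.0, 20.0, 30.0],
              [15.0, 25.0, 35.0]])

fig, ax = plt.subplots()
lines = ax.plot(X, Y, color="gray")

# plot() returns one Line2D object per column
print(len(lines))
# The first line runs from (1, 10) to (4, 15)
print(lines[0].get_xdata(), lines[0].get_ydata())
```

The matplotlib.use("Agg") call is only there so the example runs in a script without a display; it can be dropped in a notebook.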
Scatterplot with lines between paired datapoints
Now we are ready to combine the scatter plot colored by year with the paired line plot to get a scatter plot in which the paired data points are connected by lines.
plt.figure(figsize=(8,6))
plt.plot(X_coords, Y_coords, color='gray')
colors = {1952:'green', 2007:'orange'}
plt.scatter(df.gdpPercap, df.lifeExp, s=150, c=df.year.map(colors))
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("Connecting_paired_points_scatterplot_matplotlib_Python.png", format='png', dpi=150)
We get a fantastic looking plot that shows how each country has fared over time with respect to GDP per capita and life expectancy.
The idea of connecting paired data points can be useful in many places and in slightly different contexts. We hope to see some of these examples in later posts.