How To Connect Paired Data Points with Lines in Scatter Plot with Matplotlib

In this tutorial, we will learn how to connect paired data points with lines in a scatter plot using Matplotlib in Python. Connecting paired data points with lines helps show how the relationship between two variables changes with respect to a third variable. We will use Matplotlib’s scatter() function to make the scatter plot and its plot() function to add the lines between paired data points.

How To Connect Paired Data Points with Lines using Matplotlib in Python?

Let us load the packages needed.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

To make a scatter plot with lines connecting paired data points, we will use the gapminder data set. We will load it directly from datavizpyr.com’s GitHub page.

p2data = "https://raw.githubusercontent.com/datavizpyr/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)
gapminder.head()

country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

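Here, we will focus on a subset of the data to create paired data. A typical example of paired data is the same measurement taken at two time points. We keep only the Asian countries in the years 1952 and 2007, so each country contributes one pair of observations.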

df = gapminder.query('year in [1952,2007] & continent =="Asia"')
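
Before drawing any lines, it can be worth checking that the subset really is paired, that is, every country appears in both years. The following is a minimal sketch on a small made-up data frame (the `toy` data and the variable names are assumptions for illustration, not part of the gapminder data); the same check could be applied to df.

```python
import pandas as pd

# Hypothetical toy data: country "C" is missing its 2007 row.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B", "C"],
    "year": [1952, 2007, 1952, 2007, 1952],
})

# Count the distinct years available for each country.
counts = toy.groupby("country")["year"].nunique()

# Countries that do not have exactly two years are not paired.
unpaired = counts[counts != 2].index.tolist()
print(unpaired)
```

An empty list means every country has both time points and can be connected with a line.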

Simple Line Plot with Data Points in Matplotlib

We have lifeExp and gdpPercap values for Asian countries in two different years. Note that we are not trying to create a simple line plot that connects all the data points in the scatter plot one after another. Such a plot is easy to make with Matplotlib’s plot() and scatter() functions, as shown below.

plt.plot(df.gdpPercap, 
         df.lifeExp,
         color='gray', 
         zorder=-1)
plt.scatter(df.gdpPercap, 
            df.lifeExp,
            s=120)
plt.xscale("log")

Because the line passes through every point in row order, the resulting scatter plot looks like “spaghetti”.

Scatter plot in Matplotlib

What we are interested in is understanding how the relationship between the two quantitative variables on the x- and y-axes of the scatter plot changes over time. To do that, we aim to connect each data point at one time point with the corresponding data point at the next time point. The key idea here is that we have paired data points.

Let us start with a scatter plot and color the points by the third variable, year.

colors = {1952:'green', 2007:'orange'}
plt.scatter(df.gdpPercap, 
            df.lifeExp,
            s=150,
            c=df.year.map(colors))
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("scatterplot_point_colored_by_variable_matplotlib_Python.png",
                    format='png',dpi=150)

Matplotlib Scatterplot colored by variable
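
The coloring step works by mapping each year to a color name with a dictionary. As a rough sketch, assuming a short made-up Series of years (the `years` Series and the legend `handles` are illustrative names, not part of the original code), the mapping and a matching legend could look like this:

```python
import pandas as pd
from matplotlib.lines import Line2D

colors = {1952: "green", 2007: "orange"}

# Hypothetical years column standing in for df.year.
years = pd.Series([1952, 2007, 2007, 1952])

# Series.map looks up each year in the dictionary and returns its color.
point_colors = years.map(colors)
print(point_colors.tolist())

# Empty "proxy" markers, one per year, that can be passed to
# plt.legend(handles=handles) to explain the colors.
handles = [
    Line2D([], [], marker="o", linestyle="", color=c, label=str(yr))
    for yr, c in colors.items()
]
```

A year that is missing from the dictionary would map to NaN, so the dictionary should cover every year present in the data.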

Plot connecting two coordinates with lines in Matplotlib

One of the key features of the plot is the line connecting each pair of data points. First, let us make a plot without points, with only a line connecting the locations of each pair. For example, if we have (x1,y1) and (x2,y2) from the same country for two years, we need to draw a line between them.

Let us create NumPy arrays with the X and Y coordinates of these paired data points.

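# Row 0 holds the 1952 values and row 1 the 2007 values;
# each column corresponds to one country.
X_coords = np.array([df.query('year==1952').gdpPercap,
                     df.query('year==2007').gdpPercap])
Y_coords = np.array([df.query('year==1952').lifeExp,
                     df.query('year==2007').lifeExp])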

Our arrays are two-dimensional and look like this:

X_coords[0:2,0:3]
array([[  779.4453145,  9867.084765 ,   684.2441716],
       [  974.5803384, 29796.04834  ,  1391.253792 ]])

Y_coords[0:2,0:3]
array([[28.801, 50.939, 37.484],
       [43.828, 75.635, 64.062]])
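
Stacking the two years like this relies on the rows for 1952 and 2007 listing the countries in the same order, which holds for this data set. If that is not guaranteed, one possible alternative is to reshape with pandas’ pivot(), which aligns the values by country. This is only a sketch on made-up data (`toy`, `X_toy` and `Y_toy` are hypothetical names), not the approach used in the rest of this post.

```python
import pandas as pd

# Hypothetical toy data where the 2007 rows are in a different order.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "C", "B", "A"],
    "year": [1952, 1952, 1952, 2007, 2007, 2007],
    "gdpPercap": [100.0, 200.0, 300.0, 900.0, 800.0, 700.0],
    "lifeExp": [30.0, 40.0, 50.0, 75.0, 70.0, 65.0],
})

# One row per year, one column per country; pivot aligns by country name.
wide_x = toy.pivot(index="year", columns="country", values="gdpPercap")
wide_y = toy.pivot(index="year", columns="country", values="lifeExp")

# 2 x n_countries arrays, the same layout as X_coords and Y_coords.
X_toy = wide_x.to_numpy()
Y_toy = wide_y.to_numpy()
print(X_toy)
```

Note that pivot() raises an error if a country has more than one row for the same year, which is a useful safeguard for paired data.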

One of the lesser-known features (to me :-)) of Matplotlib’s plot() function is that it can take two-dimensional arrays as input, in which case it draws one line per column. Here we specify the color of the lines to be gray.

plt.figure(figsize=(8,6))
plt.plot(X_coords, 
         Y_coords, 
         color='gray')
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("paired_lineplot_matplotlib_Python.png",
                    format='png',dpi=150)

We get a line plot connecting the paired data points.
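
To see why the two-dimensional input produces one segment per pair, here is a small sketch with made-up coordinates (the arrays `x` and `y` are assumptions for illustration). Each column of the arrays becomes its own line, so a 2 x 3 input gives three separate two-point lines.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Row 0: first time point, row 1: second time point; 3 hypothetical pairs.
x = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
y = np.array([[10.0, 20.0, 30.0],
              [15.0, 25.0, 35.0]])

fig, ax = plt.subplots()
lines = ax.plot(x, y, color="gray")  # returns one Line2D object per column

n_lines = len(lines)
first_x = list(lines[0].get_xdata())  # x-values of the first pair
first_y = list(lines[0].get_ydata())  # y-values of the first pair
print(n_lines, first_x, first_y)
plt.close(fig)
```

The first line runs from (1, 10) to (4, 15), that is, from the first column of row 0 to the first column of row 1.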

Scatterplot with lines between paired datapoints

Now we are ready to combine the scatter plot colored by year with the paired line plot to get a scatter plot in which paired data points are connected with lines.

plt.figure(figsize=(8,6))
plt.plot(X_coords, 
         Y_coords, 
         color='gray')
colors = {1952:'green', 2007:'orange'}
plt.scatter(df.gdpPercap, 
            df.lifeExp,
            s=150,
            c=df.year.map(colors))
plt.xscale("log")
plt.xlabel("gdpPercap", size=24)
plt.ylabel("lifeExp", size=24)
plt.savefig("Connecting_paired_points_scatterplot_matplotlib_Python.png",
            format='png',dpi=150)

We get a fantastic-looking plot that shows how each country has fared over time with respect to GDP per capita and life expectancy.

Connect Paired Points with Lines in Matplotlib

The idea of connecting paired data points can be useful in many places and in slightly different contexts. We hope to show some examples in later posts.