Data Visualizations Best Practices Tutorial

Visualize Historical Time Comparisons

Often times with datasets, we want to understand the changes in some metric over time.

For example, if your business recorded sales each day, we may want a comparison of the sales value today versus yesterday, and perhaps this month versus previous months. This is an important comparison because it would help the business understand if sales are growing over time or declining.

There are various types of plots we can use for historical comparisons. I'll detail a few below.

Import Modules

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
% matplotlib inline

Set Global Styles for Visualizations

sns.set_context('talk')
sns.set_style("darkgrid")

Example: Historical Flights Data

In the examples below, I utilize the dataset flights provided in the Seaborn visualization library.

Each month, from 1949 to 1960, there was a record of the count of passengers on all flights.

Load flights Dataset

df_flights = sns.load_dataset('flights')

Preview the First 5 Rows of Data

df_flights.head()
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121

Change month Column to Show Short Month Names

Create a Python dictionary that's a mapping of a month's long name to its short name.

month_long_to_short = {"January": "Jan",
                       "February": "Feb",
                       "March": "Mar",
                       "April": "Apr",
                       "May": "May",
                       "June": "Jun",
                       "July": "Jul",
                       "August": "Aug",
                       "September": "Sep",
                       "October": "Oct",
                       "November": "Nov",
                       "December": "Dec"
                      }

Create a new column called month that just shows each month's short name. This will make it easier to display the month's value on a visualization.

df_flights['month'] = df_flights['month'].map(month_long_to_short)

Create New Column of year_month

Convert year and month columns to be of string type

df_flights['year'] = df_flights['year'].astype(str)
df_flights['month'] = df_flights['month'].astype(str)

Concatenate columns of month and year with a space inbetween to create a new column called month_year

df_flights['month_year'] = df_flights['month'] + " - " + df_flights['year']

Visualize Historical Count of Passengers Per Month

This visualization below shows the change over time in count of passengers per month.

This visualization can be helpful if we'd like to see a general trend month by month. However, with flights, I'd expect a lot of seasonality in which there's lots of passengers flying on Thanksgiving, Christmas and summer months.

This visualization below is not ideal to see the seasonality. It's difficult to compare December of 1950 with December of 1951. The bars are so far apart.

This visualization also has lots of x-tick values that seemed crammed next to one another. I think there are better ways to visualize this data that we'll explore further.

df_flights[df_flights['year']<'1951'].set_index('month_year')['passengers'].plot(kind='bar',
                                                                                figsize=(19, 10),
                                                                                 color='cadetblue',
                                                                                rot=45)
plt.title("Historical Count of Passengers Per Month", y=1.013, fontsize=22)
plt.ylabel("Count [Passengers]", labelpad=16)
plt.xlabel("Date [Month - Year]", labelpad=16);

png

Interpretation of Historical Count of Passengers Per Month

For both 1949 and 1950, the months with the most passengers were July and August. It's likely people traveled more during those times because students had summer breaks and parents could easily travel with their kids. Also, July and August are typically warm summer months - ideal for vacation to tropical places.

Visualize Historical Count of Passengers by Month: Month by Month Comparison as Bar Chart

I want to create a visualization to more easily compare the count of passengers by months of various years.

Therefore, I will pivot my data to the format below so I can create a visualization and see bars each month of several years next to one another. I should be able to easily see January's count of passengers in 1949, 1950 and 1951 all next to one another.

months_in_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
df_flights_pivoted = df_flights[df_flights['year']<'1952'].pivot(index='month', columns='year', values='passengers').reindex(months_in_order)
df_flights_pivoted
year 1949 1950 1951
month
Jan 112 115 145
Feb 118 126 150
Mar 132 141 178
Apr 129 135 163
May 121 125 172
Jun 135 149 178
Jul 148 170 199
Aug 148 170 199
Sep 136 158 184
Oct 119 133 162
Nov 104 114 146
Dec 118 140 166

With this visualization below, we can now easily compare the count of passengers for each specific month of 1949, 1950 and 1951.

df_flights_pivoted.plot(kind='bar', figsize=(17, 10), color=['lightgray', 'gray', 'black'], rot=0)                                       
plt.title("Historical Count of Passengers: Month by Month Comparison", y=1.013, fontsize=22)
plt.xlabel("Date [Month]", labelpad=16)
plt.ylabel("Count [Passengers]", labelpad=16);

png

Interpretation of Historical Count of Passengers: Month by Month Comaprison

There was an increase in the count of passengers by each month every year from 1949 to 1951.

July and August are the months with typically the most passengers every year.

Visualize Historical Count of Passengers by Month: Month by Month Comparison as Line Plot

This visualization below is similar to the one above in that we easily compare counts of passengers by month.

This line plot works well because we have no intersections between lines.

df_flights_pivoted.plot(kind='line', figsize=(17, 10), color=['lightgray', 'gray', 'black'])
plt.title("Historical Count of Passengers: Month by Month Comaprison", y=1.013, fontsize=22)
plt.xlabel("Date [Month]", labelpad=16)
plt.ylabel("Count [Passengers]", labelpad=16);                                                                      

png

Visualize Cumulative Sum of Passengers Each Year

Below we plot the running total of passengers on flights, by each year, with each year starting in January.

df_flights_pivoted.cumsum().plot(kind='line', figsize=(17, 10), color=['lightgray', 'gray', 'black'])
plt.title("Cumulative Sum of Passengers Per Year", y=1.013, fontsize=22)
plt.xlabel("Date [Month]", labelpad=16)
plt.ylabel("Count [Passengers]", labelpad=16);

png

Interpretation of Cumulative Sum of Passengers Per Year

Each successive year, there is an increase in the count of passengers on flights at any point in time.

There was a larger increase in total passengers who had flown year-to-date from 1951 to 1950 than 1950 to 1949 at any point in time.

Heatmap of Passenger Counts on Flights

A heatmap is also ideal for this type of data.

Below, we can easily see the change in count of passengers per month over the years. We can also easily see the change of passengers per month for a single year or over multiple years.

figure, axes = plt.subplots(figsize=(10, 11))
sns.heatmap(df_flights_pivoted, annot=True, fmt="d", linewidths=.5, ax=axes, cmap="Greys")
axes.axes.set_title("Count of Passengers on Flights by Month and Year", fontsize=20, y=1.01)
axes.axes.set_ylabel("month", labelpad=50, rotation=0)
axes.axes.set_xlabel("year", labelpad=16);
plt.yticks(rotation=0);

png

Interpretation of Heatmap of Passenger Counts on Flights

This visualization helps detail flights as a cyclical business. Typically, the summer months of June, July and August and the first fall month of September have the most passengers on flights.

March and December are also common peaks for count of passengers compared to other months.