Create Year-Month Column from Dates

Import Modules

In [8]:
import pandas as pd
import datetime
import random

Why Create a Year-Month Numerical Column

You'll often be asked to calculate an aggregate metric per month, such as sales per month. To do that, create a new column that holds a year-month combination and then sum sales for each year-month value.
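As a rough preview of that idea, here is a minimal sketch of sales per month. The DataFrame and its column names (sold_at, amount, sold_at_year_month) are made up for illustration and are not part of the signup example later in this tutorial.

```python
import pandas as pd

# Hypothetical sales data; the column names are assumptions for illustration.
sales = pd.DataFrame({
    'sold_at': pd.to_datetime(['2012-05-03', '2012-05-20', '2012-08-11',
                               '2012-08-29', '2012-08-30']),
    'amount': [100.0, 50.0, 20.0, 30.0, 10.0],
})

# Build a zero-padded year-month string such as "2012-05" for each sale.
sales['sold_at_year_month'] = sales['sold_at'].dt.strftime('%Y-%m')

# Sum the sale amounts within each year-month group.
sales_per_month = sales.groupby('sold_at_year_month')['amount'].sum().reset_index()
print(sales_per_month)
```

The rest of this tutorial walks through the same steps in more detail, counting rows per month instead of summing a column.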

I recommend formatting year-month as the four-digit year followed by the zero-padded month number. For example, activity in August 2012 should be written in Python as "2012-08". Why?

This format sorts correctly from oldest to newest or newest to oldest. If you format months by name, such as "August 2012" and "May 2012", Python compares the strings alphabetically and puts "August" before "May", which is wrong by the calendar. The comparison below evaluates to True:

In [9]:
"August 2012" < "May 2012"

Also, year must come before month because proper ordering of dates should start with year, then month, day, hour, minute, second, etc.

If we rewrite the comparison above with numeric year-month strings, it evaluates to False, which is correct because August 2012 does not occur before May 2012.

In [10]:
"2012-08" < "2012-05"
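One detail worth checking is zero-padding of the month. Strings are compared character by character, so an unpadded month such as "8" sorts after "10". The short sketch below uses made-up values to show the difference.

```python
# Without zero-padding, October ("10") sorts before August ("8")
# because "1" comes before "8" when comparing character by character.
unpadded_is_wrong = "2012-10" < "2012-8"

# With zero-padding, the comparison matches the calendar.
padded_is_right = "2012-08" < "2012-10"

unpadded_sorted = sorted(["2012-10", "2012-8", "2012-5"])
padded_sorted = sorted(["2012-10", "2012-08", "2012-05"])

print(unpadded_is_wrong, padded_is_right)
print(unpadded_sorted)
print(padded_sorted)
```

Only the padded list comes out in calendar order, which is why a two-digit month is the safer choice for a year-month column.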

Example: Calculate User Signups Per Month

Create dates

Let's assume we work for a software-as-a-service (SaaS) business that receives signups for our app. We will create random datetime values in increasing order to represent the times people signed up and store them in the list signup_datetimes.

In [11]:
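# Starting point for the signups. The original value is missing from this cell,
# so datetime.datetime.now() is an assumption.
signup_datetime = datetime.datetime.now()
signup_datetimes = []
total_signups = 60

for signup in range(total_signups):
    # Advance the clock by a random gap so each signup is later than the previous one.
    random_days = random.randint(1, 13)
    random_minutes = random.randint(1, 50)
    signup_datetime = signup_datetime + datetime.timedelta(days=random_days, minutes=random_minutes)
    signup_datetimes.append(signup_datetime)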

Create a DataFrame assigned to df with columns for time users signed up and a unique user id value for each signup.

In [12]:
df = pd.DataFrame({'user_created_at': signup_datetimes, 'user_id': range(0, total_signups)})

Preview df.

In [13]:
df.head()
              user_created_at  user_id
0  2019-10-07 15:11:40.354691        0
1  2019-10-18 15:40:40.354691        1
2  2019-10-24 15:56:40.354691        2
3  2019-10-28 16:28:40.354691        3
4  2019-11-08 17:00:40.354691        4

Create year-month column from date

Python datetime objects have a method called strftime(), short for "string format time". It takes a format string as an argument and returns the datetime rendered as a string in that format. The format string is built from directives, each of which extracts one part of the datetime, such as the year or the month. Popular directives are:

Directive  Meaning
%Y         Year with century as a decimal number
%m         Month as a decimal number
%d         Day of the month as a decimal number
%I         Hour (12-hour clock) as a decimal number [01, 12]
%M         Minute as a decimal number
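To see the directives in action on a single value, here is a small sketch using an arbitrary example timestamp (not taken from the signup data).

```python
import datetime

# An arbitrary example timestamp: 5 August 2012 at 2:07 PM.
example = datetime.datetime(2012, 8, 5, 14, 7)

# Combine directives with any separator characters you like.
year_month = example.strftime('%Y-%m')
full_date = example.strftime('%Y-%m-%d')
hour_minute_12h = example.strftime('%I:%M')

print(year_month)       # 2012-08
print(full_date)        # 2012-08-05
print(hour_minute_12h)  # 02:07
```

Note that %m and %d are zero-padded, so single-digit months and days come out as two characters.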

Below, I apply the pandas Series `dt.strftime()` method to the user_created_at datetime column to convert its values to strings in the %Y-%m format.

In [14]:
df['user_created_at_year_month'] = df['user_created_at'].dt.strftime('%Y-%m')

See the official documentation for the full list of datetime formatting directives.

Preview df.

In [15]:
df.head()
              user_created_at  user_id  user_created_at_year_month
0  2019-10-07 15:11:40.354691        0                     2019-10
1  2019-10-18 15:40:40.354691        1                     2019-10
2  2019-10-24 15:56:40.354691        2                     2019-10
3  2019-10-28 16:28:40.354691        3                     2019-10
4  2019-11-08 17:00:40.354691        4                     2019-11

Calculate user signups per month

I can group by user_created_at_year_month and count the rows in each group with the pandas size() method.

The result is a pandas Series, so I convert it to a DataFrame with the reset_index() method and then apply the rename() method to rename the newly created column to count_signups.

In [16]:
df.groupby('user_created_at_year_month').size().reset_index().rename(columns={0: 'count_signups'})
    user_created_at_year_month  count_signups
0                      2019-10              4
1                      2019-11              6
2                      2019-12              2
3                      2020-01              5
4                      2020-02              4
5                      2020-03              4
6                      2020-04              4
7                      2020-05              5
8                      2020-06              3
9                      2020-07              3
10                     2020-08              3
11                     2020-09              3
12                     2020-10              6
13                     2020-11              4
14                     2020-12              4

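The dates in df were ordered from earliest to latest, and the result of the group by operation follows the same order. Note that groupby() sorts the group keys by default, so the year-month values would come out in ascending order even if the original rows were not. If you need the rows of a DataFrame themselves in chronological order, sort its datetime values with the pandas sort_values() method.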
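As a minimal sketch of that sorting step, the example below uses a small, made-up DataFrame whose rows are deliberately out of chronological order. The data is hypothetical; the column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical signups listed out of chronological order.
unordered = pd.DataFrame({
    'user_created_at': pd.to_datetime(['2020-03-15', '2019-11-02',
                                       '2020-01-20', '2019-11-28']),
    'user_id': [0, 1, 2, 3],
})

# Sort rows from earliest to latest; pass ascending=False for latest first.
ordered = unordered.sort_values('user_created_at').reset_index(drop=True)

# Same year-month and count steps as above, on the sorted rows.
ordered['user_created_at_year_month'] = ordered['user_created_at'].dt.strftime('%Y-%m')
counts = ordered.groupby('user_created_at_year_month').size().reset_index(name='count_signups')
print(counts)
```

Here reset_index(name='count_signups') names the count column in one step, which is an alternative to calling rename() afterwards.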
