Create Year-Month Column from Dates

Date published: 2019-10-04

Category: Data Analysis

Subcategory: Data Wrangling

Tags: datetime, python, pandas, year-month

Import Modules

In [8]:

import pandas as pd
import datetime
import random

Why Create a Year-Month Numerical Column

Often, you'll be asked to compute an aggregate metric per month, for example total sales per month. To do that, you create a new column holding a year-month combination and then sum sales within each year-month value.
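As a minimal sketch of that workflow, the snippet below sums sales per year-month. The `sales` DataFrame, its column names, and its values are invented for illustration; the `%Y-%m` format string it relies on is explained further down.

```python
import pandas as pd

# Hypothetical sales data, invented for this illustration.
sales = pd.DataFrame({
    'sold_at': pd.to_datetime(['2012-05-03', '2012-05-20', '2012-08-11', '2012-08-30']),
    'amount': [100, 50, 70, 30],
})

# Build the year-month key, then sum sales within each key.
sales['sold_at_year_month'] = sales['sold_at'].dt.strftime('%Y-%m')
sales_per_month = sales.groupby('sold_at_year_month')['amount'].sum()
print(sales_per_month)
```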

I recommend formatting year-month with the four-digit year first, followed by the month as a zero-padded two-digit number. For example, activity in August 2012 should be represented in Python as "2012-08". Why?

This format orders correctly from oldest to newest or newest to oldest. If you format months with a name, such as "August 2012" and "May 2012", Python compares the strings alphabetically and places "August" before "May", which is incorrect by the calendar. The comparison below evaluates to True:

In [9]:

"August 2012" < "May 2012"

Out[9]:

True

Also, year must come before month because proper ordering of dates should start with year, then month, day, hour, minute, second, etc.

If we rewrite the values above in the numeric year-month format, the comparison evaluates to False, which is correct because August 2012 does not occur before May 2012.

In [10]:

"2012-08" < "2012-05"

Out[10]:

False
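One caveat worth checking: string comparison only matches calendar order when the month is zero-padded to two digits. The short sketch below, using made-up labels, sorts both variants to show the difference.

```python
# Month labels without zero-padding sort incorrectly as strings,
# because "1" (from "10") compares lower than "5" and "8".
unpadded = ["2012-8", "2012-10", "2012-5"]
padded = ["2012-08", "2012-10", "2012-05"]

sorted_unpadded = sorted(unpadded)
sorted_padded = sorted(padded)

print(sorted_unpadded)  # ['2012-10', '2012-5', '2012-8']
print(sorted_padded)    # ['2012-05', '2012-08', '2012-10']
```

The `%m` directive used with strftime() later in this post produces the zero-padded form.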

Example: Calculate User Signups Per Month

Create dates

Let's assume we work for a software as a service (SaaS) business that receives signups for our app. We will create random datetime values in increasing order to represent data for the times people signed up and assign those values to the list signup_datetimes.

In [11]:

# Seed the random number generator so the random offsets are reproducible.
random.seed(18)
# Start from the current time; each signup is placed after the previous one.
signup_datetime = datetime.datetime.today()
signup_datetimes = []
total_signups = 60

for signup in range(total_signups):
    # Advance by a random number of days and minutes so values keep increasing.
    random_days = random.randint(1, 13)
    random_minutes = random.randint(1, 50)
    signup_datetime = signup_datetime + datetime.timedelta(days=random_days, minutes=random_minutes)
    signup_datetimes.append(signup_datetime)

Create a DataFrame assigned to df with columns for time users signed up and a unique user id value for each signup.

In [12]:

df = pd.DataFrame({'user_created_at': signup_datetimes, 'user_id': range(0, total_signups)})

Preview df.

In [13]:

df.head()

Out[13]:

	user_created_at	user_id
0	2019-10-07 15:11:40.354691	0
1	2019-10-18 15:40:40.354691	1
2	2019-10-24 15:56:40.354691	2
3	2019-10-28 16:28:40.354691	3
4	2019-11-08 17:00:40.354691	4

Create year-month column from date

Python datetime objects have a method called strftime(), short for "string format time". It takes a format string that describes how to render the datetime. Popular directives, the codes that extract a year, month, and so on, are:

Directive	Meaning
`%Y`	Year with century as a decimal number
`%m`	Month as a decimal number
`%d`	Day of the month as a decimal number
`%I`	Hour (12-hour clock) as a decimal number [01, 12]
`%M`	Minute as a decimal number
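To make the directives concrete, here is a small sketch that applies each one to a single datetime object. The timestamp is arbitrary and chosen only for illustration; `%H` (24-hour clock) is included for contrast with `%I` and is not in the table above.

```python
import datetime

# An arbitrary fixed timestamp, chosen only for illustration.
example = datetime.datetime(2012, 8, 5, 14, 30)

year_month = example.strftime('%Y-%m')  # '2012-08'
day = example.strftime('%d')            # '05'
hour_12 = example.strftime('%I')        # '02' (12-hour clock)
hour_24 = example.strftime('%H')        # '14' (24-hour clock)
minute = example.strftime('%M')         # '30'

print(year_month, day, hour_12, hour_24, minute)
```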

Below, I apply the Pandas Series dt.strftime() method to the user_created_at datetime column to convert values to strings in the %Y-%m format.

In [14]:

df['user_created_at_year_month'] = df['user_created_at'].dt.strftime('%Y-%m')

See the full list of strftime() format directives on this official documentation page.

Preview df.

In [15]:

df.head()

Out[15]:

	user_created_at	user_id	user_created_at_year_month
0	2019-10-07 15:11:40.354691	0	2019-10
1	2019-10-18 15:40:40.354691	1	2019-10
2	2019-10-24 15:56:40.354691	2	2019-10
3	2019-10-28 16:28:40.354691	3	2019-10
4	2019-11-08 17:00:40.354691	4	2019-11
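As an aside, pandas also offers a period-typed alternative to formatted strings. This is a different technique from the strftime() approach used in this post, sketched here on made-up dates for comparison only.

```python
import pandas as pd

# Hypothetical dates, invented for this comparison.
dates = pd.Series(pd.to_datetime(['2019-10-07', '2019-11-08', '2020-01-15']))

# The approach used in this post: plain strings such as '2019-10'.
as_strings = dates.dt.strftime('%Y-%m')

# Alternative: Period values with monthly frequency (dtype period[M]).
as_periods = dates.dt.to_period('M')

print(as_strings.tolist())
print(as_periods.astype(str).tolist())
```

Both columns print the same labels here; the string version is easier to export, while the period version keeps date semantics.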

Calculate user signups per month

I can group by user_created_at_year_month and count the number of rows in each group using the size() method in Pandas.

Then, I convert the resulting Pandas Series to a DataFrame using the reset_index() method and apply the rename() method to rename the newly created column to count_signups.

In [16]:

df.groupby('user_created_at_year_month').size().reset_index().rename(columns={0: 'count_signups'})

Out[16]:

	user_created_at_year_month	count_signups
0	2019-10	4
1	2019-11	6
2	2019-12	2
3	2020-01	5
4	2020-02	4
5	2020-03	4
6	2020-04	4
7	2020-05	5
8	2020-06	3
9	2020-07	3
10	2020-08	3
11	2020-09	3
12	2020-10	6
13	2020-11	4
14	2020-12	4

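Since the dates in df were in order from earliest to latest, the result of the group by operation follows the same order. In addition, groupby() sorts the group keys by default, and zero-padded year-month strings sort chronologically, so the result would be ordered this way even if the original dates were out of order. To order the rows of a DataFrame by their datetime values, use the Pandas sort_values() method.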
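As a small check of the ordering behavior, the sketch below starts from dates that are deliberately out of order. The `signups` DataFrame is hypothetical, and reset_index(name=...) is used as a shorter alternative to the rename() step above.

```python
import pandas as pd

# Hypothetical signup dates, deliberately out of chronological order.
signups = pd.DataFrame({
    'user_created_at': pd.to_datetime(
        ['2020-03-15', '2019-10-07', '2020-03-02', '2019-12-24']
    ),
})
signups['user_created_at_year_month'] = signups['user_created_at'].dt.strftime('%Y-%m')

# groupby() sorts group keys by default (sort=True), so the counts come
# back in year-month order; reset_index(name=...) names the count column.
counts = (
    signups.groupby('user_created_at_year_month')
    .size()
    .reset_index(name='count_signups')
)
print(counts)

# If the row order itself matters, sort the rows by the datetime column.
signups_sorted = signups.sort_values('user_created_at')
```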
