Create Year-Month Column from Dates
Date published: 2019-10-04
Category: Data Analysis
Subcategory: Data Wrangling
Tags: datetime, python, pandas, year-month
Import Modules
import pandas as pd
import datetime
import random
Why Create a Year-Month Numerical Column
Often, you'll be asked to create an aggregate metric per month. Can you calculate sales per month? You'll have to create a new column for a year-month combination and then sum sales for each year-month combination.
I recommend formatting year-month with the year as a number first, followed by the month as a zero-padded two-digit number. For example, activity in August 2012 should shorten in Python to "2012-08". Why?
This format sorts correctly from oldest to newest or newest to oldest. If you format months by name, such as "August 2012" and "May 2012", Python compares the strings alphabetically and places "August" before "May", which is incorrect by the calendar. The comparison below evaluates to True:
"August 2012" < "May 2012"
True
Also, year must come before month because proper ordering of dates should start with year, then month, day, hour, minute, second, etc.
If we rewrite the strings above as zero-padded numbers, the comparison evaluates to False, which is correct because August 2012 does not occur before May 2012.
"2012-08" < "2012-05"
False
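One caveat is worth illustrating: Python compares strings character by character, so this ordering only holds when the month is zero-padded to two digits. The short sketch below uses made-up labels to compare unpadded and padded year-month strings.

```python
# Year-month labels compared as plain strings (illustrative values).
unpadded = ["2012-8", "2012-10", "2012-5"]
padded = ["2012-08", "2012-10", "2012-05"]

# Without zero-padding, "2012-10" sorts first because "1" < "5" < "8".
sorted_unpadded = sorted(unpadded)

# With zero-padding, string order matches calendar order.
sorted_padded = sorted(padded)

print(sorted_unpadded)  # ['2012-10', '2012-5', '2012-8']
print(sorted_padded)    # ['2012-05', '2012-08', '2012-10']
```

The %m directive used later in this post always produces a two-digit month, so it yields the padded form.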
Example: Calculate User Signups Per Month
Create dates
Let's assume we work for a software as a service (SaaS) business that receives signups for our app. We will create random datetime values in increasing order to represent the times people signed up and assign those values to the list signup_datetimes.
random.seed(18)
signup_datetime = datetime.datetime.today()
signup_datetimes = []
total_signups = 60
for signup in range(total_signups):
    random_days = random.randint(1, 13)
    random_minutes = random.randint(1, 50)
    signup_datetime = signup_datetime + datetime.timedelta(days=random_days, minutes=random_minutes)
    signup_datetimes.append(signup_datetime)
Create a DataFrame assigned to df with columns for the time each user signed up and a unique user id value for each signup.
df = pd.DataFrame({'user_created_at': signup_datetimes, 'user_id': range(0, total_signups)})
Preview df.
df.head()
| | user_created_at | user_id |
|---|---|---|
| 0 | 2019-10-07 15:11:40.354691 | 0 |
| 1 | 2019-10-18 15:40:40.354691 | 1 |
| 2 | 2019-10-24 15:56:40.354691 | 2 |
| 3 | 2019-10-28 16:28:40.354691 | 3 |
| 4 | 2019-11-08 17:00:40.354691 | 4 |
Create year-month column from date
Python datetime objects have a method called strftime(), short for "string format time". The method takes a format string as its argument and returns the datetime formatted as a string. Popular directives, the codes that extract a year, month, and so on, are:
Directive | Meaning |
---|---|
%Y | Year with century as a decimal number |
%m | Month as a decimal number |
%d | Day of the month as a decimal number |
%I | Hour (12-hour clock) as a decimal number [01, 12] |
%M | Minute as a decimal number |
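To make the directives concrete, here is a minimal sketch that applies strftime() to a single datetime object. The timestamp is a hypothetical value chosen for illustration, not one taken from the signup data.

```python
import datetime

# A fixed example timestamp (hypothetical value, chosen for illustration).
example = datetime.datetime(2012, 8, 5, 15, 7, 0)

year_month = example.strftime("%Y-%m")    # year and zero-padded month
full_date = example.strftime("%Y-%m-%d")  # adds the zero-padded day
clock_time = example.strftime("%I:%M")    # 12-hour clock hour and minute

print(year_month)  # 2012-08
print(full_date)   # 2012-08-05
print(clock_time)  # 03:07
```

Each directive pads its value with a leading zero where needed, which is what keeps the resulting strings sortable.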
Below, I apply the pandas Series.dt.strftime() method to the user_created_at datetime column to convert its values to strings in the %Y-%m format.
df['user_created_at_year_month'] = df['user_created_at'].dt.strftime('%Y-%m')
See the full list of datetime string-formatting directives on this official documentation page.
Preview df.
df.head()
| | user_created_at | user_id | user_created_at_year_month |
|---|---|---|---|
| 0 | 2019-10-07 15:11:40.354691 | 0 | 2019-10 |
| 1 | 2019-10-18 15:40:40.354691 | 1 | 2019-10 |
| 2 | 2019-10-24 15:56:40.354691 | 2 | 2019-10 |
| 3 | 2019-10-28 16:28:40.354691 | 3 | 2019-10 |
| 4 | 2019-11-08 17:00:40.354691 | 4 | 2019-11 |
Calculate user signups per month
I can group by the user_created_at_year_month column and count the rows in each group with the size() method.
Then, I convert the resulting pandas Series to a DataFrame with the reset_index() method and apply the rename() method to rename the newly created column to count_signups.
df.groupby('user_created_at_year_month').size().reset_index().rename(columns={0: 'count_signups'})
| | user_created_at_year_month | count_signups |
|---|---|---|
| 0 | 2019-10 | 4 |
| 1 | 2019-11 | 6 |
| 2 | 2019-12 | 2 |
| 3 | 2020-01 | 5 |
| 4 | 2020-02 | 4 |
| 5 | 2020-03 | 4 |
| 6 | 2020-04 | 4 |
| 7 | 2020-05 | 5 |
| 8 | 2020-06 | 3 |
| 9 | 2020-07 | 3 |
| 10 | 2020-08 | 3 |
| 11 | 2020-09 | 3 |
| 12 | 2020-10 | 6 |
| 13 | 2020-11 | 4 |
| 14 | 2020-12 | 4 |
The result is ordered from the earliest month to the latest. The dates in df were already in that order, and groupby() also sorts the group keys by default, so zero-padded year-month strings come out in calendar order even when the original rows are not. To order the rows of the DataFrame itself by datetime, use the pandas sort_values() method.
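As a rough sketch of the out-of-order case, the example below builds a small DataFrame of hypothetical signups (the dates and the names signups, signups_sorted, and monthly_counts are assumptions for illustration), sorts it with sort_values(), and then counts signups per year-month.

```python
import pandas as pd

# Hypothetical signups listed out of chronological order.
signups = pd.DataFrame({
    "user_created_at": pd.to_datetime([
        "2020-01-15 09:30", "2019-11-02 14:00",
        "2019-12-24 18:45", "2019-11-20 08:10",
    ]),
    "user_id": [0, 1, 2, 3],
})

# Order rows from the earliest to the latest signup.
signups_sorted = signups.sort_values("user_created_at").reset_index(drop=True)

# Zero-padded year-month strings sort chronologically.
signups_sorted["user_created_at_year_month"] = (
    signups_sorted["user_created_at"].dt.strftime("%Y-%m")
)

# Count rows per year-month and name the count column directly.
monthly_counts = (
    signups_sorted.groupby("user_created_at_year_month")
    .size()
    .reset_index(name="count_signups")
)
print(monthly_counts)
```

Passing name= to reset_index() is a shorter alternative to the reset_index().rename() chain used above. Both give a count_signups column.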