
# Create Year-Month Column from Dates

### Import Modules

```python
import pandas as pd
import datetime
import random
```

### Why Create a Year-Month Numerical Column

Often, you'll be asked to create an aggregate metric per month, such as sales per month. To calculate it, you create a new column holding the year-month combination and then sum sales for each year-month value.

I recommend formatting year-month as the four-digit year followed by the two-digit, zero-padded month. For example, activity in August 2012 should be written in Python as "2012-08". Why?

This format sorts correctly from oldest to newest or newest to oldest. If you instead format dates with month names, such as "August 2012" and "May 2012", Python compares the strings alphabetically and places "August" before "May", which is wrong by the calendar. The comparison below evaluates to `True`:

```python
"August 2012" < "May 2012"
```
```
True
```

Also, year must come before month because proper ordering of dates should start with year, then month, day, hour, minute, second, etc.

If we rewrite the comparison with zero-padded numbers, it evaluates to `False`, which is correct because August 2012 does not occur before May 2012.

```python
"2012-08" < "2012-05"
```
```
False
```
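String comparison matches calendar order only when the month is zero-padded to two digits. The short sketch below uses made-up year-month strings (not data from this tutorial) to show that unpadded values misplace October, while padded values sort correctly.

```python
# Hypothetical year-month strings, used only for illustration.
unpadded = ["2012-8", "2012-10", "2012-5"]
padded = ["2012-08", "2012-10", "2012-05"]

# Strings are compared character by character, so "1" sorts before "5" and "8".
sorted_unpadded = sorted(unpadded)
sorted_padded = sorted(padded)

print(sorted_unpadded)  # ['2012-10', '2012-5', '2012-8']
print(sorted_padded)    # ['2012-05', '2012-08', '2012-10']
```

The padded list comes out in true calendar order, which is why a fixed-width month is the safer choice.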

### Example: Calculate User Signups Per Month

#### Create dates

Let's assume we work for a software as a service (SaaS) business that receives signups for our app. We will create random datetime values in increasing order to represent data for the times people signed up and assign those values to the list `signup_datetimes`.

```python
random.seed(18)  # make the random offsets reproducible
signup_datetime = datetime.datetime.today()  # start from the current date and time
signup_datetimes = []
total_signups = 60

for signup in range(total_signups):
    # each signup occurs 1-13 days and 1-50 minutes after the previous one
    random_days = random.randint(1, 13)
    random_minutes = random.randint(1, 50)
    signup_datetime = signup_datetime + datetime.timedelta(days=random_days, minutes=random_minutes)
    signup_datetimes.append(signup_datetime)
```

Create a DataFrame assigned to `df` with columns for time users signed up and a unique user id value for each signup.

```python
df = pd.DataFrame({'user_created_at': signup_datetimes, 'user_id': range(0, total_signups)})
```

Preview `df`.

```python
df.head()
```

| | user_created_at | user_id |
|---|---|---|
| 0 | 2019-10-07 15:11:40.354691 | 0 |
| 1 | 2019-10-18 15:40:40.354691 | 1 |
| 2 | 2019-10-24 15:56:40.354691 | 2 |
| 3 | 2019-10-28 16:28:40.354691 | 3 |
| 4 | 2019-11-08 17:00:40.354691 | 4 |

#### Create year-month column from date

Python datetime objects have a `strftime()` method (short for "string format time") that returns the datetime as a string in a format you specify. The format is built from directives, each of which stands for one part of the date or time. Commonly used directives are:

| Directive | Meaning |
|---|---|
| `%Y` | Year with century as a decimal number |
| `%m` | Month as a decimal number |
| `%d` | Day of the month as a decimal number |
| `%I` | Hour (12-hour clock) as a decimal number [01, 12] |
| `%M` | Minute as a decimal number |
Below, I apply the Pandas `Series.dt.strftime()` method to the `user_created_at` datetime column to convert its values to strings in the `%Y-%m` format. Because `%m` zero-pads the month, the resulting strings sort in calendar order.

```python
df['user_created_at_year_month'] = df['user_created_at'].dt.strftime('%Y-%m')
```

The official Python documentation for `strftime()` and `strptime()` lists all available formatting directives.

Preview `df`.

```python
df.head()
```

| | user_created_at | user_id | user_created_at_year_month |
|---|---|---|---|
| 0 | 2019-10-07 15:11:40.354691 | 0 | 2019-10 |
| 1 | 2019-10-18 15:40:40.354691 | 1 | 2019-10 |
| 2 | 2019-10-24 15:56:40.354691 | 2 | 2019-10 |
| 3 | 2019-10-28 16:28:40.354691 | 3 | 2019-10 |
| 4 | 2019-11-08 17:00:40.354691 | 4 | 2019-11 |
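As an aside, pandas also offers `Series.dt.to_period()`, which is a different technique from the `strftime()` approach used in this tutorial. The sketch below compares the two on a few made-up timestamps; the `signups` Series is hypothetical and not part of `df`.

```python
import pandas as pd

# Hypothetical signup timestamps, used only for illustration.
signups = pd.Series(
    pd.to_datetime(["2019-10-07 15:11", "2019-10-18 15:40", "2019-11-08 17:00"])
)

# strftime() returns plain strings.
as_strings = signups.dt.strftime("%Y-%m")

# to_period("M") returns monthly Period values instead of strings.
as_periods = signups.dt.to_period("M")

print(as_strings.tolist())              # ['2019-10', '2019-10', '2019-11']
print(as_periods.astype(str).tolist())  # ['2019-10', '2019-10', '2019-11']
```

Strings are simple to display and export, while Period values keep date semantics; which one fits better depends on what you do with the column next.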

#### Calculate user signups per month

I can group by the `user_created_at_year_month` column and count the rows in each group with the Pandas `size()` method.

Then, I convert the resulting Pandas Series to a DataFrame using the `reset_index()` method and apply the `rename()` method to rename the newly created column to `count_signups`.

```python
df.groupby('user_created_at_year_month').size().reset_index().rename(columns={0: 'count_signups'})
```

| | user_created_at_year_month | count_signups |
|---|---|---|
| 0 | 2019-10 | 4 |
| 1 | 2019-11 | 6 |
| 2 | 2019-12 | 2 |
| 3 | 2020-01 | 5 |
| 4 | 2020-02 | 4 |
| 5 | 2020-03 | 4 |
| 6 | 2020-04 | 4 |
| 7 | 2020-05 | 5 |
| 8 | 2020-06 | 3 |
| 9 | 2020-07 | 3 |
| 10 | 2020-08 | 3 |
| 11 | 2020-09 | 3 |
| 12 | 2020-10 | 6 |
| 13 | 2020-11 | 4 |
| 14 | 2020-12 | 4 |
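The same count can be written a little more compactly by passing `name=` to `reset_index()`, which names the count column directly and makes the separate `rename()` step unnecessary. This is a sketch on a small made-up DataFrame (`example_df` is hypothetical), with rows deliberately out of order.

```python
import pandas as pd

# Hypothetical year-month labels, deliberately out of order.
example_df = pd.DataFrame({
    "user_id": [0, 1, 2, 3, 4],
    "user_created_at_year_month": ["2019-11", "2019-10", "2019-11", "2019-12", "2019-10"],
})

signups_per_month = (
    example_df.groupby("user_created_at_year_month")
    .size()                              # number of rows in each group
    .reset_index(name="count_signups")   # Series -> DataFrame with a named count column
)

print(signups_per_month)
```

The result lists the months in ascending order even though the input rows were shuffled.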

Since the dates in `df` were in order from earliest to latest, the grouped result follows the same order. Note that `groupby()` also sorts the group keys by default, so the year-month values would come out in ascending order even if the original rows were shuffled. To put the rows of a DataFrame themselves in date order, use the Pandas `sort_values()` method.
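A minimal sketch of that sorting step, assuming a small made-up DataFrame (`shuffled` is hypothetical) whose rows are not in date order:

```python
import pandas as pd

# Hypothetical signups, intentionally out of chronological order.
shuffled = pd.DataFrame({
    "user_created_at": pd.to_datetime(["2020-03-01", "2019-10-07", "2019-12-25"]),
    "user_id": [2, 0, 1],
})

# Sort rows from earliest to latest and renumber the index.
ordered = shuffled.sort_values("user_created_at").reset_index(drop=True)

print(ordered)
```

Pass `ascending=False` to `sort_values()` to order from latest to earliest instead.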