cut() Function: Bin Values into Discrete Intervals

Import Modules

import pandas as pd
import numpy as np

Why Bin Data

Numerical data often spans a wide range of values. Binning those values into groups can make the data easier to work with: you can compute descriptive statistics per group and use them to spot general patterns in the data.

We'll cover an example below of binning age values into groups.

Binning in pandas with Age Example

First, let's create a simple pandas DataFrame assigned to the variable df_ages with just one column for age. This column will contain 40 random whole-number ages from 20 to 99. The upper bound passed to np.random.randint() is exclusive, so 100 itself is never drawn.

df_ages = pd.DataFrame({'age': np.random.randint(20, 100, 40)})
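Because the draw is random, your values will differ from the ones shown in this tutorial. Here is a minimal sketch of a reproducible version. The seed value 42 and the name df_demo are illustrative choices, not part of the original example. It also prints the smallest and largest age so you can check the bounds yourself.

```python
import numpy as np
import pandas as pd

# Seed NumPy's global generator so the same 40 ages are drawn on every run.
# The seed value 42 is an arbitrary choice for illustration.
np.random.seed(42)

# randint's upper bound is exclusive, so this draws integers from 20 to 99.
df_demo = pd.DataFrame({'age': np.random.randint(20, 100, 40)})

# Smallest and largest age in the sample.
print(df_demo['age'].min(), df_demo['age'].max())
```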

Let's preview the first 5 rows of df_ages.

df_ages.head()
age
0 37
1 37
2 84
3 99
4 93

Ages are often described by decade: we say people are in their 20s or 30s. To get to that notation, it helps to remember that someone in their 20s is between the ages of 20 and 29. Let's create a new column called age_range that shows that window of ages.

First, let's build a list assigned to the variable age_ranges using a list comprehension. Each item is a string made of a starting age and that age plus 9, for starting ages from 20 up to (but not including) 100 in steps of 10. The second line displays the contents of age_ranges.

If the syntax below still seems daunting, you can learn more about string formatting from this tutorial on my website and the range() function via this article on Real Python.

age_ranges = ["{0} - {1}".format(age, age + 9) for age in range(20, 100, 10)]
age_ranges
['20 - 29',
 '30 - 39',
 '40 - 49',
 '50 - 59',
 '60 - 69',
 '70 - 79',
 '80 - 89',
 '90 - 99']
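If you are on Python 3.6 or later, the same labels can be built with an f-string. This is only an alternative spelling of the comprehension above. The name age_ranges_f is hypothetical and is not used elsewhere in this tutorial.

```python
# Build the same decade labels with an f-string instead of str.format().
age_ranges_f = [f"{age} - {age + 9}" for age in range(20, 100, 10)]
print(age_ranges_f)
```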

Create a variable count_unique_age_ranges that's the count of items in the age_ranges list above.

count_unique_age_ranges = len(age_ranges)
count_unique_age_ranges
8

To bin our ages, we'll use the pandas cut() function.

For the argument x, we pass in the values in the age column from our df_ages DataFrame.

For the argument bins, we pass in the number of bins we want to create designated by the variable count_unique_age_ranges.

For the argument labels, we specify the labels for our returned binned column which is the list we created above assigned to the variable age_ranges.

With the returned output of the cut() function, we'll create a new column in df_ages called age_range.

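When bins is an integer, cut() splits the observed range of the age column, from its minimum to its maximum, into that many equal-width bins. If the sample contains both 20 and 99, eight bins over whole-number ages line up with 20-29, 30-39 and so forth, which is why the labels in age_ranges match. If the sample's minimum or maximum is different, the bin edges shift and the labels may no longer describe the values inside each bin. Passing explicit bin edges to bins avoids that.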

The order of the values in the labels list age_ranges matters too! The first item, '20 - 29', is applied to the first (lowest) bin, the second item to the second bin, and so forth.

df_ages['age_range'] = pd.cut(x=df_ages['age'], bins=count_unique_age_ranges, labels=age_ranges)
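To see which edges cut() actually picked for an integer bins value, you can ask it to return them with retbins=True. The sketch below uses a small hand-written Series (sample_ages, a hypothetical stand-in for the age column) whose minimum is 25 and maximum is 94. The eight equal-width bins then do not fall on decade boundaries, and the age 31 ends up carrying the first label.

```python
import pandas as pd

# Hypothetical sample that does not span the full 20 to 99 range.
sample_ages = pd.Series([25, 31, 48, 52, 67, 73, 88, 94])

# retbins=True returns the computed bin edges alongside the binned values.
binned, edges = pd.cut(sample_ages, bins=8, retbins=True)
print(edges)

# Attach the decade labels to the same eight equal-width bins.
labels = ["{0} - {1}".format(age, age + 9) for age in range(20, 100, 10)]
labeled = pd.cut(sample_ages, bins=8, labels=labels)

# Age 31 falls in the first equal-width bin, so it gets the first label.
print(labeled.iloc[1])
```

The nine edges are evenly spaced between the sample's minimum and maximum, with the lowest edge nudged slightly below the minimum so that the smallest value is included.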

Let's preview the first few rows of df_ages.

df_ages.head()
age age_range
0 37 30 - 39
1 37 30 - 39
2 84 80 - 89
3 99 90 - 99
4 93 90 - 99
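A more robust variant, sketched here under the assumption that every age falls between 20 and 99, passes explicit bin edges instead of a bin count. Setting right=False makes each bin closed on the left and open on the right, so the bin from 20 to 30 holds ages 20 through 29. The names sample_ages, decade_edges, decade_labels and fixed are illustrative.

```python
import pandas as pd

# Hypothetical sample chosen to sit on and around decade boundaries.
sample_ages = pd.Series([25, 29, 30, 31, 59, 60, 94, 99])

# Nine edges define eight bins: [20, 30), [30, 40), ..., [90, 100).
decade_edges = list(range(20, 101, 10))
decade_labels = ["{0} - {1}".format(age, age + 9) for age in range(20, 100, 10)]

fixed = pd.cut(sample_ages, bins=decade_edges, right=False, labels=decade_labels)
print(fixed.tolist())
```

With explicit edges the result no longer depends on which values happen to be in the sample. Any value outside the edges would come back as NaN rather than being placed in a bin.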

We can call the info() method on our DataFrame. Notice that the data type listed next to our new age_range column is category.

df_ages.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 2 columns):
age          40 non-null int64
age_range    40 non-null category
dtypes: category(1), int64(1)
memory usage: 504.0 bytes
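The category dtype stores each distinct label once and keeps a small integer code per row. By default cut() returns an ordered categorical, so the labels sort and compare in bin order rather than alphabetically. Here is a small sketch with hypothetical data (sample_ages and the explicit edges are assumptions made for the example):

```python
import pandas as pd

sample_ages = pd.Series([37, 84, 99, 21])
labels = ["{0} - {1}".format(age, age + 9) for age in range(20, 100, 10)]
binned = pd.cut(sample_ages, bins=list(range(20, 101, 10)), right=False, labels=labels)

# All eight labels are kept as categories, including ones with no rows.
print(list(binned.cat.categories))

# cut() produces an ordered categorical by default.
print(binned.cat.ordered)

# Each row stores the position of its label in the category list.
print(binned.cat.codes.tolist())
```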

Let's create another new column to define someone's age denoted by decade. So, if you're 28, you're in your 20s and if you're 54, you're in your 50s. We'll later call this column age_by_decade.

First, let's create a list assigned to the variable age_by_decade that holds the decade names. We take each value from 20 up to (but not including) 100 in steps of 10 and append an s to it.

age_by_decade = ["{0}s".format(age) for age in range(20, 100, 10)]
age_by_decade
['20s', '30s', '40s', '50s', '60s', '70s', '80s', '90s']

Create a variable count_unique_age_decades that's the count of items in the age_by_decade list above.

count_unique_age_decades = len(age_by_decade)
count_unique_age_decades
8

Let's create a new column called age_by_decade using the pandas cut() function.

For the argument x, we pass in the values in the age column from our df_ages DataFrame.

For the argument bins, we pass in the number of bins we want to create designated by the variable count_unique_age_decades.

For the argument labels, we specify the labels for our returned binned column which is the list we created above assigned to the variable age_by_decade.

With the returned output of the cut() function, we'll create a new column in df_ages called age_by_decade.

df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=count_unique_age_decades, labels=age_by_decade)

Preview the first 5 rows of df_ages.

df_ages.head()
age age_range age_by_decade
0 37 30 - 39 30s
1 37 30 - 39 30s
2 84 80 - 89 80s
3 99 90 - 99 90s
4 93 90 - 99 90s
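With the bins in place, the per-group descriptive statistics mentioned at the start of this tutorial become short calls. The sketch below uses a small hand-written sample instead of the random df_ages (df_sample and summary are hypothetical names) and computes the row count and mean age per decade.

```python
import pandas as pd

df_sample = pd.DataFrame({'age': [22, 27, 35, 38, 39, 91]})
labels = ["{0}s".format(age) for age in range(20, 100, 10)]

# Explicit edges with right=False give the bins [20, 30), [30, 40), and so on.
df_sample['age_by_decade'] = pd.cut(
    df_sample['age'], bins=list(range(20, 101, 10)), right=False, labels=labels
)

# observed=True keeps only the decades that actually occur in the data.
summary = df_sample.groupby('age_by_decade', observed=True)['age'].agg(['count', 'mean'])
print(summary)
```

Grouping on the binned column rather than on raw age collapses many distinct values into a handful of rows, which is what makes the summary easy to read.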

You can learn more about the details of the pandas cut() function on the official documentation page.