cut() Method: Bin Values into Discrete Intervals¶
Date published: 2019-07-16
Category: Data Analysis
Subcategory: Data Wrangling
Tags: categorical data, python, pandas, bin
Import Modules¶
import pandas as pd
import numpy as np
Why Bin Data¶
Numerical data often spans a wide range of values. Binning the values into a small number of groups can make the data easier to work with: you can compute descriptive statistics per group and use them to see general patterns in the data.
Binning in Pandas with Age Example¶
Create Random Age Data¶
First, let's create a simple pandas DataFrame assigned to the variable df_ages with just one column for age. This column will contain 8 random age values between 21 inclusive and 51 exclusive.
df_ages = pd.DataFrame({'age': np.random.randint(21, 51, 8)})
Print out df_ages.
df_ages
| | age |
|---|---|
| 0 | 45 |
| 1 | 47 |
| 2 | 37 |
| 3 | 41 |
| 4 | 29 |
| 5 | 30 |
| 6 | 30 |
| 7 | 49 |
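Because the ages are random, the values will differ on each run. If you want repeatable values, one option is a seeded generator. This is a minimal sketch, not part of the original example: the name df_seeded and the seed value 42 are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# Seeded generator; the seed value 42 is an arbitrary choice for illustration.
rng = np.random.default_rng(42)

# integers() excludes the upper bound by default, like randint(21, 51, 8).
df_seeded = pd.DataFrame({'age': rng.integers(21, 51, size=8)})
print(df_seeded)
```

Running this twice with the same seed should produce the same eight ages.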
Create New Column of age_bins Via Defining Bin Edges¶
This code creates a new column called age_bins by calling pd.cut() with the x argument set to the age column in df_ages and the bins argument set to a list of bin edge values. By default, the left edge of each bin is exclusive and the right edge is inclusive.
The bins will be for ages: (20, 29] (someone in their 20s), (29, 39], and (39, 49]. An age outside these edges, such as 50, would be assigned NaN.
df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])
Print out df_ages. We can see each age value is assigned to the proper bin.
df_ages
| | age | age_bins |
|---|---|---|
| 0 | 45 | (39, 49] |
| 1 | 47 | (39, 49] |
| 2 | 37 | (29, 39] |
| 3 | 41 | (39, 49] |
| 4 | 29 | (20, 29] |
| 5 | 30 | (29, 39] |
| 6 | 30 | (29, 39] |
| 7 | 49 | (39, 49] |
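To see the edge behavior in isolation, the sketch below bins a few hand-picked ages that sit on or just outside the bin edges. The series edge_ages and the second set of edges are assumptions added for illustration. The second call uses the right=False option of pd.cut(), which makes the left edge inclusive and the right edge exclusive instead.

```python
import pandas as pd

# Hand-picked ages that sit on or just outside the bin edges.
edge_ages = pd.Series([20, 21, 29, 30, 49, 50])

# Default right=True: the intervals are (20, 29], (29, 39], (39, 49].
# 20 and 50 fall outside every interval, so they become NaN.
default_bins = pd.cut(x=edge_ages, bins=[20, 29, 39, 49])
print(default_bins)

# right=False: the intervals are [20, 30), [30, 40), [40, 50).
# 20 is now included, while 50 is still outside.
left_closed_bins = pd.cut(x=edge_ages, bins=[20, 30, 40, 50], right=False)
print(left_closed_bins)
```

With right=False the edges can be written as round decades (20, 30, 40, 50), which some readers may find easier to follow than 29, 39, and 49.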
Let's verify the unique age_bins values.
df_ages['age_bins'].unique()
[(39, 49], (29, 39], (20, 29]]
Categories (3, interval[int64]): [(20, 29] < (29, 39] < (39, 49]]
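Beyond listing the unique bins, it can be useful to count how many ages fall into each one. The following is a small sketch that rebuilds the ages from the table above as a standalone Series; the variable names ages, age_bins, and counts are illustrative.

```python
import pandas as pd

# The eight ages shown in the table above.
ages = pd.Series([45, 47, 37, 41, 29, 30, 30, 49])
age_bins = pd.cut(x=ages, bins=[20, 29, 39, 49])

# Count the ages per bin, then order the result by interval rather than by count.
counts = age_bins.value_counts().sort_index()
print(counts)
```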
Create New Column of age_by_decade With Labels 20s, 30s, and 40s¶
This code creates a new column called age_by_decade using the same first two arguments as above, plus a third argument, labels, set to a list of names, one per bin, that replace the interval notation with a decade label.
df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])
Print out df_ages.
df_ages
| | age | age_bins | age_by_decade |
|---|---|---|---|
| 0 | 45 | (39, 49] | 40s |
| 1 | 47 | (39, 49] | 40s |
| 2 | 37 | (29, 39] | 30s |
| 3 | 41 | (39, 49] | 40s |
| 4 | 29 | (20, 29] | 20s |
| 5 | 30 | (29, 39] | 30s |
| 6 | 30 | (29, 39] | 30s |
| 7 | 49 | (39, 49] | 40s |
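Once the ages are labeled, the labels can drive descriptive statistics by group, which is the motivation given at the start. The sketch below is one possible way to do that with groupby(); the DataFrame df_example, the variable summary, and the choice of statistics are assumptions for illustration, using the ages from the table above.

```python
import pandas as pd

# Rebuild the example data with the ages from the table above.
df_example = pd.DataFrame({'age': [45, 47, 37, 41, 29, 30, 30, 49]})
df_example['age_by_decade'] = pd.cut(
    x=df_example['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s']
)

# observed=True limits the result to labels that actually occur in the data.
summary = (
    df_example.groupby('age_by_decade', observed=True)['age']
    .agg(['count', 'mean', 'min', 'max'])
)
print(summary)
```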
Learn more about the Pandas cut() method from the official documentation page.