cut() Method: Bin Values into Discrete Intervals¶
Date published: 2019-07-16
Category: Data Analysis
Subcategory: Data Wrangling
Tags: categorical data, python, pandas, bin
Import Modules¶
import pandas as pd
import numpy as np
Why Bin Data¶
Numerical data often spans a wide range of values. Binning those values into a small number of groups makes it easier to compute descriptive statistics per group and to see general patterns in the data.
Binning in Pandas with Age Example¶
Create Random Age Data¶
First, let's create a simple pandas DataFrame, assigned to the variable df_ages, with just one column, age. This column will contain 8 random age values between 21 (inclusive) and 51 (exclusive):
df_ages = pd.DataFrame({'age': np.random.randint(21, 51, 8)})
Print out df_ages.
df_ages
| | age |
|---|---|
| 0 | 45 |
| 1 | 47 |
| 2 | 37 |
| 3 | 41 |
| 4 | 29 |
| 5 | 30 |
| 6 | 30 |
| 7 | 49 |
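Because the ages are drawn at random, the values shown above will differ from run to run. If reproducible output matters, one option is to draw from a seeded generator. This is a minimal sketch, not part of the original walkthrough: the seed value 42 and the name df_ages_seeded are arbitrary choices made for illustration.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same draws on every run (the seed 42 is arbitrary)
rng = np.random.default_rng(42)

# Like np.random.randint, integers() excludes the upper bound by default,
# so the ages fall between 21 and 50
df_ages_seeded = pd.DataFrame({'age': rng.integers(21, 51, size=8)})
print(df_ages_seeded)
```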
Create New Column of age_bins Via Defining Bin Edges¶
This code creates a new column called age_bins by calling pd.cut() with the x argument set to the age column in df_ages and the bins argument set to a list of bin edge values. By default, the left edge of each bin is exclusive and the right edge is inclusive.
The bins will be (20, 29] (someone in their 20s), (29, 39], and (39, 49]. Because the ages are whole numbers, these bins cover ages 21 to 29, 30 to 39, and 40 to 49.
df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])
Print out df_ages. We can see that each age value is assigned to its proper bin.
df_ages
| | age | age_bins |
|---|---|---|
| 0 | 45 | (39, 49] |
| 1 | 47 | (39, 49] |
| 2 | 37 | (29, 39] |
| 3 | 41 | (39, 49] |
| 4 | 29 | (20, 29] |
| 5 | 30 | (29, 39] |
| 6 | 30 | (29, 39] |
| 7 | 49 | (39, 49] |
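One caveat worth checking: the random ages can be as high as 50, but the last bin edge is 49. A value outside every bin is not an error in pd.cut(); it is assigned NaN. The sketch below uses a small hand-picked Series (the names ages, binned, and binned_wide are illustrative, not from the original) to show the behavior and one possible fix, widening the last edge.

```python
import pandas as pd

# Hand-picked ages, including 50, which lies above the last bin edge of 49
ages = pd.Series([21, 29, 30, 49, 50])

# Values that fall outside every bin become NaN rather than raising an error
binned = pd.cut(x=ages, bins=[20, 29, 39, 49])
print(binned)

# One possible fix: widen the last edge so the final bin is (39, 50]
binned_wide = pd.cut(x=ages, bins=[20, 29, 39, 50])
print(binned_wide)
```

Checking binned.isna().sum() after binning is a quick way to confirm that no values were silently dropped.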
Let's verify the unique age_bins values.
df_ages['age_bins'].unique()
[(39, 49], (29, 39], (20, 29]]
Categories (3, interval[int64]): [(20, 29] < (29, 39] < (39, 49]]
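If round decade edges are easier to read, pd.cut() also accepts right=False, which makes the left edge of each bin inclusive and the right edge exclusive. The following is a sketch of that alternative, not the approach used in this post. The edges [20, 30, 40, 50] and the sample ages are chosen here for illustration.

```python
import pandas as pd

# Sample ages that sit on or next to the decade boundaries
ages = pd.Series([29, 30, 39, 40])

# right=False closes each bin on the left: [20, 30), [30, 40), [40, 50)
decades = pd.cut(x=ages, bins=[20, 30, 40, 50], right=False)
print(decades)
```

With these edges, age 30 falls in [30, 40) rather than in the bin that ends at 30.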
Create New Column of age_by_decade With Labels 20s, 30s, and 40s¶
This code creates a new column called age_by_decade with the same first 2 arguments as above, plus a third argument, labels, set to a list with one name per bin, in bin order. Ages in (20, 29] are labeled 20s, ages in (29, 39] are labeled 30s, and ages in (39, 49] are labeled 40s.
df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])
Print out df_ages.
df_ages
| | age | age_bins | age_by_decade |
|---|---|---|---|
| 0 | 45 | (39, 49] | 40s |
| 1 | 47 | (39, 49] | 40s |
| 2 | 37 | (29, 39] | 30s |
| 3 | 41 | (39, 49] | 40s |
| 4 | 29 | (20, 29] | 20s |
| 5 | 30 | (29, 39] | 30s |
| 6 | 30 | (29, 39] | 30s |
| 7 | 49 | (39, 49] | 40s |
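With the labels in place, the binned column can be used for the per-group descriptive statistics mentioned at the start. The sketch below rebuilds the DataFrame from the eight ages shown in the table above so that it runs on its own. The choice of count and mean, and the name summary, are illustrative assumptions.

```python
import pandas as pd

# The eight ages from the table above, entered by hand so the example is self-contained
df = pd.DataFrame({'age': [45, 47, 37, 41, 29, 30, 30, 49]})
df['age_by_decade'] = pd.cut(x=df['age'], bins=[20, 29, 39, 49],
                             labels=['20s', '30s', '40s'])

# observed=False keeps every label in the result, even one with no rows
summary = df.groupby('age_by_decade', observed=False)['age'].agg(['count', 'mean'])
print(summary)
```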
Learn more about the Pandas cut() method from the official documentation page.