cut() Method: Bin Values into Discrete Intervals

Date published: 2019-07-16

Category: Data Analysis

Subcategory: Data Wrangling

Tags: categorical data, python, pandas, bin

Import Modules

In [81]:

import pandas as pd
import numpy as np

Why Bin Data

Numerical data often spans a wide range of values. It can be easier to work with if you bin the values into groups. Binning makes it simpler to compute descriptive statistics by group and to see general patterns in the data.

Binning in Pandas with Age Example

Create Random Age Data

First, let's create a simple pandas DataFrame assigned to the variable df_ages with just one column for age. This column will contain 8 random integer ages between 21 (inclusive) and 51 (exclusive).

In [82]:

df_ages = pd.DataFrame({'age': np.random.randint(21, 51, 8)})

Print out df_ages.

In [83]:

df_ages

Out[83]:

	age
0	45
1	47
2	37
3	41
4	29
5	30
6	30
7	49
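
Because the ages above come from an unseeded random draw, rerunning the cell produces different values. If you want the same ages on every run, one option is a seeded NumPy generator. This is a minimal sketch: the name df_ages_seeded and the seed value 42 are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same ages are drawn on every run (the seed value is arbitrary)
rng = np.random.default_rng(seed=42)

# Like np.random.randint, integers() excludes the upper bound by default,
# so the ages fall between 21 and 50 inclusive
df_ages_seeded = pd.DataFrame({'age': rng.integers(21, 51, size=8)})
print(df_ages_seeded)
```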

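Create New Column of `age_bins` Via Defining Bin Edges

This code creates a new column called age_bins. The x argument is set to the age column in df_ages, and the bins argument is set to a list of bin edge values. By default, the left edge of each bin is exclusive and the right edge is inclusive.

The bins will be for ages: (20, 29] (someone in their 20s), (29, 39], and (39, 49]. For whole-number ages, these cover 21 to 29, 30 to 39, and 40 to 49. Note that an age of 50, which np.random.randint(21, 51, 8) can produce, falls outside the last bin and would be assigned NaN.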

In [90]:

df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])

Print out df_ages. We can see that each age value is assigned to the bin that contains it.

In [85]:

df_ages

Out[85]:

	age	age_bins
0	45	(39, 49]
1	47	(39, 49]
2	37	(29, 39]
3	41	(39, 49]
4	29	(20, 29]
5	30	(29, 39]
6	30	(29, 39]
7	49	(39, 49]
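
To see how the bin edges are treated, it can help to bin a few hand-picked boundary values. The sketch below uses a hypothetical Series called edge_ages (not part of the original data) with the same bins as above. Values at or below the lowest edge, or above the highest edge, are not placed in any bin and come back as NaN.

```python
import pandas as pd

# Hand-picked boundary ages (hypothetical values chosen to probe the bin edges)
edge_ages = pd.Series([20, 21, 29, 30, 49, 50])

# Same bin edges as in the example above; right edges are inclusive by default
binned = pd.cut(x=edge_ages, bins=[20, 29, 39, 49])

print(pd.DataFrame({'age': edge_ages, 'age_bins': binned}))

# 20 sits on the exclusive left edge and 50 is above the last edge, so both are NaN
print(binned.isna().tolist())  # [True, False, False, False, False, True]
```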

Let's verify the unique age_bins values.

In [86]:

df_ages['age_bins'].unique()

Out[86]:

[(39, 49], (29, 39], (20, 29]]
Categories (3, interval[int64]): [(20, 29] < (29, 39] < (39, 49]]
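
If you would rather have bins that read as [20, 30), [30, 40), and so on, pd.cut() accepts right=False, which makes the left edge inclusive and the right edge exclusive. The following is a sketch under assumed inputs: the ages Series and the final edge of 51 (chosen so that an age of 50 still lands in a bin) are illustrative choices, not part of the original example.

```python
import pandas as pd

# Illustrative ages, including the decade boundaries
ages = pd.Series([21, 29, 30, 39, 40, 50])

# right=False makes each bin include its left edge and exclude its right edge
left_closed = pd.cut(x=ages, bins=[20, 30, 40, 51], right=False)

print(pd.DataFrame({'age': ages, 'age_bins': left_closed}))
```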

Create New Column of `age_by_decade` With Labels `20s`, `30s`, and `40s`

This code creates a new column called age_by_decade with the same first 2 arguments as above, plus a third argument, labels, set to a list of names for the bins. There must be one label per bin, so 4 bin edges need 3 labels. Each age is assigned the label of the decade bin it falls into.

In [87]:

df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])

Print out df_ages.

In [88]:

df_ages

Out[88]:

	age	age_bins	age_by_decade
0	45	(39, 49]	40s
1	47	(39, 49]	40s
2	37	(29, 39]	30s
3	41	(39, 49]	40s
4	29	(20, 29]	20s
5	30	(29, 39]	30s
6	30	(29, 39]	30s
7	49	(39, 49]	40s
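
Once the ages are labeled, descriptive statistics by group become straightforward. As a rough sketch, the code below rebuilds the 8 ages shown in the output above in a separate DataFrame (named ages here purely for illustration) and then counts and averages the ages within each decade.

```python
import pandas as pd

# The 8 ages from the output above, hard-coded so the result is reproducible
ages = pd.DataFrame({'age': [45, 47, 37, 41, 29, 30, 30, 49]})
ages['age_by_decade'] = pd.cut(x=ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])

# Number of people in each decade
counts = ages['age_by_decade'].value_counts(sort=False)
print(counts)

# Count and mean age per decade; observed=False keeps every category, even empty ones
summary = ages.groupby('age_by_decade', observed=False)['age'].agg(['count', 'mean'])
print(summary)
```

With this data, the 20s group has 1 person, the 30s group has 3, and the 40s group has 4.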

Learn more about the Pandas cut() method from the official documentation page.