
# cut() Method: Bin Values into Discrete Intervals

### Import Modules

```python
import pandas as pd
import numpy as np
```

### Why Bin Data

Numerical data often spans a wide range of values. Binning the values into groups can make the data easier to work with: you can compute descriptive statistics per group and use them to summarize patterns in the data.

### Binning in Pandas with Age Example

#### Create Random Age Data

First, let's create a simple pandas DataFrame assigned to the variable `df_ages` with just one column, `age`. This column will contain `8` random integer ages between `21` (inclusive) and `51` (exclusive).

```python
# 8 random integer ages drawn from 21 up to, but not including, 51
df_ages = pd.DataFrame({'age': np.random.randint(21, 51, 8)})
```

Print out `df_ages`.

```python
df_ages
```
```
   age
0   45
1   47
2   37
3   41
4   29
5   30
6   30
7   49
```
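Because `np.random.randint` draws new values on every run, your ages will differ from the ones shown here. Below is a minimal sketch of a reproducible variant. It assumes NumPy's `default_rng` generator and an arbitrary seed of `42`; both the seed and the name `df_ages_seeded` are choices made for this illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd

# Seeded generator so the same ages come back on every run.
# The seed value 42 is arbitrary and chosen only for this illustration.
rng = np.random.default_rng(42)

# integers() treats `high` as exclusive by default, like np.random.randint,
# so every age lies in the range 21..50.
df_ages_seeded = pd.DataFrame({'age': rng.integers(21, 51, size=8)})

print(df_ages_seeded)
```

Re-running this block should print the same eight ages each time, which makes the binning steps easier to compare across runs.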

### Create New Column of `age_bins` Via Defining Bin Edges

This code creates a new column called `age_bins` by calling `pd.cut()` with the `x` argument set to the `age` column of `df_ages` and the `bins` argument set to a list of bin edge values. By default, the left edge of each bin is exclusive and the right edge is inclusive.

The bins will be for ages `(20, 29]` (someone in their 20s), `(29, 39]`, and `(39, 49]`. An age of `50`, which the random draw can produce, falls outside these edges and is assigned `NaN`.

```python
# Edges 20, 29, 39, 49 produce three right-inclusive bins
df_ages['age_bins'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49])
```

Print out `df_ages`. We can see that each `age` value is assigned to the proper bin.

```python
df_ages
```
```
   age  age_bins
0   45  (39, 49]
1   47  (39, 49]
2   37  (29, 39]
3   41  (39, 49]
4   29  (20, 29]
5   30  (29, 39]
6   30  (29, 39]
7   49  (39, 49]
```
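To see how the bin edges behave, here is a small sketch that uses the fixed ages from the table above plus two extra values, `20` and `50`, added only for illustration. It relies on the `right` parameter of `pd.cut`; the variable names `ages`, `edges`, `right_closed`, and `left_closed` are hypothetical.

```python
import pandas as pd

# Ages from the table above, plus 20 and 50 (added for illustration)
# to show what happens at and beyond the outer bin edges.
ages = pd.Series([45, 47, 37, 41, 29, 30, 30, 49, 20, 50])
edges = [20, 29, 39, 49]

# Default right=True: intervals are (20, 29], (29, 39], (39, 49].
right_closed = pd.cut(x=ages, bins=edges)

# right=False flips the closed side: [20, 29), [29, 39), [39, 49).
left_closed = pd.cut(x=ages, bins=edges, right=False)

print(pd.DataFrame({
    'age': ages,
    'right_closed': right_closed,
    'left_closed': left_closed,
}))
```

With the default settings, `29` lands in `(20, 29]`, while `20` and `50` fall outside every bin and come back as `NaN`. With `right=False`, `29` moves to `[29, 39)` and `49` becomes `NaN` instead.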

Let's verify the unique `age_bins` values.

```python
df_ages['age_bins'].unique()
```
```
[(39, 49], (29, 39], (20, 29]]
Categories (3, interval[int64]): [(20, 29] < (29, 39] < (39, 49]]
```
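Because the result of `pd.cut` is an ordered categorical, the bins can also be counted in interval order. The following is a short sketch using the fixed ages from the table above rather than new random values; the names `ages`, `age_bins`, and `counts` are chosen for this illustration.

```python
import pandas as pd

ages = pd.Series([45, 47, 37, 41, 29, 30, 30, 49])
age_bins = pd.cut(x=ages, bins=[20, 29, 39, 49])

# The full set of bins, in order, regardless of which ones received values.
print(age_bins.cat.categories)

# Count rows per bin, then sort by the bin itself rather than by count.
counts = age_bins.value_counts().sort_index()
print(counts)
```

For these eight ages, one value falls in `(20, 29]`, three in `(29, 39]`, and four in `(39, 49]`.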

### Create New Column of `age_by_decade` With Labels `20s`, `30s`, and `40s`

This code creates a new column called `age_by_decade` with the same first two arguments as above and a third argument, `labels`, set to a list of names, one per bin, that replace the interval notation with a decade label.

```python
# One label per bin: three bins, three labels
df_ages['age_by_decade'] = pd.cut(x=df_ages['age'], bins=[20, 29, 39, 49], labels=['20s', '30s', '40s'])
```

Print out `df_ages`.

```python
df_ages
```
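With the ages from the first table, the labelled column can be checked and then used for the per-group statistics that motivate binning. This is a sketch on fixed values, not on freshly generated random data. The DataFrame name `df`, the name `summary`, and the choice of `count`, `mean`, `min`, and `max` are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({'age': [45, 47, 37, 41, 29, 30, 30, 49]})
df['age_by_decade'] = pd.cut(
    x=df['age'],
    bins=[20, 29, 39, 49],
    labels=['20s', '30s', '40s'],  # one label per bin: len(bins) - 1 labels
)
print(df)

# Descriptive statistics per bin. observed=False keeps every label in the
# result, including any label that received no rows.
summary = df.groupby('age_by_decade', observed=False)['age'].agg(
    ['count', 'mean', 'min', 'max']
)
print(summary)
```

Here `29` is labelled `20s` and `30` is labelled `30s`, matching the `(20, 29]` and `(29, 39]` intervals shown earlier.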
Learn more about the Pandas `cut()` method from the official documentation page.