# groupby() Method: Split Data into Groups, Apply a Function to Groups, Combine the Results

- November 28, 2018
- Key Terms: groupby, python, pandas

A **group by** is a process that tyipcally involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the outputted results.

### Import Modules¶

```
import pandas as pd
import seaborn as sns
import numpy as np
```

### Get Tips Dataset¶

Let's get the `tips`

dataset from the `seaborn`

library and assign it to the DataFrame `df_tips`

.

```
df_tips = sns.load_dataset('tips')
```

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name | column description |
---|---|

`total_bill` |
financial amount of meal in U.S. dollars |

`tip` |
financial amount of the meal's tip in U.S. dollars |

`sex` |
gender of server |

`smoker` |
boolean to represent if server smokes or not |

`day` |
day of week |

`time` |
meal name (Lunch or Dinner) |

`size` |
count of people eating meal |

Preview the first 5 rows of `df_tips`

.

```
df_tips.head()
```

### Implement Group bys with Tips Dataset¶

#### Group by of a Single Column and Apply a Single Aggregate Method on a Column¶

The simplest example of a `groupby()`

operation is to compute the size of groups in a single column. By size, the calculation is a count of unique occurences of values in a single column. Here is the official documentation for this operation.

This is the same operation as utilizing the `value_counts()`

method in pandas.

Below, for the `df_tips`

DataFrame, I call the `groupby()`

method, pass in the `sex`

column, and then chain the `size()`

method.

```
df_tips.groupby(by='sex').size()
```

To interpret the output above, 157 meals were served by males and 87 meals were served by females.

A note, if there are any `NaN`

or `NaT`

values in the grouped column that would appear in the index, those are automatically excluded in your output (reference here).

In pandas, we can also group by one columm and then perform an aggregate method on a different column.

For example, in our dataset, I want to group by the `sex`

column and then across the `total_bill`

column, find the mean bill size. To do this in pandas, given our `df_tips`

DataFrame, apply the `groupby()`

method and pass in the `sex`

column (that'll be our index), and then reference our `['total_bill']`

column (that'll be our returned column) and chain the `mean()`

method.

Meals served by males had a mean bill size of 20.74 while meals served by females had a mean bill size of 18.06.

```
df_tips.groupby(by='sex')['total_bill'].mean()
```

We can verify the output above with a query. We get the same result that meals served by males had a mean bill size of 20.74

```
df_tips.query("sex=='Male'")['total_bill'].mean()
```

Another interesting tidbit with the `groupby()`

method is the ability to group by a single column, and call an aggregate method that will apply to all other numeric columns in the DataFrame.

For example, if I group by the `sex`

column and call the `mean()`

method, the mean is calculated for the three other numeric columns in `df_tips`

which are `total_bill`

, `tip`

, and `size`

.

```
df_tips.groupby(by='sex').mean()
```

#### Aggregate Methods¶

Other aggregate methods you could perform with a `groupby()`

method in pandas are:

method | description |
---|---|

`sum()` |
summation |

`mean()` |
average |

`count()` |
count of all values |

`size()` |
count of non-null values |

`max()` |
maximum value |

`min()` |
minimum value |

`std()` |
standard deviation |

`median()` |
median |

To illustrate the difference between the `size()`

and `count()`

methods, I included this simple example below.

The DataFrame below of `df_rides`

includes Dan and Jamie's ride data.

```
data = {'person': ['Dan', 'Dan', 'Jamie', 'Jamie'],
'ride_duration_minutes': [4, np.NaN, 8, 10]}
df_rides = pd.DataFrame(data)
df_rides
```

For one of Dan's rides, the `ride_duration_minutes`

value is null. However, if we apply the `size`

method, we'll still see a count of `2`

rides for Dan. We are 100% sure he took 2 rides but there's only a small issue in our dataset in which the the exact duration of one ride wasn't recorded.

```
df_rides.groupby(by='person')['ride_duration_minutes'].size()
```

Upon applying the `count()`

method, we only see a count of `1`

for Dan because that's the number of non-null values in the `ride_duration_minutes`

field that belongs to him.

```
df_rides.groupby(by='person')['ride_duration_minutes'].count()
```

#### Group by of a Single Column and Apply the `describe()`

Method on a Single Column¶

With grouping of a single column, you can also apply the `describe()`

method to a numerical column. Below, I group by the `sex`

column, reference the `total_bill`

column and apply the `describe()`

method on its values. The `describe`

method outputs many descriptive statistics. Learn more about the `describe()`

method on the official documentation page.

```
df_tips.groupby(by='sex')['total_bill'].describe()
```

#### Group by of a Single Column and Apply a Lambda Expression on a Single Column¶

Most examples in this tutorial involve using simple aggregate methods like calculating the mean, sum or a count. However, with group bys, we have flexibility to apply custom lambda functions.

You can learn more about lambda expressions from the Python 3 documentation and about using instance methods in group bys from the official pandas documentation.

Below, I group by the `sex`

column and apply a lambda expression to the `total_bill`

column. The expression is to find the range of `total_bill`

values. The range is the maximum value subtracted by the minimum value. I also rename the single column returned on output so it's understandable.

```
df_tips.groupby(by='sex').agg({'total_bill': lambda bill: bill.max() - bill.min()}).rename(columns={'total_bill': "range_total_bill"})
```

In this dataset, males had a bigger range of `total_bill`

values.

#### Group by of Multiple Columns and Apply a Single Aggregate Method on a Column¶

We can group by multiple columns too. For example, I want to know the count of meals served by people's gender for each day of the week. So, call the `groupby()`

method and set the `by`

argument to a list of the columns we want to group by.

```
df_tips.groupby(by=['sex', 'day']).size()
```

We can also group by multiple columns and apply an aggregate method on a different column. Below I group by people's gender and day of the week and find the total sum of those groups' bills.

```
df_tips.groupby(by=['sex', 'day'])['total_bill'].sum()
```

#### Group by of a Single Column and Apply Multiple Aggregate Methods on a Column¶

The `agg()`

method allows us to specify multiple functions to apply to each column. Below, I group by the `sex`

column and then we'll apply multiple aggregate methods to the `total_bill`

column. Inside the `agg()`

method, I pass a dictionary and specify `total_bill`

as the key and a list of aggregate methods as the value.

You can pass various types of syntax inside the argument for the `agg()`

method. I chose a dictionary because that syntax will be helpful when we want to apply aggregate methods to multiple columns later on in this tutorial.

```
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']})
```

You can learn more about the `agg()`

method on the official pandas documentation page.

The code below performs the same group by operation as above, and additionally I rename columns to have clearer names.

```
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).rename(columns={'count': 'count_meals_served', 'mean': 'average_bill_of_meal', 'sum': 'total_bills_of_meals'})
```

We can modify the format of the output above through chaining the `unstack()`

and `reset_index()`

methods after our group by operation. This format may be ideal for additional analysis later on.

```
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).unstack().reset_index().rename(columns={'level_0': 'aggregated_column', 'level_1': 'aggregate_metric', 'sex': 'grouped_column', 0: 'aggregate_calculation'}).round(2)
```

#### Group by of a Single Column and Apply Multiple Aggregate Methods on Multiple Columns¶

Below, I use the `agg()`

method to apply two different aggregate methods to two different columns. I group by the `sex`

column and for the `total_bill`

column, apply the `max`

method, and for the `tip`

column, apply the `min`

method.

```
df_tips.groupby(by='sex').agg({'total_bill': 'max', 'tip': 'min'}).rename(columns={'total_bill': 'max_total_bill', 'tip': 'min_tip_amount'})
```

#### Group by of Multiple Columns and Apply a Groupwise Calculation on Multiple Columns¶

In restaurants, common math by guests is to calculate the tip for the waiter/waittress. My mom thinks 20% tip is customary. So, if the bill was 10, you should tip 2 and pay 12 in total. I'm curious what the tip percentages are based on the gender of servers, meal and day of the week. We can perform that calculation with a `groupby()`

and the `pipe()`

method.

The `pipe()`

method allows us to call functions in a *chain*. So as the `groupby()`

method is called, at the same time, another function is being called to perform data manipulations. You can learn more about `pipe()`

from the official documentation.

To perform this calculation, we need to group by `sex`

, `time`

and `day`

, then call our `pipe()`

method and calculate the `tip`

divided by `total_bill`

multiplied by `100`

.

```
df_tips.groupby(by=['sex', 'time', 'day']).pipe(lambda group: group.tip.sum()/group.total_bill.sum()*100)
```

The highest tip percentage has been for females for dinner on Sunday.