groupby() Method: Split Data into Groups, Apply a Function to Groups, Combine the Results¶

Date published: 2018-11-28

Category: Data Analysis

Subcategory: Data Wrangling

Tags: groupby, python, pandas

A group by is a process that tyipcally involves splitting the data into groups based on some criteria, applying a function to each group independently, and then combining the outputted results.

Import Modules¶

In [89]:

                
                    Copied!
                    
import pandas as pd
import seaborn as sns
import numpy as np
import pandas as pd
import seaborn as sns
import numpy as np

Get Tips Dataset¶

Let's get the tips dataset from the seaborn library and assign it to the DataFrame df_tips.

In [90]:

                
                    Copied!
                    
df_tips = sns.load_dataset('tips')
df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name	column description
`total_bill`	financial amount of meal in U.S. dollars
`tip`	financial amount of the meal's tip in U.S. dollars
`sex`	gender of server
`smoker`	boolean to represent if server smokes or not
`day`	day of week
`time`	meal name (Lunch or Dinner)
`size`	count of people eating meal

Preview the first 5 rows of df_tips.

In [91]:

                
                    Copied!
                    
df_tips.head()
df_tips.head()

Out[91]:

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

Implement Group bys with Tips Dataset¶

Group by of a Single Column and Apply a Single Aggregate Method on a Column¶

The simplest example of a groupby() operation is to compute the size of groups in a single column. By size, the calculation is a count of unique occurences of values in a single column. Here is the official documentation for this operation.

This is the same operation as utilizing the value_counts() method in pandas.

Below, for the df_tips DataFrame, I call the groupby() method, pass in the sex column, and then chain the size() method.

In [92]:

                
                    Copied!
                    
df_tips.groupby(by='sex').size()
df_tips.groupby(by='sex').size()

Out[92]:

sex
Male      157
Female     87
dtype: int64

To interpret the output above, 157 meals were served by males and 87 meals were served by females.

A note, if there are any NaN or NaT values in the grouped column that would appear in the index, those are automatically excluded in your output (reference here).

In pandas, we can also group by one columm and then perform an aggregate method on a different column.

For example, in our dataset, I want to group by the sex column and then across the total_bill column, find the mean bill size. To do this in pandas, given our df_tips DataFrame, apply the groupby() method and pass in the sex column (that'll be our index), and then reference our ['total_bill'] column (that'll be our returned column) and chain the mean() method.

Meals served by males had a mean bill size of 20.74 while meals served by females had a mean bill size of 18.06.

In [93]:

                
                    Copied!
                    
df_tips.groupby(by='sex')['total_bill'].mean()
df_tips.groupby(by='sex')['total_bill'].mean()

Out[93]:

sex
Male      20.744076
Female    18.056897
Name: total_bill, dtype: float64

We can verify the output above with a query. We get the same result that meals served by males had a mean bill size of 20.74

In [94]:

                
                    Copied!
                    
df_tips.query("sex=='Male'")['total_bill'].mean()
df_tips.query("sex=='Male'")['total_bill'].mean()

Out[94]:

20.744076433121034

Another interesting tidbit with the groupby() method is the ability to group by a single column, and call an aggregate method that will apply to all other numeric columns in the DataFrame.

For example, if I group by the sex column and call the mean() method, the mean is calculated for the three other numeric columns in df_tips which are total_bill, tip, and size.

In [95]:

                
                    Copied!
                    
df_tips.groupby(by='sex').mean()
df_tips.groupby(by='sex').mean()

Out[95]:

	total_bill	tip	size
sex
Male	20.744076	3.089618	2.630573
Female	18.056897	2.833448	2.459770

Aggregate Methods¶

Other aggregate methods you could perform with a groupby() method in pandas are:

method	description
`sum()`	summation
`mean()`	average
`count()`	count of all values
`size()`	count of non-null values
`max()`	maximum value
`min()`	minimum value
`std()`	standard deviation
`median()`	median

To illustrate the difference between the size() and count() methods, I included this simple example below.

The DataFrame below of df_rides includes Dan and Jamie's ride data.

In [96]:

                
                    Copied!
                    
data = {'person': ['Dan', 'Dan', 'Jamie', 'Jamie'], 
        'ride_duration_minutes': [4, np.NaN, 8, 10]}
df_rides = pd.DataFrame(data)
df_rides
data = {'person': ['Dan', 'Dan', 'Jamie', 'Jamie'], 
        'ride_duration_minutes': [4, np.NaN, 8, 10]}
df_rides = pd.DataFrame(data)
df_rides

Out[96]:

	person	ride_duration_minutes
0	Dan	4.0
1	Dan	NaN
2	Jamie	8.0
3	Jamie	10.0

For one of Dan's rides, the ride_duration_minutes value is null. However, if we apply the size method, we'll still see a count of 2 rides for Dan. We are 100% sure he took 2 rides but there's only a small issue in our dataset in which the the exact duration of one ride wasn't recorded.

In [97]:

                
                    Copied!
                    
df_rides.groupby(by='person')['ride_duration_minutes'].size()
df_rides.groupby(by='person')['ride_duration_minutes'].size()

Out[97]:

person
Dan      2
Jamie    2
Name: ride_duration_minutes, dtype: int64

Upon applying the count() method, we only see a count of 1 for Dan because that's the number of non-null values in the ride_duration_minutes field that belongs to him.

In [98]:

                
                    Copied!
                    
df_rides.groupby(by='person')['ride_duration_minutes'].count()
df_rides.groupby(by='person')['ride_duration_minutes'].count()

Out[98]:

person
Dan      1
Jamie    2
Name: ride_duration_minutes, dtype: int64

Group by of a Single Column and Apply the `describe()` Method on a Single Column¶

With grouping of a single column, you can also apply the describe() method to a numerical column. Below, I group by the sex column, reference the total_bill column and apply the describe() method on its values. The describe method outputs many descriptive statistics. Learn more about the describe() method on the official documentation page.

In [99]:

                
                    Copied!
                    
df_tips.groupby(by='sex')['total_bill'].describe()
df_tips.groupby(by='sex')['total_bill'].describe()

Out[99]:

	count	mean	std	min	25%	50%	75%	max
sex
Male	157.0	20.744076	9.246469	7.25	14.00	18.35	24.71	50.81
Female	87.0	18.056897	8.009209	3.07	12.75	16.40	21.52	44.30

Group by of a Single Column and Apply a Lambda Expression on a Single Column¶

Most examples in this tutorial involve using simple aggregate methods like calculating the mean, sum or a count. However, with group bys, we have flexibility to apply custom lambda functions.

You can learn more about lambda expressions from the Python 3 documentation and about using instance methods in group bys from the official pandas documentation.

Below, I group by the sex column and apply a lambda expression to the total_bill column. The expression is to find the range of total_bill values. The range is the maximum value subtracted by the minimum value. I also rename the single column returned on output so it's understandable.

In [100]:

                
                    Copied!
                    
df_tips.groupby(by='sex').agg({'total_bill': lambda bill: bill.max() - bill.min()}).rename(columns={'total_bill': "range_total_bill"})
df_tips.groupby(by='sex').agg({'total_bill': lambda bill: bill.max() - bill.min()}).rename(columns={'total_bill': "range_total_bill"})

Out[100]:

	range_total_bill
sex
Male	43.56
Female	41.23

In this dataset, males had a bigger range of total_bill values.

Group by of Multiple Columns and Apply a Single Aggregate Method on a Column¶

We can group by multiple columns too. For example, I want to know the count of meals served by people's gender for each day of the week. So, call the groupby() method and set the by argument to a list of the columns we want to group by.

In [101]:

                
                    Copied!
                    
df_tips.groupby(by=['sex', 'day']).size()
df_tips.groupby(by=['sex', 'day']).size()

Out[101]:

sex     day 
Male    Thur    30
        Fri     10
        Sat     59
        Sun     58
Female  Thur    32
        Fri      9
        Sat     28
        Sun     18
dtype: int64

We can also group by multiple columns and apply an aggregate method on a different column. Below I group by people's gender and day of the week and find the total sum of those groups' bills.

In [102]:

                
                    Copied!
                    
df_tips.groupby(by=['sex', 'day'])['total_bill'].sum()
df_tips.groupby(by=['sex', 'day'])['total_bill'].sum()

Out[102]:

sex     day 
Male    Thur     561.44
        Fri      198.57
        Sat     1227.35
        Sun     1269.46
Female  Thur     534.89
        Fri      127.31
        Sat      551.05
        Sun      357.70
Name: total_bill, dtype: float64

Group by of a Single Column and Apply Multiple Aggregate Methods on a Column¶

The agg() method allows us to specify multiple functions to apply to each column. Below, I group by the sex column and then we'll apply multiple aggregate methods to the total_bill column. Inside the agg() method, I pass a dictionary and specify total_bill as the key and a list of aggregate methods as the value.

You can pass various types of syntax inside the argument for the agg() method. I chose a dictionary because that syntax will be helpful when we want to apply aggregate methods to multiple columns later on in this tutorial.

In [103]:

                
                    Copied!
                    
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']})
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']})

Out[103]:

	total_bill
	count	mean	sum
sex
Male	157	20.744076	3256.82
Female	87	18.056897	1570.95

You can learn more about the agg() method on the official pandas documentation page.

The code below performs the same group by operation as above, and additionally I rename columns to have clearer names.

In [104]:

                
                    Copied!
                    
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).rename(columns={'count': 'count_meals_served', 'mean': 'average_bill_of_meal', 'sum': 'total_bills_of_meals'})
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).rename(columns={'count': 'count_meals_served', 'mean': 'average_bill_of_meal', 'sum': 'total_bills_of_meals'})

Out[104]:

	total_bill
	count_meals_served	average_bill_of_meal	total_bills_of_meals
sex
Male	157	20.744076	3256.82
Female	87	18.056897	1570.95

We can modify the format of the output above through chaining the unstack() and reset_index() methods after our group by operation. This format may be ideal for additional analysis later on.

In [105]:

                
                    Copied!
                    
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).unstack().reset_index().rename(columns={'level_0': 'aggregated_column', 'level_1': 'aggregate_metric', 'sex': 'grouped_column', 0: 'aggregate_calculation'}).round(2)
df_tips.groupby(by='sex').agg({'total_bill': ['count', 'mean', 'sum']}).unstack().reset_index().rename(columns={'level_0': 'aggregated_column', 'level_1': 'aggregate_metric', 'sex': 'grouped_column', 0: 'aggregate_calculation'}).round(2)

Out[105]:

	aggregated_column	aggregate_metric	grouped_column	aggregate_calculation
0	total_bill	count	Male	157.00
1	total_bill	count	Female	87.00
2	total_bill	mean	Male	20.74
3	total_bill	mean	Female	18.06
4	total_bill	sum	Male	3256.82
5	total_bill	sum	Female	1570.95

Group by of a Single Column and Apply Multiple Aggregate Methods on Multiple Columns¶

Below, I use the agg() method to apply two different aggregate methods to two different columns. I group by the sex column and for the total_bill column, apply the max method, and for the tip column, apply the min method.

In [106]:

                
                    Copied!
                    
df_tips.groupby(by='sex').agg({'total_bill': 'max', 'tip': 'min'}).rename(columns={'total_bill': 'max_total_bill', 'tip': 'min_tip_amount'})
df_tips.groupby(by='sex').agg({'total_bill': 'max', 'tip': 'min'}).rename(columns={'total_bill': 'max_total_bill', 'tip': 'min_tip_amount'})

Out[106]:

	max_total_bill	min_tip_amount
sex
Male	50.81	1.0
Female	44.30	1.0

Group by of Multiple Columns and Apply a Groupwise Calculation on Multiple Columns¶

In restaurants, common math by guests is to calculate the tip for the waiter/waittress. My mom thinks 20% tip is customary. So, if the bill was 10, you should tip 2 and pay 12 in total. I'm curious what the tip percentages are based on the gender of servers, meal and day of the week. We can perform that calculation with a groupby() and the pipe() method.

The pipe() method allows us to call functions in a chain. So as the groupby() method is called, at the same time, another function is being called to perform data manipulations. You can learn more about pipe() from the official documentation.

To perform this calculation, we need to group by sex, time and day, then call our pipe() method and calculate the tip divided by total_bill multiplied by 100.

In [107]:

                
                    Copied!
                    
df_tips.groupby(by=['sex', 'time', 'day']).pipe(lambda group: group.tip.sum()/group.total_bill.sum()*100)
df_tips.groupby(by=['sex', 'time', 'day']).pipe(lambda group: group.tip.sum()/group.total_bill.sum()*100)

Out[107]:

sex     time    day 
Male    Lunch   Thur    15.925121
                Fri     16.686183
        Dinner  Fri     12.912840
                Sat     14.824622
                Sun     14.713343
Female  Lunch   Thur    15.388192
                Fri     19.691535
        Dinner  Thur    15.974441
                Fri     19.636618
                Sat     14.236458
                Sun     16.944367
dtype: float64

The highest tip percentage has been for females for dinner on Sunday.