# Mean, Median and Mode

- November 28, 2018
- Key Terms: mean, median, mode, python, pandas

I'll discuss three popular summary statistics for describing the central tendency of a field: mean, median and mode. Each is appropriate for describing central tendency under certain conditions.

I'll use an example of meal bill amounts in the `tips` dataset to elaborate on these three concepts.

### Import Modules

```
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
```

Visualization styling code

```
sns.set_context('talk')
```

### Example 1: Tips Dataset

Let's get the `tips` dataset from the `seaborn` library and assign it to the DataFrame `df_tips`.

```
df_tips = sns.load_dataset('tips')
```

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

| column name | column description |
|---|---|
| `total_bill` | amount of the meal's bill in U.S. dollars |
| `tip` | amount of the meal's tip in U.S. dollars |
| `sex` | sex of the person paying the bill |
| `smoker` | whether the party included smokers (Yes or No) |
| `day` | day of week |
| `time` | meal name (Lunch or Dinner) |
| `size` | count of people eating the meal |

Preview the first 5 rows of `df_tips`.

```
df_tips.head()
```

#### Visualize Distribution of Total Bill Amounts

First off, I like to visualize the distribution of data in a field. Below is a histogram to illustrate the distribution of values in the `total_bill` field. The x-axis indicates total bill amounts and the y-axis indicates the frequency (aka count) of occurrences of those total bill amounts.

From a visual standpoint, it looks like most total bill amounts are around 12 to 20. I can also see a long tail to the right in which there are total bill amounts up to 50.

```
df_tips['total_bill'].plot(kind='hist', figsize=(10, 8), linewidth=2, color='whitesmoke', edgecolor='gray')
plt.xlabel("total bill [$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
plt.title("Distribution of Total Bill Amounts", y=1.012, fontsize=22);
```

#### Mode

The **mode** is the most frequently occurring value in a field.

We can use the pandas `mode()` method to calculate the mode of the `total_bill` column in `df_tips`.

```
mode_total_bill = df_tips['total_bill'].mode().iloc[0]
mode_total_bill
```

Our mode is 13.42.

We can also query our `df_tips` DataFrame to see the bills in which the `total_bill` is equal to 13.42. We're returned three unique meals.

```
df_tips.query('total_bill==13.42')
```
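The `.iloc[0]` in the mode calculation is there because `mode()` returns a Series rather than a single number: several values can tie for most frequent. Here is a minimal sketch of that behavior, using a small made-up list of bill amounts (the `bills` values are hypothetical, not taken from `tips`):

```python
import pandas as pd

# Hypothetical bill amounts; 12.5 and 18.0 each appear twice
bills = pd.Series([10.0, 12.5, 12.5, 18.0, 18.0, 25.0])

# value_counts() shows how often each value occurs
counts = bills.value_counts()

# mode() returns every value tied for the highest count, sorted ascending
modes = bills.mode()

# .iloc[0] keeps only the smallest of the tied modes
first_mode = modes.iloc[0]

print(counts)
print(modes.tolist())
print(first_mode)
```

When there is a tie, taking `.iloc[0]` silently drops the other modes, so it can be worth checking `len(modes)` before relying on a single value.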

#### Mean

The arithmetic **mean**, also known as the *average*, is the sum of the values in a field divided by the count of values in that field.

The mean of a population can be denoted by $\mu$ and is the sum of all values in the population divided by the count of values in the population. It's represented by the formula:

$$ \mu = \frac{\sum{x}}{N} $$

The mean of a sample can be denoted by $\bar{x}$ and is the sum of all values in the sample divided by the count of values in the sample. It's represented by the formula:

$$ \bar{x} = \frac{\sum{x}}{n} $$

We can use the pandas `mean()` method to calculate the mean of the `total_bill` column in `df_tips`, and I'll round this value to two decimal places.

```
mean_total_bill = round(df_tips['total_bill'].mean(), 2)
mean_total_bill
```

The mean is 19.79.
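To connect the sample mean formula to the pandas call, here is a small sketch that computes the sum divided by the count by hand and compares it with `mean()`. The four values in `sample` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of four bill amounts
sample = pd.Series([10.0, 15.0, 20.0, 35.0])

# x-bar = (sum of x) / n
manual_mean = sample.sum() / sample.count()

# pandas' built-in calculation
pandas_mean = sample.mean()

print(manual_mean, pandas_mean)
```

Both approaches give the same number; `mean()` is simply the more convenient spelling.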

#### Median

Once a set of values is sorted, the **median** is the *middle* value.

We can use the pandas `median()` method to calculate the median of the `total_bill` column in `df_tips`.

```
median_total_bill = df_tips['total_bill'].median()
median_total_bill
```

The median is 17.795.

As another example, consider the following list of numbers: $1, 1, 3, 5, 9, 10$

This list is already sorted. We have 6 numbers in this list, so there's no single middle number. To find the median, we have to calculate the arithmetic mean of the middle two numbers: $3$ and $5$.

The mean of these two numbers is $4$ and that's our median.
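The even-length case can be sketched in a few lines of Python. The helper name `median_by_hand` is my own, introduced only for illustration; the result is compared against pandas' `median()`.

```python
import pandas as pd

numbers = [1, 1, 3, 5, 9, 10]

def median_by_hand(values):
    """Return the middle value, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: a single middle value exists
        return ordered[mid]
    # even count: average the two values around the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

manual_median = median_by_hand(numbers)
pandas_median = pd.Series(numbers).median()

print(manual_median, pandas_median)
```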

#### Visualize Mean, Median and Mode on Our Total Bill Histogram

```
df_tips['total_bill'].plot(kind='hist', figsize=(10, 8), linewidth=2, color='whitesmoke', edgecolor='gray')
plt.xlabel("total bill [$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
plt.title("Distribution of Total Bill Amounts", y=1.012, fontsize=22)
measurements = [mode_total_bill, median_total_bill, mean_total_bill]
names = ["mode", "median", "mean"]
colors = ['green', 'blue', 'orange']
for measurement, name, color in zip(measurements, names, colors):
    plt.axvline(x=measurement, linestyle='--', linewidth=2.5, label='{0} at {1}'.format(name, measurement), c=color)
plt.legend();
```

We can now see that the long tail of large total bill values pulls our mean upward, making it the greatest of the three summary statistics that describe central tendency.
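The same pattern can be checked numerically. The sketch below uses a small made-up sample with a long right tail (not the `tips` data) and shows the mean sitting above the median, along with a positive value from the pandas `skew()` method, which indicates right skew.

```python
import pandas as pd

# Hypothetical right-skewed sample: most values are small, a few are large
sample = pd.Series([10, 12, 13, 14, 15, 16, 18, 20, 30, 50])

sample_mean = sample.mean()
sample_median = sample.median()
sample_skew = sample.skew()  # positive for a long right tail

print(sample_mean, sample_median, sample_skew)
```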

### Pros and Cons of These Summary Statistics to Describe Central Tendency

#### Mode

The mode can be flawed for describing the central tendency of a field because it ignores a lot of data. In our `total_bill` field, about half of the values are above 17.8 (the median), yet our mode is much lower at just 13.42.

Hypothetically, if we had a record of four meals with a total bill size of 30, that would be our mode. However, a value of 30 would seem far off from what we can visually see as the "center" of the data.

#### Mean

The mean is typically the most common summary statistic for the "center" of a set of values. The mean works well if the data is normally distributed. Otherwise, if there are just a few values in a field that are incredibly small or large (often called outliers), then our mean can be distorted.

Hypothetically, if in one day four meals had total bill values of 15, 20, 25 and 60, our mean would be 30; yet a value of 30 is larger than three of our four values (15, 20 and 25). So, in this instance, the mean would be a poor measure to describe the "center" of the data.

#### Median

The median is typically a good measurement to describe the "center" of a set of values. An advantage of the median is that it's less affected by extreme values when working with skewed data. For example, to continue the simple example above, if in one day four meals had total bill values of 15, 20, 25 and 60, our mean would be 30; this number seems far off from the "center" of the values. The mean calculation is heavily skewed by the large value of 60. However, the median of these four values is 22.5 (halfway between 20 and 25), which seems closer to the "center" of the data and is less distorted than the mean.
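The four-meal example is easy to verify in pandas. This is a quick sketch; the comparison with the 60 removed is my own addition to show how much that single value moves each statistic.

```python
import pandas as pd

bills = pd.Series([15, 20, 25, 60])

mean_with_60 = bills.mean()      # (15 + 20 + 25 + 60) / 4
median_with_60 = bills.median()  # halfway between 20 and 25

# Drop the large value to see how each statistic reacts
bills_without_60 = bills[bills < 60]
mean_without_60 = bills_without_60.mean()
median_without_60 = bills_without_60.median()

print(mean_with_60, median_with_60)
print(mean_without_60, median_without_60)
```

Removing the 60 moves the mean by 10 but the median by only 2.5, which is the sense in which the median is the more robust of the two.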

### Example 2: Geography Majors at UNC

This is a fictional dataset that includes the net worth values of geography graduates from the last 20 years at the University of North Carolina at Chapel Hill. We'll illustrate the central tendency of these net worth values.

First, I'll generate this random data using numpy's `random` module and its `normal()` function and assign it to the DataFrame `df_net_worth`.

```
np.random.seed(42) # seed random number generator with fixed value so we always get same values below
net_worth_values = list(np.random.normal(loc=80000, scale=25000, size=2800))
michael_jordan_net_worth = 1700000000
net_worth_values.append(michael_jordan_net_worth)
df_net_worth = pd.DataFrame({'net_worth': net_worth_values})
```

Preview the first few rows of `df_net_worth`.

```
df_net_worth.head()
```

Let's examine common percentile values. We'll use the quantile() Series method in pandas.

```
for percentile in [0.01, 0.25, 0.50, 0.75, 0.99]:
    value = df_net_worth['net_worth'].quantile(q=percentile)
    # multiply by 100 so the label reads 1, 25, 50, ... rather than 0.01, 0.25, 0.5, ...
    print("{0:.0f}th percentile is {1}".format(percentile * 100, value))
```

It seems our 50th percentile (the median) is a value of 80,707 and values are typically between 24,000 and 140,000.
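To make the behavior of `quantile()` concrete, here is a small sketch on a made-up Series of five values. It relies on pandas' default linear interpolation and shows that the 0.5 quantile matches `median()`.

```python
import pandas as pd

# Hypothetical values, chosen so the quartiles land on whole numbers
values = pd.Series([1, 2, 3, 4, 5])

# Passing a list of quantiles returns a Series indexed by those quantiles
quartiles = values.quantile([0.25, 0.5, 0.75])

# A single quantile returns a scalar
half = values.quantile(0.5)

print(quartiles)
print(half == values.median())
```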

We can visualize our net worth values in the range of 26,000 to 140,000. The visualization below shows a fairly normal distribution and I'd figure the "center" of our data is around 80,000.

```
df_net_worth['net_worth'].plot(kind='hist', range=[26000, 140000], color='skyblue', figsize=(10, 6), bins=20, rot=25)
plt.xlabel("net worth [$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
plt.title("Distribution of Net Worth Values of UNC Geography Graduates", y=1.015);
```

We can calculate the mean `net_worth` value, which is 687,757.

```
int(df_net_worth['net_worth'].mean())
```

This value of 687,757 seems way higher than the 80,000 we expected. Why is that?

Well, the famous NBA basketball player Michael Jordan also attended UNC and graduated with a degree in Geography. He has a net worth of 1.7B. His super high net worth skews the mean to be a very high number.

Let's calculate the median value of the `net_worth` column below. The median is 80,708, which seems like a far more reasonable metric to describe the "center" of the data.

```
median = int(df_net_worth['net_worth'].median())
median
```
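To see how much a single extreme value moves each statistic, here is a sketch that compares the mean and median with and without the outlier. It uses its own synthetic sample (a different random generator and seed than above, so the numbers will not match the post's output); the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic net worth values; assumption: same distribution shape as above
rng = np.random.default_rng(0)
values = pd.Series(rng.normal(loc=80000, scale=25000, size=2800))

# Append one extreme value of 1.7 billion
with_outlier = pd.concat([values, pd.Series([1_700_000_000])], ignore_index=True)

# How far does each statistic move when the outlier is added?
mean_shift = with_outlier.mean() - values.mean()
median_shift = with_outlier.median() - values.median()

print("mean shift:   {0:,.0f}".format(mean_shift))
print("median shift: {0:,.0f}".format(median_shift))
```

The mean moves by roughly the outlier's value divided by the number of rows, while the median shifts by at most half the gap between two neighboring middle values.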

Let's plot this median on our histogram so we can better see how it represents the "center" of the data.

```
df_net_worth['net_worth'].plot(kind='hist', range=[26000, 140000], color='skyblue', figsize=(10, 6), bins=20, rot=25)
plt.xlabel("net worth [$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
plt.title("Distribution of Net Worth Values of UNC Geography Graduates", y=1.015)
plt.axvline(x=median, linestyle='--', linewidth=2.5, label='median', c='darkslateblue')
plt.legend();
```