Math Descriptive Statistics Article

# Skewness

In a histogram to visualize a set of values, data can be considered "skewed" meaning it can have a long tail on a side.

This article will cover common interpretations of skewness.

### Import Modules

```import seaborn as sns
import numpy as np
import scipy.stats as stats
import random
import warnings
import matplotlib.pyplot as plt
% matplotlib inline
```

I turn warnings off in this post because of an issue in Scipy that will be fixed in a later version.

```warnings.filterwarnings('ignore')
```

Visualization styling code

```sns.set(rc={'figure.figsize':(10.5, 7.5)})
sns.set_context('talk')
```

### No Skew

#### Generate a Normal Distribution

Using the `numpy` package's `random` module, we can call the `normal()` method to create a list of values with a normal distribution by setting the following arguments:

• `loc` as the mean of the distribution
• `scale` as the standard deviation of the distribution
• `size` as number of samples
```np.random.seed(4) # seed random number generator with fixed value so we always get same values below
normal_distr_values = list(np.random.normal(loc=100, scale=20, size=1300))
```

#### View Distribution of `normal_distr_values`

Below is a plot of a histogram of these values that resemble a normal distribution.

```sns.distplot(normal_distr_values, kde=False, color='darkseagreen')
plt.title("Normal Distribution of Values", fontsize=20, y=1.012)
plt.xlabel("values", labelpad=15)
plt.ylabel("frequency", labelpad=15);
``` This distribution has no skew because it's perfectly symmetrical. With this distribution of no skew, the mean and median are typically at the peak. Let's calculate those and visualize them.

```mean_normal_distr_values = round(np.mean(normal_distr_values), 2)
median_normal_distr_values = round(np.median(normal_distr_values), 2)
```
```ax = sns.distplot(normal_distr_values, kde=False, color='darkseagreen')
plt.title("Normal Distribution of Values", fontsize=20, y=1.012)
plt.xlabel("values", labelpad=15)
plt.ylabel("frequency", labelpad=15);
names = ["median", "mean"]
colors = ['darkmagenta', 'darkorange']
measurements = [median_normal_distr_values, mean_normal_distr_values]
for measurement, name, color in zip(measurements, names, colors):
plt.axvline(x=measurement, linestyle='--', linewidth=2.5, label='{0} at {1}'.format(name, measurement), c=color)
plt.legend();
``` Looks like the median and mean values are the exact same and at the peak.

### Positive Skew

#### Get Diamonds Dataset

Import `diamonds` dataset from Seaborn library and assign to DataFrame `df_diamonds`.

Each row of `df_diamonds` contains details about a specific diamond purchased. We'll just utilize the `price` column in our analysis below.

```df_diamonds = sns.load_dataset('diamonds')
```

Preview the first few rows of `df_diamonds`.

```df_diamonds.head()
```
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

#### View Distribution of Diamond Prices

Below is a histogram of the `price` field in `df_diamonds`.

```ax = sns.distplot(df_diamonds['price'], kde=False, color='lightskyblue')
plt.title("Distribution of Price of Diamonds Purchase", fontsize=20, y=1.012)
plt.xlabel("price (U.S. dollars)", labelpad=15)
plt.ylabel("frequency", labelpad=15)
bbox_props = dict(fc=(1, 1, 0.9), ec="b", lw=2)
t = ax.text(8500, 3000, "Long tail of expensive diamonds", ha="center", va="center",
rotation=325, size=17, bbox=bbox_props)
bb = t.get_bbox_patch()
bb.set_boxstyle("rarrow", pad=0.6)
``` Most people tend to buy diamonds that are just a few hundred or few thousand dollars. Yet, it seems a small group of people are willing to pay over 10,000.

This distribution has positive skew because there's a long tail of values on the positive side of the peak. You would otherwise say this data is "skewed to the right".

In distributions in which there's positive skew, it's important to understand where the mean and median lie. Let's plot this distribution again and mark the mean and median values.

Calculate the median and mean value of `horsepower_numerical_values` using numpy methods.

```median_price = round(df_diamonds['price'].median(), 2)
mean_price = round(df_diamonds['price'].mean(), 2)
```
```ax = sns.distplot(df_diamonds['price'], kde=False, color='lightskyblue')
plt.title("Distribution of Price of Diamonds Purchase", fontsize=20, y=1.012)
plt.xlabel("price (U.S. dollars)", labelpad=15)
plt.ylabel("frequency", labelpad=15)
names = ["median", "mean"]
colors = ['darkmagenta', 'darkorange']
measurements = [median_price, mean_price]
for measurement, name, color in zip(measurements, names, colors):
plt.axvline(x=measurement, linestyle='--', linewidth=2.5, label='{0} at {1}'.format(name, measurement), c=color)
plt.legend();
``` Nearly with all histograms that are positively skewed, the mean is greater than the median. The reason for this is that the long tail of values skews the mean higher than with a more normal distribution. Yet, the median isn't skewed by large values because it's just the "middle" of a list of sorted numbers.

### Negative Skew

#### Generate Fictional Meal Data

Using the `scipy` package's `stats` module, we can call the `beta()` method to create a list of values with a negatively skewed distribtion.

The data below ressembles the prices of meals for a fast casual restaurant.

```typical_meal_value_dollars = list(stats.beta.rvs(10, 2, loc=-8, scale=23, size=35000))
```

I use the `randrage()` method from Python's `random` module to generate a list of large meal orders between the prices of 15 and 25.

```large_meal_value_dollars = [random.randrange(15, 23) for value in range(0, 200)]
```

I concatenate the `typical_meal_value_dollars` and `large_meal_value_dollars` to create a new list assigned to the variable `meal_values_dollars`.

```meal_values_dollars = typical_meal_value_dollars + large_meal_value_dollars
```

#### View Distribution of Meal Orders [\$]

```ax = sns.distplot(meal_values_dollars, kde=False, color='green')
plt.title("Distribution of Price of Meals at Fast Casual Restaurant", fontsize=20, y=1.012)
plt.xlabel("price [\$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
bbox_props = dict(fc=(1, 1, 0.9), ec="green", lw=2)
t = ax.text(2, 600, "Long tail of\nsmall orders [\$]", ha="center", va="center",
rotation=335, size=17, bbox=bbox_props)
bb = t.get_bbox_patch()
bb.set_boxstyle("rarrow", pad=0.6)
``` Most people tend to spend around 10 to 14 dollars on their order. Very few people spend greater than 15. Lots of people spend between 0 and 8 and likely buy small items like appetizers or just drinks.

This distribution has negative skew because there's a long tail of values on the negative side of the peak. You would otherwise say this data is "skewed to the left".

In distributions in which there's negative skew, it's important to understand where the mean and median lie. Let's plot this distribution again and mark the mean and median values.

Calculate the median and mean value of `meal_values_dollars` using numpy methods.

```median_meal_price = round(np.median(meal_values_dollars), 2)
mean_meal_price = round(np.mean(meal_values_dollars), 2)
```
```ax = sns.distplot(meal_values_dollars, kde=False, color='green')
plt.title("Distribution of Price of Meals at Fast Casual Restaurant", fontsize=20, y=1.012)
plt.xlabel("price [\$]", labelpad=15)
plt.ylabel("frequency", labelpad=15)
names = ["median", "mean"]
colors = ['darkmagenta', 'darkorange']
measurements = [median_meal_price, mean_meal_price]
for measurement, name, color in zip(measurements, names, colors):
plt.axvline(x=measurement, linestyle='--', linewidth=2.5, label='{0} at {1}'.format(name, measurement), c=color)
plt.legend();
``` Nearly with all histograms that are negatively skewed, the mean is less than the median. The reason for this is that the long tail of values skews the mean lower than with a more normal distribution. Yet, the median isn't skewed by lots of small values because it's just the "middle" of a list of sorted numbers.