# Visual Introduction to Classification and Logistic Regression

- Dec 25 • 43 min read
- Key Terms: classification, logistic regression, math

The tutorial is a high-level overview of classification problems in machine learning and how Logistic Regression works with a single feature and a binary target.

I'll cover the following topics:

- Overview of Classification & Key Terms
- Most Popular Classification Algorithms
- Examples of Classification in Industry
- Walkthrough of Credit Card Application Visualizations
- Logistic Regression
- Evaluation Metrics
- Accuracy
- Precision & Recall

#### Import Modules

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from collections import Counter
import itertools
import seaborn as sns
import matplotlib.ticker as tick
import matplotlib.patches as patches
import matplotlib.pyplot as plt
% matplotlib inline
```

#### Visualization Setup

```
sns.set_context("talk")
sns.set_style("whitegrid", {'grid.color': '.92'})
def reformat_large_tick_values(tick_val, pos):
"""
Turns large tick values (in the billions, millions and thousands) such as 4500 into 4.5K and also appropriately turns 4000 into 4K (no zero after the decimal).
"""
if tick_val >= 1000000000:
val = round(tick_val/1000000000, 1)
new_tick_format = '{:}B'.format(val)
elif tick_val >= 1000000:
val = round(tick_val/1000000, 1)
new_tick_format = '{:}M'.format(val)
elif tick_val >= 1000:
val = round(tick_val/1000, 1)
new_tick_format = '{:}K'.format(val)
elif tick_val < 1000:
new_tick_format = round(tick_val, 1)
else:
new_tick_format = tick_val
# make new_tick_format into a string value
new_tick_format = str(new_tick_format)
# code below will keep 4.5M as is but change values such as 4.0M to 4M since that zero after the decimal isn't needed
index_of_decimal = new_tick_format.find(".")
if index_of_decimal != -1:
value_after_decimal = new_tick_format[index_of_decimal+1]
if value_after_decimal == "0":
# remove the 0 after the decimal point since it's not needed
new_tick_format = new_tick_format[0:index_of_decimal] + new_tick_format[index_of_decimal+2:]
return new_tick_format
```

### Overview of Classification & Key Terms

#### Problem Setup

Classification involves building a best fit equation to predict the probability an observation is of a certain class/label. Let's break down what some of those terms mean.

In order to use a classification algorithm, you must be provided with observations, features and labeled data. Let's say we work for a bank and it's our job to approve or deny credit card applicants.

#### Example: Credit Card Application

Most Americans age 18+ likely have a debit or credit card. Credit cards are issued by banks and businesses, allowing the holder to purchase goods or services on credit. Essentially, you can buy items without the exact cash on hand - but in agreement to pay back the bank later. You can learn more about applying for credit cards on Nerdwallet.

To obtain a credit card, you must apply for one through a bank. A bank will decide/classify if you are worthy of the card.

Below is some fake sample data that indicates important features considered before one is able to be approved for a credit card.

*Credit score*, *debt*, *yearly income*, and *age* are all features. The field for the classification label is *credit card decision*.

Name | Credit Score | Debt (\$) | Yearly Income (\$) | Age | Credit Card Decision |
---|---|---|---|---|---|

Joe Smith | 610 | 3000 | 45000 | 24 | Denied |

Jill Mason | 620 | 12000 | 58000 | 22 | Approved |

Brandon Cohen | 700 | 0 | 90000 | 27 | Approved |

Ariel Pan | 720 | 0 | 110000 | 29 | Approved |

There's a small trend that people that have a higher credit score, less debt and more income are more likely to be approved for a particular credit card.

In the table above, each row is considered an **observation**. With each observation as a record of a person, there's additional details on their credit score, debt, yearly income and age.

The columns *Credit Score* and *Debt (\\()*, *Yearly Income (\\))* and *Age* are considered **features**. They are measurable characteristics for each observation.

The last column in our table is *Credit Card Decision* and represents a **class** or sometimes considered a label. In the past, a representative from the bank would choose to approve or deny each credit card application based on that observation's features. A popular name for this scenario is called **labeled data**.

#### Binary Classification

In the example above, there were just two possible class labels - either *Denied* or *Approved*. Therefore, this is considered a binary classification example. The term *binary* means related to two things.

#### Multiclass Classification

In machine learning, a **multiclass classifier** is the problem of classifying observations into *three* or more classes.

### Most Popular Classification Algorithms

These five below I consider *classical* machine learning algorithms - as many were first profiled tens of years ago.

- Logistic Regression
- SVM
- Naive Bayes
- Decision Trees
- Random Forest

In this notebook, we're just going to learn Logistic Regression. Why? Well Logistic Regression is simple to implement and fits to data quickly. Also, this model is very interpretable - both in the math with how it works and interpretability of features. Not all other algorithms listed above also fit this criteria.

However, note Logistic Regression is often regarded as one of the simpler classification algorithms. It is just OK at capturing the variance with many features. So, you likely won't get as strong of a fit of a model with it compared to more complex machine learning models like XGBoost or neural networks.

Let's next talk about some additional business scenarios in which classification is commonly used in industry.

### Popular Examples of Classification Models in Industry

- Diagnosed with a certain type of disease
- Spam classification of text comments
- Image classification
- Hot dog or not hot dog
- Nudity or not nudity
- Food safe to eat or not
- recognize objects on streets for self-driving cars

- Customer churn - essentially will a person continue using an app/service or leave
- Default or not on a credit card

### Visual Derivation of Logistic Regression Equation

Let's continue with our credit card application example. We'll utilize a single feature, *yearly income*, and two class labels of *approved* and *denied*.

```
yearly_incomes_us_dollars = [12000, 14000, 25000, 28000, 38000, 45000, 50000, 55000, 60000, 66000, 73000, 75000]
credit_card_decisions = ["denied"]*4 + ["approved"] + ["denied"]*2 + ["approved"]*5
```

```
application_data = {'yearly_income': yearly_incomes_us_dollars,
'credit_card_decision': credit_card_decisions
}
df_credit_card_applications = pd.DataFrame(application_data)
```

```
plt.figure(figsize=(9, 7))
sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Upon looking at this graph, I think of a divide in which people that earn over $51,000 per year are more likely to be approved for this credit card. Below, I visualize my first thought with a *Denied Side* and *Approved Side* annotation.

```
plt.figure(figsize=(9, 7))
ax = sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
plt.axvline(x=51000, linestyle="--", color='green')
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(65000, 0.5, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3)
ax.text(30000, 0.5, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

#### 2 Features

Let's gather additional data to help us build our best fit of equation to classify whether someone is approved or denied of a credit card.

Below, I incorporated data for people's credit score.

```
credit_scores = [580, 600, 620, 640, 680, 670, 650, 700, 690, 710, 680, 715]
```

```
application_data["credit_score"] = credit_scores
df_credit_card_applications = pd.DataFrame(application_data)
```

```
df_credit_card_applications
```

yearly_income | credit_card_decision | credit_score | |
---|---|---|---|

0 | 12000 | denied | 580 |

1 | 14000 | denied | 600 |

2 | 25000 | denied | 620 |

3 | 28000 | denied | 640 |

4 | 38000 | approved | 680 |

5 | 45000 | denied | 670 |

6 | 50000 | denied | 650 |

7 | 55000 | approved | 700 |

8 | 60000 | approved | 690 |

9 | 66000 | approved | 710 |

10 | 73000 | approved | 680 |

11 | 75000 | approved | 715 |

Let's visualize our two features and two class labels.

```
plt.figure(figsize=(10, 8))
sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

#### Visually Predict Which Class a New Observation Belongs To

Below we see a new green plus sign to mark a new observation that doesn't have a labeled class yet. What class do you think it belongs to?

I would classify the new observation as being denied.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.scatter([40000], [600], c='green', s=200, marker='+')
bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="g", lw=2.5)
t = ax.text(30000, 600, "Predict class", ha="center", va="center", size=17, bbox=bbox_props)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Below is another example, a purple plus sign, in which we have a new observation but no class label. What class do you think it belongs to?

This one is trickier. There's a decent chance/probability that this observation could be approved or denied. I would hypothesize there's a 0.60 probability that this new observation is denied.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.scatter([45000], [660], c='purple', s=200, marker='+')
bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="purple", lw=2.5)
t = ax.text(35000, 660, "Predict class", ha="center", va="center", size=17, bbox=bbox_props)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

In my mind, I think of a dividing line that would separate the two classes. I plotted one below.

However, this isn't perfect. There's an observation that was approved in my denied side. I figure with additional data points, there will also be more mis-classifications like this one.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(65000, 660, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3)
ax.text(28000, 660, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied);
plt.plot([35000, 69000], [715, 585], linestyle="--", color='green')
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

#### Let's Try Linear Regression for Classification

Below I create a new column called `decision_label`

that marks a 1 if the `credit_card_decision`

is `approved`

and a `0`

if `denied`

.

```
df_credit_card_applications['decision_label'] = np.where(df_credit_card_applications['credit_card_decision']=="approved", 1, 0)
```

Preview the DataFrame to visually check our logic is correct.

```
df_credit_card_applications[['credit_card_decision', 'decision_label']].sample(5)
```

credit_card_decision | decision_label | |
---|---|---|

0 | denied | 0 |

10 | approved | 1 |

7 | approved | 1 |

5 | denied | 0 |

1 | denied | 0 |

We can visualize this new `1`

and `0`

class label name on a plot below.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

We'll fit a linear regression model to our entire dataset. First, instantiate the `LinearRegression`

object that was imported at the top of our script and assign it to the variable `linear_regression`

. You can read more about the official documentation of Linear Regression on sklearn.

```
linear_regression = LinearRegression()
```

Let's build our linear regression line of best fit and assign it to `lr`

. First, we have to get our list of x-values and y-values in the right data format for sklearn.

```
x_values = df_credit_card_applications['yearly_income'].values.reshape(-1,1)
```

```
x_values
```

```
array([[12000],
[14000],
[25000],
[28000],
[38000],
[45000],
[50000],
[55000],
[60000],
[66000],
[73000],
[75000]])
```

```
y_values = df_credit_card_applications['decision_label'].values
```

```
y_values
```

```
array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1])
```

```
lr = linear_regression.fit(x_values, y_values)
```

The coefficient of our linear regression model.

```
lr.coef_
```

```
array([1.83358404e-05])
```

The intercept of our linear regression model.

```
lr.intercept_
```

```
-0.3266408043702007
```

Our linear regression line of best fit can be modeled by the equation:

`target = lr.intercept_ + yearly_income * lr.coef_`

Let's plot our linear regression line of best fit using the minimum and maximum values from our x and y axes.

```
min_income = df_credit_card_applications['yearly_income'].min()
max_income = df_credit_card_applications['yearly_income'].max()
```

```
print("min_income: {0}".format(min_income))
print("max_income: {0}".format(max_income))
```

```
min_income: 12000
max_income: 75000
```

We can use the `predict`

method to predict a credit card decision value given a yearly annual income value.

```
lr.predict(min_income)
```

```
array([-0.10661072])
```

Here is our linear regression line of best fit in the range of yearly income values in our dataset.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Below, we calculate our R^2 when trained on the entire dataset. At `0.58`

, it's reasonable, but we fail to capture a lot of the variance in our data.

```
r_squared = lr.score(x_values, y_values)
```

```
r_squared
```

```
0.5898028659646901
```

Below, I plot a decision threshold as a dotted green line that's equivalent to `y=0.5`

. A **decision threshold** represents the result of a quantitative test to a simple binary decision.

For example, given an input of a yearly income value, if we get a prediction value greater than `0.5`

, we'll simply round up and classify that observation as approved.

Let's try a yearly income of \$50,000.

```
lr.predict(50000)
```

```
array([0.59015122])
```

Our returned result of `0.59`

is greater than `0.5`

so we can round up to `1`

and classify this observation as being approved for a credit card.

Now let's try a yearly income of \$33,000.

```
lr.predict(33000)
```

```
array([0.27844193])
```

Our returned result of `0.27`

is less than `0.5`

so we can round down to `0`

and classify this observation as being rejected for a credit card.

Below, I visualize this decision boundary.

```
plt.figure(figsize=(11, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet')
plt.axhline(y=0.5, linestyle="--", color='green')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
ax.add_patch(orange_rect)
blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5)
ax.add_patch(blue_rect)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2)
ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied)
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

This model clearly isn't perfect. Below, I point to two observations that our model misclassifies given our linear model and rounding up/down logic.

More info on drawing rectangles in Matplotlib: patches.Rectangle documentation page.

```
plt.figure(figsize=(11, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet')
plt.axhline(y=0.5, linestyle="--", color='green')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
ax.add_patch(orange_rect)
blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5)
ax.add_patch(blue_rect)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2)
ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied)
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
ax.legend_.remove() # removes legend
bbox_props_left = dict(boxstyle="larrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5)
t = ax.text(63000, 0, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left)
bbox_props_left = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5)
t = ax.text(25000, 1, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left)
```

```
df_credit_card_applications.head()
```

yearly_income | credit_card_decision | credit_score | decision_label | |
---|---|---|---|---|

0 | 12000 | denied | 580 | 0 |

1 | 14000 | denied | 600 | 0 |

2 | 25000 | denied | 620 | 0 |

3 | 28000 | denied | 640 | 0 |

4 | 38000 | approved | 680 | 1 |

Let's show another reason that a linear regression isn't the best model to fit for this data. I'll now add a single outlier - an individual that applied for this credit card and has a income of \$250,000.

```
wealthy_person = {'yearly_income': [250000],
'credit_card_decision': ["approved"],
'credit_score': [715],
'decision_label': [1]
}
df_wealthy = pd.DataFrame(wealthy_person)
```

```
df_cc_applications_with_outlier = pd.concat([df_credit_card_applications, df_wealthy])
```

The steps below fit a new linear regression model to our new data that includes this outlier.

```
linear_regression2 = LinearRegression()
x_values2 = df_cc_applications_with_outlier['yearly_income'].values.reshape(-1,1)
y_values2 = df_cc_applications_with_outlier['decision_label'].values
lr2 = linear_regression2.fit(x_values2, y_values2)
```

We assign a variable `max_income_outlier`

to our new max income value of \$250,000.

```
max_income_outlier = df_cc_applications_with_outlier['yearly_income'].max()
```

Our visualization below plots the new linear regression line of best fit with this additional outlier point - the orange dot to the far-most right.

It's clear this line **doesn't** capture the variance in our data well. If we followed the previous decision threshold rules mentioned earlier, we'd likely misclassify more observations moving forward.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_cc_applications_with_outlier, s=200)
plt.plot([min_income, max_income_outlier], [lr2.predict(min_income), lr2.predict(max_income_outlier)], c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income with Outlier", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Even without this outlier, it's clear this linear regression model doesn't fit to our data well. Let's not use the data with that outlier. Let's take an earlier visualization of our linear regression line of best fit and view it on a larger x and y scale below.

Wow this linear regression seems off! How could someone have a credit card decision greater than `1`

? That doesn't make sense since you're either approved or denied, what we marked as `1`

or `0`

.

```
zero_income = 0
super_high_income = 200000
plt.figure(figsize=(10, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([zero_income, super_high_income], [lr.predict(zero_income), lr.predict(super_high_income)], c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

### Logistic Regression to the Rescue!

We need a better line of fit in order to classify our credit card application data. First, let's learn how the sigmoid function can be relevant.

#### Sigmoid Function

A **sigmoid function** is a mathematical function that represents an "s"-shaped curve. It's defined by the formula:

I'll create a list of `x_values`

to plot the sigmoid function. The line of code below creates a numpy array of 120 values linearly spaced between -10 and 10. Here's the official documentation page.

```
x_values_for_sigmoid = np.linspace(-10, 10, 120)
```

Preview the first 5 values of `x_values`

.

```
x_values_for_sigmoid[0:5]
```

```
array([-10. , -9.83193277, -9.66386555, -9.49579832,
-9.32773109])
```

Our sigmoid equation from above wrapped into a Python function.

```
def sigmoid(x):
return 1 / (1 + np.exp(-x))
```

We can see from the plot below that as `x`

increase, the line approaches `1.0`

but never reaches it exactly (unless you round up).

Similarly, as `x`

decreases, the line approaches `0.0`

but never reaches it exactly (unless you round down).

These limits of this function are exactly what we need to build our line of best fit for a binary classification. This resolves many of the problems we had with using linear regression in this situation.

```
plt.figure(figsize=(8, 7))
plt.plot(x_values_for_sigmoid, sigmoid(x_values_for_sigmoid), c='teal')
plt.title('Sigmoid Function')
plt.grid(True)
plt.text(2, 0.3, r'$\sigma(x)=\frac{1}{1+e^{-x}}$', fontsize=26)
plt.show()
```

### Logistic Function

Below is the logistic function for a single feature. The denominator looks similar to linear regression with an intercept, coefficient and error value.

#### Fit Logistic Regression Model on Credit Card Data

We'd expect a normal distribution with income levels. So, most incomes would be around the mean and median value. I use numpy's `normal`

method from the `random`

module to generate the data. See the documentation here.

```
yearly_incomes_us_dollars = np.random.normal(loc=50000, scale=15000, size=500)
```

```
sns.distplot(yearly_incomes_us_dollars)
```

```
/Users/dan/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
<matplotlib.axes._subplots.AxesSubplot at 0x1a15d3e278>
```

```
len(yearly_incomes_us_dollars)
```

```
500
```

I'm going to sort `yearly_income_us_dollars`

from least to greatest.

```
yearly_incomes_us_dollars = sorted(yearly_incomes_us_dollars)
```

```
credit_card_decisions = [0]*180 + [1]*4 + [0]*16 + [1]*5 + [0]*10 + [1]*(500-215)
```

```
application_data = {'yearly_income': yearly_incomes_us_dollars,
'credit_card_decision': credit_card_decisions
}
df_credit_card_applications2 = pd.DataFrame(application_data)
```

I want to now see the pattern between income level and credit card decision. I figure people that earn more money generally are accepted more than people who earn less money.

I'll use the Pandas cut method to bin income levels by `very low`

, `low`

, `medium`

, `high`

and `very high`

.

```
income_bin_labels = pd.cut(df_credit_card_applications2['yearly_income'], bins=5, labels=["very low", "low", "medium", "high", "very high"])
```

```
df_credit_card_applications2['income_bucket'] = income_bin_labels
```

There's an equal number of people in each of the five bins.

```
df_credit_card_applications2['income_bucket'].value_counts()
```

```
medium 245
high 156
low 71
very high 20
very low 8
Name: income_bucket, dtype: int64
```

Now that our continuous variable (income) is a categorical variable (income level), we can see the mix of the count of credit card decisions by income level. To visualize the data, I use Seaborn's `countplot()`

method.

Our hypothesis was expected in that wealther individuals are more likely to get accepted for a credit card (decision value of 1) compared to less wealthy people.

```
plt.figure(figsize=(14, 8))
sns.countplot(x='income_bucket', hue='credit_card_decision', data=df_credit_card_applications2)
plt.title("Credit Card Decisions by Income Buckets", y=1.015)
plt.ylabel("count [people]", labelpad=13)
plt.xlabel("income bucket level", labelpad=13);
```

```
plt.figure(figsize=(14, 8))
sns.catplot(x="credit_card_decision", y="yearly_income", kind="swarm", data=df_credit_card_applications2, height=8.5, aspect=.9)
plt.xlabel("credit card decision", labelpad=15)
plt.ylabel("yearly income [$]", labelpad=15)
plt.title("Distribution of Yearly Income by Credit Card Decision", y=1.013);
```

```
<matplotlib.figure.Figure at 0x1a1750e978>
```

```
X = df_credit_card_applications2['yearly_income'].values.reshape(-1, 1)
```

We need to standardize our feature `yearly_income`

and can do so using StandardScaler in sklearn.

Why do we standardize features here? We talked about this in a previous class. Logistic regression by default uses regularization and regularization works best when we standardize our features. Otherwise, if we have outliers, we'll have issues in regularization.

The equation to transform each of our feature values to get a standardized feature value is below. This rescales our features to have a mean of 0 and unit variance - a variance qual to 1.

In layman's terms, I'd describe this as: for every feature value, subtract the mean of the feature vector and divide that numerator by the standard deviation of the feature vector.

```
scaler = StandardScaler()
```

```
scaler.fit(X)
```

```
StandardScaler(copy=True, with_mean=True, with_std=True)
```

```
X_transformed = scaler.transform(X)
```

We can preview a histogram to see the distribution of our new feature-scaled yearly income values.

There's a uniform distribution with a median of roughly 0 and a mean of roughly 0.

```
plt.figure(figsize=(8, 6))
sns.distplot(X_transformed)
plt.title("Distribution of transformed yearly income values", y=1.015)
plt.xlabel("transformed income value", labelpad=12)
plt.ylabel("frequency of occurence", labelpad=12);
```

```
/Users/dan/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
```

```
y = df_credit_card_applications2['credit_card_decision'].values
```

We will use the Logistic Regression module built into sklearn. Remember to always read the default parameters since this implementation uses regularization. Here is the documentation page.

```
log_reg = LogisticRegression()
```

Let's do a train-test-split and use 65% of observations for training our model and the other 35% for testing how our model performs on unseen data. The official documentation page for train-test-split is here on sklearn.

```
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.35, random_state=42)
```

Let's fit our training data to our Logistic Regression object.

```
log_reg.fit(X_train, y_train)
```

```
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
```

#### Visualize Results of Logistic Regression Model Predictions

Our model is fitted. We can now visualize our line of best fit on our training data.

Later, we'll cover model evaluation metrics.

Our `log_reg`

object has a `predict_proba()`

method. So, if we pass in an x-value (after our feature scaling transformation), we're outputted the probability that observations belongs to class 0 (denied) and class 1 (approved).

In the example below, the first item in that list is the probability that this observation belongs to class 0, and the second list item is the probability this this observation belongs to class 1.

```
log_reg.predict_proba(X_train)[0]
```

```
array([0.71828833, 0.28171167])
```

I want to create a single array of *just* the probabilities that our model predicts in which each of the observations from our training data belongs to class 1.

```
X_train_probability_class_1 = log_reg.predict_proba(X_train)[:, 1]
```

```
X_train_probability_class_1[0]
```

```
0.2817116677754508
```

To plot the data, I'm going to put our `X_train`

observations and `X_train_probability_class_1`

predictions as two columns.

```
predictions = {"x_train_points": X_train.flatten(),
"x_train_probability_being_class_1": X_train_probability_class_1}
df_predictions = pd.DataFrame(predictions)
```

Hey look! It's our plotted predictions and it looks like a sigmoid function!

```
ax = df_predictions.plot.scatter(x="x_train_points", y="x_train_probability_being_class_1", figsize=(10, 10), c='blueviolet')
plt.title("Predictions on Training Data", y=1.015, fontsize=20)
plt.xlabel("yearly income feature scaled", labelpad=14)
plt.ylabel("probability of a prediction being class 1 (approved)", labelpad=14);
```

So if our logistic regression model outputs a probability between 0 and 1 of an observation belonging to class 1, then how do we classify if the observation belongs to class 0 or class 1?

As we talked about earlier, you have a **decision threshold**. In binary classification problems in sklearn, there is a default probability decision threshold of 0.5. So, if for an observation we predict a 0.6 probability it belongs to class 1, since 0.6 is greater than 0.5, we'll round up and classify that observation as belonging to class 1.

```
ax = df_predictions.plot.scatter(x="x_train_points", y="x_train_probability_being_class_1", figsize=(10, 10), c='blueviolet')
plt.title("Predictions on Training Data", y=1.015, fontsize=20)
plt.xlabel("yearly income feature scaled", labelpad=14)
plt.ylabel("probability of a prediction being class 1 (accepted)", labelpad=14)
plt.axhline(y=0.5, linestyle="--", color='green')
bbox_props_decision_threshold = dict(boxstyle="round", fc="snow", ec="0.8", alpha=0.8)
ax.text(0.35, 0.53, "Decision Threshold", bbox=bbox_props_decision_threshold, size=25);
```

Below, I shaded the areas in which if we predict a new observation lies in that area, we'll classify the observation as belonging either to class 0 or class 1.

```
ax = df_predictions.plot.scatter(x="x_train_points", y="x_train_probability_being_class_1", figsize=(10, 10), c='blueviolet')
plt.title("Predictions on Training Data", y=1.015, fontsize=20)
plt.xlabel("yearly income feature scaled", labelpad=14)
plt.ylabel("probability of a prediction being class 1 (accepted)", labelpad=14)
plt.axhline(y=0.5, linestyle="--", color='green')
bbox_props_decision_threshold = dict(boxstyle="round", fc="snow", ec="0.8", alpha=0.8)
ax.text(-2.60, 0.53, "Decision Threshold", bbox=bbox_props_decision_threshold, size=22);
orange_rect = patches.Rectangle((-0.2, 0.5), 3.7, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
ax.add_patch(orange_rect)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.3)
ax.text(1.5, 0.7, "Round up to 1", ha="center", va="center", size=23, bbox=bbox_props_approved)
blue_rect = patches.Rectangle((-2.7, 0), 2.5, 0.5, linewidth=1, edgecolor='blue', facecolor='steelblue', alpha=0.5)
ax.add_patch(blue_rect)
bbox_props_denied = dict(boxstyle="round", fc="white", ec="0.8", alpha=0.3)
ax.text(-1.5, 0.4, "Round down to 0", ha="center", va="center", size=23, bbox=bbox_props_denied);
```

As a hypothetical scenario, we work for this bank, and let's say all of a sudden many people who we give credit cards out to are not paying off their bills in time and so now the bank is low in cash. We don't know the exact underlying reason why this is happening all of a sudden. One hypothesis could be that we've been too lenient about giving out credit cards to people who are on the fence of deserving them.

We can utilize this knowledge to modify our predictions. Notice how above the probability decision threshold is at 0.5. If we wanted to be pickier about selecting high-quality appilcants for credit cards, we could move our decision threshold up to a value such as 0.7. Then, for our simple example, we'd only approve applicants with a higher income score.

#### Visualize Predictions - Writing our Logistic Regression Prediction Function

Below is the logistic function for a single feature again.

Let's look at our coefficient.

```
log_reg.coef_
```

```
array([[5.76770022]])
```

Let's look at our intercept.

```
log_reg.intercept_
```

```
array([1.1217481])
```

We can use the intercept and coefficient from our logistic regression model and simply plug it into the logistic function to visualize our predictions (same as what's above).

```
def log_reg_model_credit_card_applications_predict(yearly_income):
return 1 / (1 + np.exp(-(log_reg.intercept_ + log_reg.coef_ * yearly_income)))
```

```
plt.figure(figsize=(10, 10))
plt.scatter(x=X_train, y=log_reg_model_credit_card_applications_predict(X_train), c='blueviolet', alpha=0.5)
plt.title("Predictions on Training Data", y=1.015, fontsize=20)
plt.xlabel("yearly income feature scaled", labelpad=14)
plt.ylabel("probability of a prediction being class 1 (accepted)", labelpad=14);
```

### Classification Evaluation Metrics

#### Accuracy

**Accuracy** is the proportion of correct predictions. You could also multiply the proportion by 100 to report a percentage value.

Let's view a simplified example of our predictions on an unseen dataset and calculcate accuracy.

Person id | Annual Income in U.S. dollars | Model's prediction of credit card decision | Actual credit card decision |
---|---|---|---|

1 | 30000 | 0 | 0 |

2 | 45000 | 0 | 1 |

3 | 48000 | 1 | 1 |

4 | 55000 | 1 | 1 |

5 | 80000 | 1 | 1 |

There were 5 total observations. For 4/5 observations, our model predicted a label equivalent to the actual label. 4/5 is 0.8. So we can say the accuracy score of this model is `0.8`

.

In sklearn, for our logistic regression object, which we assigned to `log_reg`

, we can use the `score`

method and pass in our test data to return an accuracy score.

Typically you report on the accuracy of your model against the test set as a measure for the model's performance.

```
accuracy_test_set = log_reg.score(X_test, y_test)
```

```
accuracy_test_set
```

```
0.9714285714285714
```

I want to round this value to two decimal places. See this Stack Overflow article for the method.

```
print("{0:.2f}".format(round(accuracy_test_set, 2)))
```

```
0.97
```

There is one major caveat of using accuracy as an evaluation metric of your predictive algorithm. Let's illustrate this caveat through an example. Let's say we're a doctor and we're trying to diagnose people that may have a deadly disease. For most people that come into our office, they *aren't* diagnosed with the disease.

Below is a sample of the data.

Person id | Age | Family history of disease | Model's disease prediction | Actual diagnosis of disease |
---|---|---|---|---|

1 | 38 | "yes" | "no" | "yes" |

2 | 42 | "no" | "no" | "no" |

3 | 58 | "yes" | "no" | "no" |

4 | 39 | "no" | "no" | "no" |

5 | 55 | "no" | "no" | "no" |

6 | 61 | "no" | "no" | "no" |

7 | 52 | "no" | "no" | "no" |

8 | 45 | "no" | "no" | "no" |

9 | 59 | "no" | "no" | "no" |

10 | 48 | "no" | "no" | "no" |

We could build a model with close to 1.0 accuracy by *always* predicting someone doesn't have the disease. But what happens when someone ends up having the disease? If we say they don't, we won't give them the proper care they need to treat the disease.

This is an example of **imbalanced classes** - since there's more observations of one class than another. You can learn more about how to fix evaluation of models with imbalanced classes through this article on Machine Learning Mastery. We won't go into detail on the topic in this tutorial.

#### Confusion Matrix

Below we'll pre-process our data to build a confusion matrix. A **confusion matrix** is a table that describes the performance of a classification model on a set of test data for which the actual labels are known. Let's begin...

Our `X_test`

values are meant to utilized to evaluate how our model performs on these unseen inputs. We use our model to make predictions/classify the observations given `X_test`

. We assign our list of class predictions to the variable `y_test_predictions`

.

```
y_test_predictions = log_reg.predict(X_test)
```

As expected, we see a list of 1s and 0s.

```
y_test_predictions
```

```
array([1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1,
1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0,
1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0,
1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
```

We can use `Counter`

to create a dictionary of how many predictions are of class 1 versus of class 0.

```
Counter(y_test_predictions)
```

```
Counter({1: 94, 0: 81})
```

```
class_names = ["denied", "approved"]
```

Let's plot our confusion matrix and below we'll explain each element of it. The function below was originally modified from this sklearn documentation page.

```
def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
"""
This function prints and plots the confusion matrix.
Normalization can be applied by setting `normalize=True`.
"""
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=0)
plt.yticks(tick_marks, classes)
name = None
fmt = 'd'
thresh = cm.max() / 2
# i and j tell us the coordinates of boxes
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
if i == 0 and j == 0:
name = "True Negatives"
elif i == 0 and j == 1:
name = "False Positives"
elif i == 1 and j == 1:
name = "True Positives"
else:
name = "False Negatives"
plt.text(j, i, format(cm[i, j], fmt) + "\n" + name, horizontalalignment="center",
fontsize=23, color="white" if cm[i, j] > thresh else "black")
plt.ylabel('True label', labelpad=12)
plt.xlabel('Predicted label', labelpad=12)
plt.title(title, y=1.02)
plt.tight_layout()
# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, y_test_predictions);
np.set_printoptions(precision=2);
# Plot non-normalized confusion matrix
plt.figure(figsize=(7, 7))
plt.grid(False)
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix, without normalization')
```

Our first insights, ignoring the text in the boxes:

- There are two class labels:
*approved*and*denied*. - The x-axis is labeled
*Predicted label*- meaning our model's predictions given`X_test`

. - The y-axis is labeled
*True label*- meaning the class labels for our`X_test`

observations. - The model made a total of 175 predictions (equivalent to 75+3+4+93) on if those 175 individuals should be approved or denied of a credit card.
- Out of the 175 predictions, 79 (equivalent to 4 + 75) were for the
*denied*class. - Out of the 175 predictions, 96 (equivalent to 93 + 3) were for the
*approved*class.

Let's explain each of the numbers in the boxes:

- For 75 observations (from
`X_test`

) the model predicted*denied*and the label was actually*denied*. - For 4 observations (from
`X_test`

) the model predicted*denied*but the label was actually*approved*. - For 3 observations (from
`X_test`

) the model predicted*approved*but the label was actually*denied*. - For 93 observations (from
`X_test`

) the model predicted*approved*and it was actually*approved*.

The exact names of True Negatives, False Positives, False Negatives and True Positives will come in handy for other significant evaluation metrics of classification models.

#### Precision

**Precision**: for our predictions of the positive class (*approved* in our example), how often is it correct?

The equation is:

Precision = True Positives / (True Positives + False Positives)

We can manually calculcate precision given the values in the confusion matrix above.

Precision = 93 / (93+3) = 0.96875

In sklearn, we can use the `precision_score`

method from the `metrics`

module.

```
print("Precision: {0}".format(precision_score(y_test, y_test_predictions)))
```

```
Precision: 0.9893617021276596
```

#### Recall

**Recall**: when the actual label is positive (*approved* in our example), how often did we predict correctly?

The equation is:

Recall = True Positives / (True Positives + False Negatives)

We can manually calculate recall given the values in the confusion matrix above.

Recall = 93 / (93 + 4) = 0.96

In sklearn, we can use the `recall`

method from the `metrics`

module.

```
print("Recall:", round(recall_score(y_test, y_test_predictions), 2))
```

```
Recall: 0.96
```