# Visual Introduction to Classification and Logistic Regression

- December 25, 2018
- Key Terms: classification, logistic regression, math

This tutorial is a high-level overview of classification problems in machine learning and of how Logistic Regression works with a single feature and a binary target.

I'll cover the following topics:

- Overview of Classification & Key Terms
- Most Popular Classification Algorithms
- Examples of Classification in Industry
- Walkthrough of Credit Card Application Visualizations
- Logistic Regression
- Evaluation Metrics
  - Accuracy
  - Precision & Recall

#### Import Modules

```
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from collections import Counter
import itertools
import seaborn as sns
import matplotlib.ticker as tick
import matplotlib.patches as patches
import matplotlib.pyplot as plt
%matplotlib inline
```

#### Visualization Setup

```
sns.set_context("talk")
sns.set_style("whitegrid", {'grid.color': '.92'})
def reformat_large_tick_values(tick_val, pos):
    """
    Turns large tick values (in the billions, millions and thousands) such as 4500 into 4.5K and also appropriately turns 4000 into 4K (no zero after the decimal).
    """
    if tick_val >= 1000000000:
        val = round(tick_val/1000000000, 1)
        new_tick_format = '{:}B'.format(val)
    elif tick_val >= 1000000:
        val = round(tick_val/1000000, 1)
        new_tick_format = '{:}M'.format(val)
    elif tick_val >= 1000:
        val = round(tick_val/1000, 1)
        new_tick_format = '{:}K'.format(val)
    elif tick_val < 1000:
        new_tick_format = round(tick_val, 1)
    else:
        new_tick_format = tick_val

    # make new_tick_format into a string value
    new_tick_format = str(new_tick_format)

    # code below will keep 4.5M as is but change values such as 4.0M to 4M since that zero after the decimal isn't needed
    index_of_decimal = new_tick_format.find(".")

    if index_of_decimal != -1:
        value_after_decimal = new_tick_format[index_of_decimal+1]
        if value_after_decimal == "0":
            # remove the 0 after the decimal point since it's not needed
            new_tick_format = new_tick_format[0:index_of_decimal] + new_tick_format[index_of_decimal+2:]

    return new_tick_format
```

### Overview of Classification & Key Terms

#### Problem Setup

Classification involves building a best fit equation to predict the probability an observation is of a certain class/label. Let's break down what some of those terms mean.

In order to use a classification algorithm, you must be provided with observations, features and labeled data. Let's say we work for a bank and it's our job to approve or deny credit card applicants.

#### Example: Credit Card Application

Most Americans age 18+ likely have a debit or credit card. Credit cards are issued by banks and businesses, allowing the holder to purchase goods or services on credit. Essentially, you can buy items without having the cash on hand, in exchange for agreeing to pay the bank back later. You can learn more about applying for credit cards on Nerdwallet.

To obtain a credit card, you must apply for one through a bank. A bank will decide/classify if you are worthy of the card.

Below is some fake sample data that indicates important features considered before one is able to be approved for a credit card.

*Credit score*, *debt*, *yearly income*, and *age* are all features. The field for the classification label is *credit card decision*.

| Name | Credit Score | Debt (\$) | Yearly Income (\$) | Age | Credit Card Decision |
|---|---|---|---|---|---|
| Joe Smith | 610 | 3000 | 45000 | 24 | Denied |
| Jill Mason | 620 | 12000 | 58000 | 22 | Approved |
| Brandon Cohen | 700 | 0 | 90000 | 27 | Approved |
| Ariel Pan | 720 | 0 | 110000 | 29 | Approved |

There's a small trend: people who have a higher credit score, less debt and more income are more likely to be approved for a particular credit card.

In the table above, each row is considered an **observation**. Each observation is a record of a person, with additional details on their credit score, debt, yearly income and age.

The columns *Credit Score*, *Debt (\$)*, *Yearly Income (\$)* and *Age* are considered **features**. They are measurable characteristics of each observation.

The last column in our table is *Credit Card Decision* and represents a **class**, sometimes called a **label**. In the past, a representative from the bank would choose to approve or deny each credit card application based on that observation's features. Data like this, in which every observation already has a known class, is called **labeled data**.
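To make the vocabulary concrete, here is a minimal sketch that rebuilds the sample table as a pandas DataFrame and splits it into features and a label. The names `df_sample`, `feature_columns`, `X_sample` and `y_sample` are my own choices for illustration and aren't used elsewhere in this tutorial.

```python
import pandas as pd

# Rebuild the fake sample table from above; each row is one observation.
df_sample = pd.DataFrame({
    "name": ["Joe Smith", "Jill Mason", "Brandon Cohen", "Ariel Pan"],
    "credit_score": [610, 620, 700, 720],
    "debt": [3000, 12000, 0, 0],
    "yearly_income": [45000, 58000, 90000, 110000],
    "age": [24, 22, 27, 29],
    "credit_card_decision": ["Denied", "Approved", "Approved", "Approved"],
})

# Features: the measurable characteristics of each observation.
feature_columns = ["credit_score", "debt", "yearly_income", "age"]
X_sample = df_sample[feature_columns]

# Label (class): the decision a bank representative made in the past.
y_sample = df_sample["credit_card_decision"]

print(X_sample.shape)      # (number of observations, number of features)
print(y_sample.nunique())  # number of distinct classes
```

The name column is left out of the features on purpose: it identifies a person but isn't a measurable characteristic we'd expect to drive the decision.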

#### Binary Classification

In the example above, there were just two possible class labels - either *Denied* or *Approved*. Therefore, this is considered a binary classification example. The term *binary* means related to two things.

#### Multiclass Classification

In machine learning, multiclass classification is the problem of classifying observations into one of *three* or more classes.

### Most Popular Classification Algorithms

I consider the five algorithms below *classical* machine learning algorithms, since many were first described decades ago.

- Logistic Regression
- SVM
- Naive Bayes
- Decision Trees
- Random Forest

In this notebook, we're just going to learn Logistic Regression. Why? Logistic Regression is simple to implement and fits to data quickly. This model is also very interpretable, both in the math behind how it works and in what it says about each feature. Not all of the other algorithms listed above meet these criteria.

However, note that Logistic Regression is often regarded as one of the simpler classification algorithms. It is just OK at capturing the variance with many features, so you likely won't get as strong a fit with it as with more complex machine learning models like XGBoost or neural networks.

Let's next talk about some additional business scenarios in which classification is commonly used in industry.

### Popular Examples of Classification Models in Industry

- Diagnosed with a certain type of disease
- Spam classification of text comments
- Image classification
  - Hot dog or not hot dog
  - Nudity or not nudity
  - Food safe to eat or not
  - Recognize objects on streets for self-driving cars
- Customer churn - essentially will a person continue using an app/service or leave
- Default or not on a credit card

### Visual Derivation of Logistic Regression Equation

Let's continue with our credit card application example. We'll utilize a single feature, *yearly income*, and two class labels of *approved* and *denied*.

```
yearly_incomes_us_dollars = [12000, 14000, 25000, 28000, 38000, 45000, 50000, 55000, 60000, 66000, 73000, 75000]
credit_card_decisions = ["denied"]*4 + ["approved"] + ["denied"]*2 + ["approved"]*5
```

```
application_data = {'yearly_income': yearly_incomes_us_dollars,
'credit_card_decision': credit_card_decisions
}
df_credit_card_applications = pd.DataFrame(application_data)
```

```
plt.figure(figsize=(9, 7))
sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Upon looking at this graph, I think of a divide in which people who earn over \$51,000 per year are more likely to be approved for this credit card. Below, I visualize my first thought with a *Denied Side* and *Approved Side* annotation.

```
plt.figure(figsize=(9, 7))
ax = sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
plt.axvline(x=51000, linestyle="--", color='green')
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(65000, 0.5, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3)
ax.text(30000, 0.5, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```
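The eyeballed divide can also be written as a one-line rule and checked against the 12 observations. This is only a sketch: `classify_by_income` and `INCOME_CUTOFF` are hypothetical names I'm introducing here, and the \$51,000 cutoff is the visual guess from the plot, not something a model learned.

```python
yearly_incomes = [12000, 14000, 25000, 28000, 38000, 45000,
                  50000, 55000, 60000, 66000, 73000, 75000]
decisions = ["denied"] * 4 + ["approved"] + ["denied"] * 2 + ["approved"] * 5

INCOME_CUTOFF = 51000  # the eyeballed dividing line from the plot


def classify_by_income(income, cutoff=INCOME_CUTOFF):
    """Hypothetical helper: approve when income is above the cutoff."""
    return "approved" if income > cutoff else "denied"


predicted = [classify_by_income(income) for income in yearly_incomes]

# Collect the incomes where the rule disagrees with the actual decision.
misclassified_incomes = [
    income
    for income, actual, guess in zip(yearly_incomes, decisions, predicted)
    if actual != guess
]
rule_accuracy = 1 - len(misclassified_incomes) / len(yearly_incomes)

print(misclassified_incomes)
print(round(rule_accuracy, 3))
```

On this data the rule gets a single observation wrong: the applicant earning \$38,000 who was approved.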

#### 2 Features

Let's gather additional data to help us build a best-fit equation that classifies whether someone is approved for or denied a credit card.

Below, I incorporate data for each person's credit score.

```
credit_scores = [580, 600, 620, 640, 680, 670, 650, 700, 690, 710, 680, 715]
```

```
application_data["credit_score"] = credit_scores
df_credit_card_applications = pd.DataFrame(application_data)
```

```
df_credit_card_applications
```

Let's visualize our two features and two class labels.

```
plt.figure(figsize=(10, 8))
sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

#### Visually Predict Which Class a New Observation Belongs To

Below we see a new green plus sign to mark a new observation that doesn't have a labeled class yet. What class do you think it belongs to?

I would classify the new observation as being denied.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.scatter([40000], [600], c='green', s=200, marker='+')
bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="g", lw=2.5)
t = ax.text(30000, 600, "Predict class", ha="center", va="center", size=17, bbox=bbox_props)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Below is another example, a purple plus sign, in which we have a new observation but no class label. What class do you think it belongs to?

This one is trickier. There's a decent chance/probability that this observation could be approved or denied. I would hypothesize there's a 0.60 probability that this new observation is denied.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
plt.scatter([45000], [660], c='purple', s=200, marker='+')
bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="purple", lw=2.5)
t = ax.text(35000, 660, "Predict class", ha="center", va="center", size=17, bbox=bbox_props)
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

In my mind, I think of a dividing line that would separate the two classes. I plotted one below.

However, this isn't perfect. There's an observation that was approved on my denied side. I figure that with additional data points, there will be more misclassifications like this one.

```
plt.figure(figsize=(10, 8))
ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(65000, 660, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3)
ax.text(28000, 660, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied);
plt.plot([35000, 69000], [715, 585], linestyle="--", color='green')
plt.title("Credit Card Application Details", y=1.015, fontsize=20)
plt.xlabel("yearly income", labelpad=13)
plt.ylabel("credit score", labelpad=13)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```
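The dashed line above can be turned into a rough classifier by checking which side of the line a point falls on. This is a sketch under my own assumptions: `classify_by_line` is a hypothetical helper, and I treat points above the line as the approved side, which matches the annotations in the plot.

```python
yearly_incomes = [12000, 14000, 25000, 28000, 38000, 45000,
                  50000, 55000, 60000, 66000, 73000, 75000]
credit_scores = [580, 600, 620, 640, 680, 670, 650, 700, 690, 710, 680, 715]
decisions = ["denied"] * 4 + ["approved"] + ["denied"] * 2 + ["approved"] * 5

# Endpoints of the dashed green line drawn in the plot above.
(x1, y1), (x2, y2) = (35000, 715), (69000, 585)
line_slope = (y2 - y1) / (x2 - x1)


def classify_by_line(income, score):
    """Hypothetical helper: points above the dashed line are on the approved side."""
    score_on_line = y1 + line_slope * (income - x1)
    return "approved" if score > score_on_line else "denied"


predicted_sides = [
    classify_by_line(income, score)
    for income, score in zip(yearly_incomes, credit_scores)
]

# Observations that land on the wrong side of the line.
misclassified_points = [
    (income, score)
    for income, score, actual, guess
    in zip(yearly_incomes, credit_scores, decisions, predicted_sides)
    if actual != guess
]

print(misclassified_points)
print(classify_by_line(40000, 600), classify_by_line(45000, 660))
```

The last line applies the same rule to the green and purple plus signs from the earlier plots. Note that a hard rule like this returns only a side of the line, not a probability such as the 0.60 I guessed for the purple point.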

#### Let's Try Linear Regression for Classification

Below I create a new column called `decision_label` that marks a `1` if the `credit_card_decision` is `approved` and a `0` if `denied`.

```
df_credit_card_applications['decision_label'] = np.where(df_credit_card_applications['credit_card_decision']=="approved", 1, 0)
```

Preview the DataFrame to visually check our logic is correct.

```
df_credit_card_applications[['credit_card_decision', 'decision_label']].sample(5)
```

We can visualize this new `1` and `0` class label on a plot below.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

We'll fit a linear regression model to our entire dataset. First, instantiate the `LinearRegression` object that was imported at the top of our script and assign it to the variable `linear_regression`. You can read more in the official sklearn documentation for Linear Regression.

```
linear_regression = LinearRegression()
```

Let's build our linear regression line of best fit and assign it to `lr`. First, we have to get our list of x-values and y-values in the right data format for sklearn.

```
x_values = df_credit_card_applications['yearly_income'].values.reshape(-1,1)
```

```
x_values
```

```
y_values = df_credit_card_applications['decision_label'].values
```

```
y_values
```

```
lr = linear_regression.fit(x_values, y_values)
```

The coefficient of our linear regression model.

```
lr.coef_
```

The intercept of our linear regression model.

```
lr.intercept_
```

Our linear regression line of best fit can be modeled by the equation:

`target = lr.intercept_ + yearly_income * lr.coef_`
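For a single feature, the intercept and coefficient of an ordinary least squares line have a closed form, so the equation above can be reproduced with plain NumPy. The sketch below is not how sklearn computes it internally; it's just a way to check the numbers by hand. The names `slope`, `intercept` and `predict_target` are mine, and the results should agree with `lr.coef_` and `lr.intercept_` up to floating-point error.

```python
import numpy as np

x = np.array([12000, 14000, 25000, 28000, 38000, 45000,
              50000, 55000, 60000, 66000, 73000, 75000], dtype=float)
y = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least squares with one feature:
#   slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
#   intercept = mean_y - slope * mean_x
x_centered = x - x.mean()
slope = (x_centered * (y - y.mean())).sum() / (x_centered ** 2).sum()
intercept = y.mean() - slope * x.mean()


def predict_target(yearly_income):
    """Evaluate target = intercept + yearly_income * slope."""
    return intercept + yearly_income * slope


print(slope, intercept)
print(predict_target(50000))
```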

Let's plot our linear regression line of best fit using the minimum and maximum values from our x and y axes.

```
min_income = df_credit_card_applications['yearly_income'].min()
max_income = df_credit_card_applications['yearly_income'].max()
```

```
print("min_income: {0}".format(min_income))
print("max_income: {0}".format(max_income))
```

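We can use the `predict` method to predict a credit card decision value given a yearly income value. The method expects a 2D array of shape `(n_samples, n_features)`, so a single income is wrapped in a nested list.

```
lr.predict([[min_income]])
```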

Here is our linear regression line of best fit in the range of yearly income values in our dataset.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], lr.predict([[min_income], [max_income]]), c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

Below, we calculate our R^2 when trained on the entire dataset. At about `0.59`, it's reasonable, but we fail to capture a lot of the variance in our data.

```
r_squared = lr.score(x_values, y_values)
```

```
r_squared
```

Below, I plot a decision threshold as a dotted green line that's equivalent to `y=0.5`. A **decision threshold** converts the result of a quantitative test into a simple binary decision.

For example, given an input of a yearly income value, if we get a prediction value greater than `0.5`, we'll simply round up and classify that observation as approved.

Let's try a yearly income of \$50,000.

```
lr.predict([[50000]])
```

Our returned result of `0.59` is greater than `0.5` so we can round up to `1` and classify this observation as being approved for a credit card.

Now let's try a yearly income of \$33,000.

```
lr.predict([[33000]])
```

Our returned result of about `0.28` is less than `0.5` so we can round down to `0` and classify this observation as being rejected for a credit card.

Below, I visualize this decision boundary.

```
plt.figure(figsize=(11, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], lr.predict([[min_income], [max_income]]), c='darkviolet')
plt.axhline(y=0.5, linestyle="--", color='green')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
ax.add_patch(orange_rect)
blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5)
ax.add_patch(blue_rect)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2)
ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied)
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```

This model clearly isn't perfect. Below, I point to two observations that our model misclassifies given our linear model and rounding up/down logic.

More info on drawing rectangles in Matplotlib: patches.Rectangle documentation page.

```
plt.figure(figsize=(11, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], lr.predict([[min_income], [max_income]]), c='darkviolet')
plt.axhline(y=0.5, linestyle="--", color='green')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
ax.add_patch(orange_rect)
blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5)
ax.add_patch(blue_rect)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2)
ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied)
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
ax.legend_.remove() # removes legend
bbox_props_left = dict(boxstyle="larrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5)
t = ax.text(63000, 0, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left)
bbox_props_left = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5)
t = ax.text(25000, 1, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left)
```
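The two misclassified observations can also be found programmatically by applying the `0.5` threshold to every prediction. In this sketch I refit the same least-squares line with `np.polyfit` so the block runs on its own; `DECISION_THRESHOLD` and the other variable names are my own.

```python
import numpy as np

incomes = np.array([12000, 14000, 25000, 28000, 38000, 45000,
                    50000, 55000, 60000, 66000, 73000, 75000], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1])

# With degree 1, np.polyfit returns [slope, intercept] of the least-squares line.
slope, intercept = np.polyfit(incomes, labels, 1)
scores = intercept + slope * incomes

# Round up to 1 (approved) above the threshold, down to 0 (denied) otherwise.
DECISION_THRESHOLD = 0.5
predicted_labels = (scores > DECISION_THRESHOLD).astype(int)

# Keep the incomes where the thresholded prediction disagrees with the label.
is_wrong = predicted_labels != labels
misclassified_incomes = incomes[is_wrong].astype(int).tolist()

print(misclassified_incomes)
```

The result lists one approved applicant below the threshold and one denied applicant above it, which are the two points marked in the plot.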

```
df_credit_card_applications.head()
```

Let's show another reason that linear regression isn't the best model to fit to this data. I'll now add a single outlier - an individual who applied for this credit card and has an income of \$250,000.

```
wealthy_person = {'yearly_income': [250000],
'credit_card_decision': ["approved"],
'credit_score': [715],
'decision_label': [1]
}
df_wealthy = pd.DataFrame(wealthy_person)
```

```
df_cc_applications_with_outlier = pd.concat([df_credit_card_applications, df_wealthy])
```

The steps below fit a new linear regression model to our new data that includes this outlier.

```
linear_regression2 = LinearRegression()
x_values2 = df_cc_applications_with_outlier['yearly_income'].values.reshape(-1,1)
y_values2 = df_cc_applications_with_outlier['decision_label'].values
lr2 = linear_regression2.fit(x_values2, y_values2)
```

We assign a variable `max_income_outlier` to our new max income value of \$250,000.

```
max_income_outlier = df_cc_applications_with_outlier['yearly_income'].max()
```

Our visualization below plots the new linear regression line of best fit with this additional outlier point - the orange dot on the far right.

It's clear this line **doesn't** capture the variance in our data well. If we followed the previous decision threshold rules mentioned earlier, we'd likely misclassify more observations moving forward.

```
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_cc_applications_with_outlier, s=200)
plt.plot([min_income, max_income_outlier], lr2.predict([[min_income], [max_income_outlier]]), c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income with Outlier", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```
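One way to quantify the outlier's effect is to compare the income at which each fitted line crosses the `0.5` threshold. This is a sketch with a hypothetical helper, `threshold_income`, and it uses `np.polyfit` in place of the `LinearRegression` objects above so the block is self-contained.

```python
import numpy as np

incomes = np.array([12000, 14000, 25000, 28000, 38000, 45000,
                    50000, 55000, 60000, 66000, 73000, 75000], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1], dtype=float)


def threshold_income(x, y, threshold=0.5):
    """Hypothetical helper: income at which the least-squares line crosses the threshold."""
    slope, intercept = np.polyfit(x, y, 1)
    return (threshold - intercept) / slope


cutoff_without_outlier = threshold_income(incomes, labels)

# Add the single approved applicant who earns $250,000.
cutoff_with_outlier = threshold_income(np.append(incomes, 250000.0),
                                       np.append(labels, 1.0))

print(round(cutoff_without_outlier), round(cutoff_with_outlier))
```

A single extra applicant moves the approve/deny cutoff by several thousand dollars, even though that applicant sits far from the region where the decision is actually in doubt.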

Even without this outlier, it's clear this linear regression model doesn't fit our data well. Let's set aside the data with that outlier, take an earlier visualization of our linear regression line of best fit and view it on a larger x and y scale below.

Wow, this linear regression seems off! How could someone have a credit card decision greater than `1`? That doesn't make sense since you're either approved or denied, which we marked as `1` or `0`.

```
zero_income = 0
super_high_income = 200000
plt.figure(figsize=(10, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([zero_income, super_high_income], lr.predict([[zero_income], [super_high_income]]), c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13)
plt.ylabel("credit card decision", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));
```
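The same problem shows up numerically. The sketch below refits the line with `np.polyfit` (my substitution for `lr`, so the block runs on its own) and checks whether the predicted values at a few incomes could be read as probabilities.

```python
import numpy as np

incomes = np.array([12000, 14000, 25000, 28000, 38000, 45000,
                    50000, 55000, 60000, 66000, 73000, 75000], dtype=float)
labels = np.array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1], dtype=float)

slope, intercept = np.polyfit(incomes, labels, 1)

score_at_zero_income = intercept + slope * 0
score_at_high_income = intercept + slope * 200000

# A value can only be read as a probability if it lies between 0 and 1.
for income, score in [(0, score_at_zero_income), (200000, score_at_high_income)]:
    is_valid_probability = 0.0 <= score <= 1.0
    print(income, round(score, 2), is_valid_probability)
```

Because a straight line is unbounded, its output drops below `0` for low incomes and rises above `1` for high ones.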

### Logistic Regression to the Rescue!

We need a better line of fit in order to classify our credit card application data. First, let's learn how the sigmoid function can be relevant.

#### Sigmoid Function

A **sigmoid function** is a mathematical function that represents an "s"-shaped curve. It's defined by the formula:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

I'll create an array of x-values, `x_values_for_sigmoid`, to plot the sigmoid function. The line of code below creates a numpy array of 120 values linearly spaced between -10 and 10. Here's the official documentation page.

```
x_values_for_sigmoid = np.linspace(-10, 10, 120)
```

Preview the first 5 values of `x_values_for_sigmoid`.

```
x_values_for_sigmoid[0:5]
```

Our sigmoid equation from above wrapped into a Python function.

```
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

We can see from the plot below that as `x` increases, the curve approaches `1.0` but never reaches it exactly (unless you round up).

Similarly, as `x` decreases, the curve approaches `0.0` but never reaches it exactly (unless you round down).

These limits are exactly what we need to build our line of best fit for a binary classification. They resolve many of the problems we had with using linear regression in this situation.

```
plt.figure(figsize=(8, 7))
plt.plot(x_values_for_sigmoid, sigmoid(x_values_for_sigmoid), c='teal')
plt.title('Sigmoid Function')
plt.grid(True)
plt.text(2, 0.3, r'$\sigma(x)=\frac{1}{1+e^{-x}}$', fontsize=26)
plt.show()
```
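A few numeric checks back up what the plot shows. This is a small sketch that repeats the `sigmoid` definition so it runs on its own; the `inputs` array is an arbitrary choice of mine.

```python
import numpy as np


def sigmoid(x):
    return 1 / (1 + np.exp(-x))


inputs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(inputs)

# The midpoint of the curve: an input of 0 maps to exactly 0.5.
print(sigmoid(0.0))

# Every output stays strictly between 0 and 1 and grows with x.
print(outputs)

# The curve is symmetric around its midpoint: sigmoid(-x) == 1 - sigmoid(x).
print(np.allclose(sigmoid(-inputs), 1 - outputs))
```

The midpoint at `0.5` is what makes a decision threshold of `0.5` a natural choice later on: inputs above zero land on one side and inputs below zero land on the other.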

### Logistic Function

Below is the logistic function for a single feature. The exponent in the denominator looks similar to linear regression with an intercept, coefficient and error value.

$$y_{\beta} (x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x + \epsilon)}}$$

#### Fit Logistic Regression Model on Credit Card Data

We'd expect income levels to follow a roughly normal distribution, so most incomes would be around the mean and median value. I use numpy's `normal` method from the `random` module to generate 500 synthetic income values. See the documentation here.

```
yearly_incomes_us_dollars = np.random.normal(loc=50000, scale=15000, size=500)
```

```
sns.distplot(yearly_incomes_us_dollars)
```

```
len(yearly_incomes_us_dollars)
```

I'm going to sort `yearly_incomes_us_dollars` from least to greatest.

```
yearly_incomes_us_dollars = sorted(yearly_incomes_us_dollars)
```

```
credit_card_decisions = [0]*180 + [1]*4 + [0]*16 + [1]*5 + [0]*10 + [1]*(500-215)
```

```
application_data = {'yearly_income': yearly_incomes_us_dollars,
'credit_card_decision': credit_card_decisions
}
df_credit_card_applications2 = pd.DataFrame(application_data)
```

I want to now see the pattern between income level and credit card decision. I figure people that earn more money generally are accepted more than people who earn less money.

I'll use the Pandas cut method to bin income levels by `very low`, `low`, `medium`, `high` and `very high`.

```
income_bin_labels = pd.cut(df_credit_card_applications2['yearly_income'], bins=5, labels=["very low", "low", "medium", "high", "very high"])
```

```
df_credit_card_applications2['income_bucket'] = income_bin_labels
```

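With `bins=5`, `pd.cut` creates five equal-width income ranges, so the number of people in each bin is not equal. Since the incomes are normally distributed, most people fall into the middle bins.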

```
df_credit_card_applications2['income_bucket'].value_counts()
```
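It helps to see the difference between equal-width and equal-count binning side by side. `pd.cut` with an integer `bins` splits the income range into intervals of equal width, so the counts per bin generally differ, while `pd.qcut` splits at quantiles and aims for equal counts. The sketch below uses freshly generated incomes with an arbitrary seed, so the exact counts won't match the ones in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
incomes = pd.Series(rng.normal(loc=50000, scale=15000, size=500))

bucket_labels = ["very low", "low", "medium", "high", "very high"]

# Equal-width bins: each bin spans the same range of dollars.
equal_width = pd.cut(incomes, bins=5, labels=bucket_labels)

# Equal-count bins: each bin holds roughly the same number of people.
equal_count = pd.qcut(incomes, q=5, labels=bucket_labels)

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

For the countplot in this section, equal-width bins are a reasonable choice since the goal is to compare decisions across income ranges rather than across equally sized groups.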

Now that our continuous variable (income) is a categorical variable (income level), we can see the mix of the count of credit card decisions by income level. To visualize the data, I use Seaborn's countplot() method.

Our hypothesis is confirmed: wealthier individuals are more likely to get accepted for a credit card (decision value of 1) compared to less wealthy people.

```
plt.figure(figsize=(14, 8))
sns.countplot(x='income_bucket', hue='credit_card_decision', data=df_credit_card_applications2)
plt.title("Credit Card Decisions by Income Buckets", y=1.015)
plt.ylabel("count [people]", labelpad=13)
plt.xlabel("income bucket level", labelpad=13);
```

```
plt.figure(figsize=(14, 8))
sns.catplot(x="credit_card_decision", y="yearly_income", kind="swarm", data=df_credit_card_applications2, height=8.5, aspect=.9)
plt.xlabel("credit card decision", labelpad=15)
plt.ylabel("yearly income [$]", labelpad=15)
plt.title("Distribution of Yearly Income by Credit Card Decision", y=1.013);
```

```
X = df_credit_card_applications2['yearly_income'].values.reshape(-1, 1)
```

We need to standardize our feature `yearly_income` and can do so using StandardScaler in sklearn.

Why do we standardize features here? We talked about this in a previous class. Logistic regression in sklearn uses regularization by default, and regularization works best when we standardize our features. The penalty acts on the size of the coefficients, which depends on the scale of each feature, so features on very different scales or with extreme values would be penalized unevenly.

The equation to transform each of our feature values to get a standardized feature value is below. This rescales our features to have a mean of 0 and unit variance - a variance equal to 1.

$$x'_{i} = \frac{x_{i} - \bar{x}}{\sigma}$$

In layman's terms, I'd describe this as: for every feature value, subtract the mean of the feature vector and divide that numerator by the standard deviation of the feature vector.

```
scaler = StandardScaler()
```

```
scaler.fit(X)
```

```
X_transformed = scaler.transform(X)
```
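The same transformation can be written by hand with NumPy, which is a handy way to check what the scaler does. This sketch uses freshly generated incomes with an arbitrary seed rather than the notebook's data. It relies on the population standard deviation (`ddof=0`), which is NumPy's default for `std` and is also what `StandardScaler` uses.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen arbitrarily for reproducibility
incomes = rng.normal(loc=50000, scale=15000, size=500)

# Subtract the mean, then divide by the (population) standard deviation.
standardized = (incomes - incomes.mean()) / incomes.std()

print(round(standardized.mean(), 6))  # close to 0
print(round(standardized.std(), 6))   # close to 1
```

Standardizing only shifts and rescales the values, so the shape of the distribution stays the same.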

We can preview a histogram to see the distribution of our new feature-scaled yearly income values.

The values still follow a bell-shaped, roughly normal distribution, now with a median of roughly 0 and a mean of roughly 0.

```
plt.figure(figsize=(8, 6))
sns.distplot(X_transformed)
plt.title("Distribution of transformed yearly income values", y=1.015)
plt.xlabel("transformed income value", labelpad=12)
plt.ylabel("frequency of occurrence", labelpad=12);
```