Machine Learning Classification Tutorial

# Visual Introduction to Classification and Logistic Regression

The tutorial is a high-level overview of classification problems in machine learning and how Logistic Regression works with a single feature and a binary target.

I'll cover the following topics:

• Overview of Classification & Key Terms
• Most Popular Classification Algorithms
• Examples of Classification in Industry
• Walkthrough of Credit Card Application Visualizations
• Logistic Regression
• Evaluation Metrics
• Accuracy
• Precision & Recall

#### Import Modules¶

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from collections import Counter
import itertools
import seaborn as sns
import matplotlib.ticker as tick
import matplotlib.patches as patches
import matplotlib.pyplot as plt
% matplotlib inline


#### Visualization Setup¶

In [2]:
sns.set_context("talk")
sns.set_style("whitegrid", {'grid.color': '.92'})

def reformat_large_tick_values(tick_val, pos):
"""
Turns large tick values (in the billions, millions and thousands) such as 4500 into 4.5K and also appropriately turns 4000 into 4K (no zero after the decimal).
"""
if tick_val >= 1000000000:
val = round(tick_val/1000000000, 1)
new_tick_format = '{:}B'.format(val)
elif tick_val >= 1000000:
val = round(tick_val/1000000, 1)
new_tick_format = '{:}M'.format(val)
elif tick_val >= 1000:
val = round(tick_val/1000, 1)
new_tick_format = '{:}K'.format(val)
elif tick_val < 1000:
new_tick_format = round(tick_val, 1)
else:
new_tick_format = tick_val

# make new_tick_format into a string value
new_tick_format = str(new_tick_format)

# code below will keep 4.5M as is but change values such as 4.0M to 4M since that zero after the decimal isn't needed
index_of_decimal = new_tick_format.find(".")

if index_of_decimal != -1:
value_after_decimal = new_tick_format[index_of_decimal+1]
if value_after_decimal == "0":
# remove the 0 after the decimal point since it's not needed
new_tick_format = new_tick_format[0:index_of_decimal] + new_tick_format[index_of_decimal+2:]

return new_tick_format


### Overview of Classification & Key Terms¶

#### Problem Setup¶

Classification involves building a best fit equation to predict the probability an observation is of a certain class/label. Let's break down what some of those terms mean.

In order to use a classification algorithm, you must be provided with observations, features and labeled data. Let's say we work for a bank and it's our job to approve or deny credit card applicants.

#### Example: Credit Card Application¶

Most Americans age 18+ likely have a debit or credit card. Credit cards are issued by banks and businesses, allowing the holder to purchase goods or services on credit. Essentially, you can buy items without the exact cash on hand - but in agreement to pay back the bank later. You can learn more about applying for credit cards on Nerdwallet.

To obtain a credit card, you must apply for one through a bank. A bank will decide/classify if you are worthy of the card.

Below is some fake sample data that indicates important features considered before one is able to be approved for a credit card.

Credit score, debt, yearly income, and age are all features. The field for the classification label is credit card decision.

Name Credit Score Debt (\$) Yearly Income (\$) Age Credit Card Decision
Joe Smith 610 3000 45000 24 Denied
Jill Mason 620 12000 58000 22 Approved
Brandon Cohen 700 0 90000 27 Approved
Ariel Pan 720 0 110000 29 Approved

There's a small trend that people that have a higher credit score, less debt and more income are more likely to be approved for a particular credit card.

In the table above, each row is considered an observation. With each observation as a record of a person, there's additional details on their credit score, debt, yearly income and age.

The columns Credit Score and Debt (\$), Yearly Income (\$) and Age are considered features. They are measurable characteristics for each observation.

The last column in our table is Credit Card Decision and represents a class or sometimes considered a label. In the past, a representative from the bank would choose to approve or deny each credit card application based on that observation's features. A popular name for this scenario is called labeled data.

#### Binary Classification¶

In the example above, there were just two possible class labels - either Denied or Approved. Therefore, this is considered a binary classification example. The term binary means related to two things.

#### Multiclass Classification¶

In machine learning, a multiclass classifier is the problem of classifying observations into three or more classes.

These five below I consider classical machine learning algorithms - as many were first profiled tens of years ago.

• Logistic Regression
• SVM
• Naive Bayes
• Decision Trees
• Random Forest

In this notebook, we're just going to learn Logistic Regression. Why? Well Logistic Regression is simple to implement and fits to data quickly. Also, this model is very interpretable - both in the math with how it works and interpretability of features. Not all other algorithms listed above also fit this criteria.

However, note Logistic Regression is often regarded as one of the simpler classification algorithms. It is just OK at capturing the variance with many features. So, you likely won't get as strong of a fit of a model with it compared to more complex machine learning models like XGBoost or neural networks.

• Diagnosed with a certain type of disease
• Spam classification of text comments
• Image classification
• Hot dog or not hot dog
• Nudity or not nudity
• Food safe to eat or not
• recognize objects on streets for self-driving cars
• Customer churn - essentially will a person continue using an app/service or leave
• Default or not on a credit card

### Visual Derivation of Logistic Regression Equation¶

Let's continue with our credit card application example. We'll utilize a single feature, yearly income, and two class labels of approved and denied.

In [3]:
yearly_incomes_us_dollars = [12000, 14000, 25000, 28000, 38000, 45000, 50000, 55000, 60000, 66000, 73000, 75000]
credit_card_decisions = ["denied"]*4 + ["approved"] + ["denied"]*2 + ["approved"]*5

In [4]:
application_data = {'yearly_income': yearly_incomes_us_dollars,
'credit_card_decision': credit_card_decisions
}
df_credit_card_applications = pd.DataFrame(application_data)

In [5]:
plt.figure(figsize=(9, 7))
sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13) plt.ylabel("credit card decision", labelpad=13) plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  Upon looking at this graph, I think of a divide in which people that earn over$51,000 per year are more likely to be approved for this credit card. Below, I visualize my first thought with a Denied Side and Approved Side annotation.

In [6]:
plt.figure(figsize=(9, 7))
ax = sns.stripplot(data=df_credit_card_applications, x="yearly_income", y="credit_card_decision", jitter=False, size=13)
plt.xlabel("yearly income ($)", labelpad=13) plt.ylabel("credit card decision", labelpad=13) plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20) plt.axvline(x=51000, linestyle="--", color='green') bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8) ax.text(65000, 0.5, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved) bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3) ax.text(30000, 0.5, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  #### 2 Features¶ Let's gather additional data to help us build our best fit of equation to classify whether someone is approved or denied of a credit card. Below, I incorporated data for people's credit score. In [7]: credit_scores = [580, 600, 620, 640, 680, 670, 650, 700, 690, 710, 680, 715]  In [8]: application_data["credit_score"] = credit_scores df_credit_card_applications = pd.DataFrame(application_data)  In [9]: df_credit_card_applications  Out[9]: yearly_income credit_card_decision credit_score 0 12000 denied 580 1 14000 denied 600 2 25000 denied 620 3 28000 denied 640 4 38000 approved 680 5 45000 denied 670 6 50000 denied 650 7 55000 approved 700 8 60000 approved 690 9 66000 approved 710 10 73000 approved 680 11 75000 approved 715 Let's visualize our two features and two class labels. In [10]: plt.figure(figsize=(10, 8)) sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200) plt.title("Credit Card Application Details", y=1.015, fontsize=20) plt.xlabel("yearly income", labelpad=13) plt.ylabel("credit score", labelpad=13) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  #### Visually Predict Which Class a New Observation Belongs To¶ Below we see a new green plus sign to mark a new observation that doesn't have a labeled class yet. What class do you think it belongs to? I would classify the new observation as being denied. In [11]: plt.figure(figsize=(10, 8)) ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200) plt.scatter([40000], [600], c='green', s=200, marker='+') bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="g", lw=2.5) t = ax.text(30000, 600, "Predict class", ha="center", va="center", size=17, bbox=bbox_props) plt.title("Credit Card Application Details", y=1.015, fontsize=20) plt.xlabel("yearly income", labelpad=13) plt.ylabel("credit score", labelpad=13) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  Below is another example, a purple plus sign, in which we have a new observation but no class label. What class do you think it belongs to? This one is trickier. There's a decent chance/probability that this observation could be approved or denied. I would hypothesize there's a 0.60 probability that this new observation is denied. In [12]: plt.figure(figsize=(10, 8)) ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200) plt.scatter([45000], [660], c='purple', s=200, marker='+') bbox_props = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="purple", lw=2.5) t = ax.text(35000, 660, "Predict class", ha="center", va="center", size=17, bbox=bbox_props) plt.title("Credit Card Application Details", y=1.015, fontsize=20) plt.xlabel("yearly income", labelpad=13) plt.ylabel("credit score", labelpad=13) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  In my mind, I think of a dividing line that would separate the two classes. I plotted one below. However, this isn't perfect. There's an observation that was approved in my denied side. I figure with additional data points, there will also be more mis-classifications like this one. In [13]: plt.figure(figsize=(10, 8)) ax = sns.scatterplot(x='yearly_income', y='credit_score', hue='credit_card_decision', data=df_credit_card_applications, s=200) bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8) ax.text(65000, 660, "Approved Side", ha="center", va="center", size=23, bbox=bbox_props_approved) bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.3) ax.text(28000, 660, "Denied Side", ha="center", va="center", size=23, bbox=bbox_props_denied); plt.plot([35000, 69000], [715, 585], linestyle="--", color='green') plt.title("Credit Card Application Details", y=1.015, fontsize=20) plt.xlabel("yearly income", labelpad=13) plt.ylabel("credit score", labelpad=13) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  #### Let's Try Linear Regression for Classification¶ Below I create a new column called decision_label that marks a 1 if the credit_card_decision is approved and a 0 if denied. In [14]: df_credit_card_applications['decision_label'] = np.where(df_credit_card_applications['credit_card_decision']=="approved", 1, 0)  Preview the DataFrame to visually check our logic is correct. In [15]: df_credit_card_applications[['credit_card_decision', 'decision_label']].sample(5)  Out[15]: credit_card_decision decision_label 0 denied 0 10 approved 1 7 approved 1 5 denied 0 1 denied 0 We can visualize this new 1 and 0 class label name on a plot below. In [16]: plt.figure(figsize=(9, 6)) sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200) plt.xlabel("yearly income ($)", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));


We'll fit a linear regression model to our entire dataset. First, instantiate the LinearRegression object that was imported at the top of our script and assign it to the variable linear_regression. You can read more about the official documentation of Linear Regression on sklearn.

In [17]:
linear_regression = LinearRegression()


Let's build our linear regression line of best fit and assign it to lr. First, we have to get our list of x-values and y-values in the right data format for sklearn.

In [18]:
x_values = df_credit_card_applications['yearly_income'].values.reshape(-1,1)

In [19]:
x_values

Out[19]:
array([[12000],
[14000],
[25000],
[28000],
[38000],
[45000],
[50000],
[55000],
[60000],
[66000],
[73000],
[75000]])
In [20]:
y_values = df_credit_card_applications['decision_label'].values

In [21]:
y_values

Out[21]:
array([0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1])
In [22]:
lr = linear_regression.fit(x_values, y_values)


The coefficient of our linear regression model.

In [23]:
lr.coef_

Out[23]:
array([1.83358404e-05])

The intercept of our linear regression model.

In [24]:
lr.intercept_

Out[24]:
-0.3266408043702007

Our linear regression line of best fit can be modeled by the equation:

target = lr.intercept_ + yearly_income * lr.coef_

Let's plot our linear regression line of best fit using the minimum and maximum values from our x and y axes.

In [25]:
min_income = df_credit_card_applications['yearly_income'].min()
max_income = df_credit_card_applications['yearly_income'].max()

In [26]:
print("min_income: {0}".format(min_income))
print("max_income: {0}".format(max_income))

min_income: 12000
max_income: 75000


We can use the predict method to predict a credit card decision value given a yearly annual income value.

In [27]:
lr.predict(min_income)

Out[27]:
array([-0.10661072])

Here is our linear regression line of best fit in the range of yearly income values in our dataset.

In [28]:
plt.figure(figsize=(9, 6))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13) plt.ylabel("credit card decision", labelpad=13) plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  Below, we calculate our R^2 when trained on the entire dataset. At 0.58, it's reasonable, but we fail to capture a lot of the variance in our data. In [29]: r_squared = lr.score(x_values, y_values)  In [30]: r_squared  Out[30]: 0.5898028659646901 Below, I plot a decision threshold as a dotted green line that's equivalent to y=0.5. A decision threshold represents the result of a quantitative test to a simple binary decision. For example, given an input of a yearly income value, if we get a prediction value greater than 0.5, we'll simply round up and classify that observation as approved. Let's try a yearly income of \$50,000.

In [31]:
lr.predict(50000)

Out[31]:
array([0.59015122])

Our returned result of 0.59 is greater than 0.5 so we can round up to 1 and classify this observation as being approved for a credit card.

Now let's try a yearly income of \$33,000. In [32]: lr.predict(33000)  Out[32]: array([0.27844193]) Our returned result of 0.27 is less than 0.5 so we can round down to 0 and classify this observation as being rejected for a credit card. Below, I visualize this decision boundary. In [33]: plt.figure(figsize=(11, 8)) sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200) plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet') plt.axhline(y=0.5, linestyle="--", color='green') plt.xlabel("yearly income ($)", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20)
ax = plt.gca()
orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5)
blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5)
bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8)
ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved)
bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2)
ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied)
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));


This model clearly isn't perfect. Below, I point to two observations that our model misclassifies given our linear model and rounding up/down logic.

In [34]:
plt.figure(figsize=(11, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([min_income, max_income], [lr.predict(min_income), lr.predict(max_income)], c='darkviolet')
plt.axhline(y=0.5, linestyle="--", color='green')
plt.xlabel("yearly income ($)", labelpad=13) plt.ylabel("credit card decision", labelpad=13) plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20) ax = plt.gca() orange_rect = patches.Rectangle((45000,0.5), 30000, 0.5, linewidth=1, edgecolor='orange', facecolor='navajowhite', alpha=0.5) ax.add_patch(orange_rect) blue_rect = patches.Rectangle((10000, 0), 35000, 0.5, linewidth=1, edgecolor='blue', facecolor='skyblue', alpha=0.5) ax.add_patch(blue_rect) bbox_props_approved = dict(boxstyle="round", fc="orange", ec="0.8", alpha=0.8) ax.text(67000, 0.7, "Always Approve", ha="center", va="center", size=20, bbox=bbox_props_approved) bbox_props_denied = dict(boxstyle="round", fc="blue", ec="0.8", alpha=0.2) ax.text(20000, 0.3, "Always Deny", ha="center", va="center", size=20, bbox=bbox_props_denied) ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values)); ax.legend_.remove() # removes legend bbox_props_left = dict(boxstyle="larrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5) t = ax.text(63000, 0, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left) bbox_props_left = dict(boxstyle="rarrow, pad=0.6", fc="snow", ec="firebrick", lw=2.5) t = ax.text(25000, 1, "Model misclassification", ha="center", va="center", size=17, bbox=bbox_props_left)  In [35]: df_credit_card_applications.head()  Out[35]: yearly_income credit_card_decision credit_score decision_label 0 12000 denied 580 0 1 14000 denied 600 0 2 25000 denied 620 0 3 28000 denied 640 0 4 38000 approved 680 1 Let's show another reason that a linear regression isn't the best model to fit for this data. I'll now add a single outlier - an individual that applied for this credit card and has a income of \$250,000.

In [36]:
wealthy_person = {'yearly_income': [250000],
'credit_card_decision': ["approved"],
'credit_score': [715],
'decision_label': [1]
}
df_wealthy = pd.DataFrame(wealthy_person)

In [37]:
df_cc_applications_with_outlier = pd.concat([df_credit_card_applications, df_wealthy])


The steps below fit a new linear regression model to our new data that includes this outlier.

In [38]:
linear_regression2 = LinearRegression()
x_values2 = df_cc_applications_with_outlier['yearly_income'].values.reshape(-1,1)
y_values2 = df_cc_applications_with_outlier['decision_label'].values
lr2 = linear_regression2.fit(x_values2, y_values2)


We assign a variable max_income_outlier to our new max income value of \$250,000. In [39]: max_income_outlier = df_cc_applications_with_outlier['yearly_income'].max()  Our visualization below plots the new linear regression line of best fit with this additional outlier point - the orange dot to the far-most right. It's clear this line doesn't capture the variance in our data well. If we followed the previous decision threshold rules mentioned earlier, we'd likely misclassify more observations moving forward. In [40]: plt.figure(figsize=(9, 6)) sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_cc_applications_with_outlier, s=200) plt.plot([min_income, max_income_outlier], [lr2.predict(min_income), lr2.predict(max_income_outlier)], c='darkviolet') plt.xlabel("yearly income ($)", labelpad=13)
plt.title("Credit Card Decisions Based on Yearly Income with Outlier", y=1.015, fontsize=20)
ax = plt.gca()
ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));


Even without this outlier, it's clear this linear regression model doesn't fit to our data well. Let's not use the data with that outlier. Let's take an earlier visualization of our linear regression line of best fit and view it on a larger x and y scale below.

Wow this linear regression seems off! How could someone have a credit card decision greater than 1? That doesn't make sense since you're either approved or denied, what we marked as 1 or 0.

In [41]:
zero_income = 0
super_high_income = 200000
plt.figure(figsize=(10, 8))
sns.scatterplot(x="yearly_income", y="decision_label", hue="decision_label", data=df_credit_card_applications, s=200)
plt.plot([zero_income, super_high_income], [lr.predict(zero_income), lr.predict(super_high_income)], c='darkviolet')
plt.xlabel("yearly income ($)", labelpad=13) plt.ylabel("credit card decision", labelpad=13) plt.title("Credit Card Decisions Based on Yearly Income", y=1.015, fontsize=20) ax = plt.gca() ax.xaxis.set_major_formatter(tick.FuncFormatter(reformat_large_tick_values));  ### Logistic Regression to the Rescue!¶ We need a better line of fit in order to classify our credit card application data. First, let's learn how the sigmoid function can be relevant. #### Sigmoid Function¶ A sigmoid function is a mathematical function that represents an "s"-shaped curve. It's defined by the formula: $$\sigma(x)=\frac{1}{1+e^{-x}}$$ I'll create a list of x_values to plot the sigmoid function. The line of code below creates a numpy array of 120 values linearly spaced between -10 and 10. Here's the official documentation page. In [42]: x_values_for_sigmoid = np.linspace(-10, 10, 120)  Preview the first 5 values of x_values. In [43]: x_values_for_sigmoid[0:5]  Out[43]: array([-10. , -9.83193277, -9.66386555, -9.49579832, -9.32773109]) Our sigmoid equation from above wrapped into a Python function. In [44]: def sigmoid(x): return 1 / (1 + np.exp(-x))  We can see from the plot below that as x increase, the line approaches 1.0 but never reaches it exactly (unless you round up). Similarly, as x decreases, the line approaches 0.0 but never reaches it exactly (unless you round down). These limits of this function are exactly what we need to build our line of best fit for a binary classification. This resolves many of the problems we had with using linear regression in this situation. In [45]: plt.figure(figsize=(8, 7)) plt.plot(x_values_for_sigmoid, sigmoid(x_values_for_sigmoid), c='teal') plt.title('Sigmoid Function') plt.grid(True) plt.text(2, 0.3, r'$\sigma(x)=\frac{1}{1+e^{-x}}$', fontsize=26) plt.show()  ### Logistic Function¶ Below is the logistic function for a single feature. The denominator looks similar to linear regression with an intercept, coefficient and error value. $$y_{\beta} (x) = \frac{1}{1+e^{-(\beta_0 + \beta_1 x + \epsilon)}}$$ #### Fit Logistic Regression Model on Credit Card Data¶ We'd expect a normal distribution with income levels. So, most incomes would be around the mean and median value. I use numpy's normal method from the random module to generate the data. See the documentation here. In [46]: yearly_incomes_us_dollars = np.random.normal(loc=50000, scale=15000, size=500)  In [47]: sns.distplot(yearly_incomes_us_dollars)  /Users/dan/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result. return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval  Out[47]: <matplotlib.axes._subplots.AxesSubplot at 0x1a15d3e278> In [48]: len(yearly_incomes_us_dollars)  Out[48]: 500 I'm going to sort yearly_income_us_dollars from least to greatest. In [49]: yearly_incomes_us_dollars = sorted(yearly_incomes_us_dollars)  In [50]: credit_card_decisions = [0]*180 + [1]*4 + [0]*16 + [1]*5 + [0]*10 + [1]*(500-215)  In [51]: application_data = {'yearly_income': yearly_incomes_us_dollars, 'credit_card_decision': credit_card_decisions } df_credit_card_applications2 = pd.DataFrame(application_data)  I want to now see the pattern between income level and credit card decision. I figure people that earn more money generally are accepted more than people who earn less money. I'll use the Pandas cut method to bin income levels by very low, low, medium, high and very high. In [52]: income_bin_labels = pd.cut(df_credit_card_applications2['yearly_income'], bins=5, labels=["very low", "low", "medium", "high", "very high"])  In [53]: df_credit_card_applications2['income_bucket'] = income_bin_labels  There's an equal number of people in each of the five bins. In [54]: df_credit_card_applications2['income_bucket'].value_counts()  Out[54]: medium 245 high 156 low 71 very high 20 very low 8 Name: income_bucket, dtype: int64 Now that our continuous variable (income) is a categorical variable (income level), we can see the mix of the count of credit card decisions by income level. To visualize the data, I use Seaborn's countplot() method. Our hypothesis was expected in that wealther individuals are more likely to get accepted for a credit card (decision value of 1) compared to less wealthy people. In [55]: plt.figure(figsize=(14, 8)) sns.countplot(x='income_bucket', hue='credit_card_decision', data=df_credit_card_applications2) plt.title("Credit Card Decisions by Income Buckets", y=1.015) plt.ylabel("count [people]", labelpad=13) plt.xlabel("income bucket level", labelpad=13);  In [56]: plt.figure(figsize=(14, 8)) sns.catplot(x="credit_card_decision", y="yearly_income", kind="swarm", data=df_credit_card_applications2, height=8.5, aspect=.9) plt.xlabel("credit card decision", labelpad=15) plt.ylabel("yearly income [$]", labelpad=15)
plt.title("Distribution of Yearly Income by Credit Card Decision", y=1.013);

<matplotlib.figure.Figure at 0x1a1750e978>
In [57]:
X = df_credit_card_applications2['yearly_income'].values.reshape(-1, 1)


We need to standardize our feature yearly_income and can do so using StandardScaler in sklearn.

Why do we standardize features here? We talked about this in a previous class. Logistic regression by default uses regularization and regularization works best when we standardize our features. Otherwise, if we have outliers, we'll have issues in regularization.

The equation to transform each of our feature values to get a standardized feature value is below. This rescales our features to have a mean of 0 and unit variance - a variance qual to 1.

$$x_{i} = \frac{x_{i} - \bar{x}}{\sigma}$$

In layman's terms, I'd describe this as: for every feature value, subtract the mean of the feature vector and divide that numerator by the standard deviation of the feature vector.

In [58]:
scaler = StandardScaler()

In [59]:
scaler.fit(X)

Out[59]:
StandardScaler(copy=True, with_mean=True, with_std=True)
In [60]:
X_transformed = scaler.transform(X)


We can preview a histogram to see the distribution of our new feature-scaled yearly income values.

There's a uniform distribution with a median of roughly 0 and a mean of roughly 0.

In [61]:
plt.figure(figsize=(8, 6))
sns.distplot(X_transformed)
plt.title("Distribution of transformed yearly income values", y=1.015)

/Users/dan/anaconda/lib/python3.6/site-packages/scipy/stats/stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use arr[tuple(seq)] instead of arr[seq]. In the future this will be interpreted as an array index, arr[np.array(seq)], which will result either in an error or a different result.