Data Analysis Data Wrangling Tutorial

Create New Columns Based on Operations

A pivotal part of data analysis is to create new columns based on some operation. This tutorial will cover multiple ways to do that.

Import Modules

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline

Get Tips Dataset

Let's get the tips dataset from the Python seaborn visualization library and assign it to the DataFrame df_tips.

df_tips = sns.load_dataset('tips')

Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:

column name column description
total_bill financial amount of meal in U.S. dollars
tip financial amount of the meal's tip in U.S. dollars
sex gender of server
smoker boolean to represent if server smokes or not
day day of week
time meal name (Lunch or Dinner)
size count of people eating meal

Preview the first 5 rows of df_tips using the pandas series head() method.

df_tips.head()
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4

Create New Column Based on Mathematical Operations of Column(s)

We can perform mathematical operations on columns such as addition, division and multiplication.

Addition

Let's create a new column called total_money_received that's the addition of the total_bill and tip column. This total amount is how much money the restaurant receives for each order.

df_tips['total_money_received'] = df_tips['total_bill'] + df_tips['tip']

Let's see a preview of these three columnns.

df_tips[['total_bill', 'tip', 'total_money_received']].sample(n=5)
total_bill tip total_money_received
106 20.49 4.06 24.55
163 13.81 2.00 15.81
124 12.48 2.52 15.00
155 29.85 5.14 34.99
54 25.56 4.34 29.90

Division

Let's create a new column called tip_proportion_of_total_bill that's a calculation of tip values divided by their respective total_bill value.

df_tips['tip_proportion_of_total_bill'] = df_tips['tip']/df_tips['total_bill']

Let's preview these three columns.

df_tips[['total_bill', 'tip', 'tip_proportion_of_total_bill']].sample(n=5)
total_bill tip tip_proportion_of_total_bill
127 14.52 2.00 0.137741
25 17.81 2.34 0.131387
235 10.07 1.25 0.124131
65 20.08 3.15 0.156873
12 15.42 1.57 0.101816

For aesthetic purposes of the values in the tip_proportion_of_total_bill, I will round all values to four decimal places using the round() method below.

df_tips['tip_proportion_of_total_bill'] = df_tips['tip_proportion_of_total_bill'].round(decimals=4)

Additionally, I want to create a new column called money_received_per_person that's a calculation of total_money_received divided by size. I round values in this new column to two decimal places.

df_tips['money_received_per_person'] = df_tips['total_money_received']/df_tips['size']
df_tips['money_received_per_person'] = df_tips['money_received_per_person'].round(decimals=2)

Let's preview these three columns.

df_tips[['total_money_received', 'size', 'money_received_per_person']].sample(n=5)
total_money_received size money_received_per_person
64 20.23 3 6.74
61 15.81 2 7.90
141 41.00 6 6.83
150 16.57 2 8.28
155 34.99 5 7.00

Multiplication

I want to create a new column called tip_percentage_of_total_bill that simply takes our column called tip_proportion_of_total_bill and multiplies the value by 100. I round values in this new column to two decimal places.

df_tips['tip_percentage_of_total_bill'] = df_tips['tip_proportion_of_total_bill']*100
df_tips['tip_percentage_of_total_bill'] = df_tips['tip_percentage_of_total_bill'].round(decimals=3)

Let's preview these columns recently used for multiplication.

df_tips[['total_bill', 'tip', 'tip_proportion_of_total_bill', 'tip_percentage_of_total_bill']].head()
total_bill tip tip_proportion_of_total_bill tip_percentage_of_total_bill
0 16.99 1.01 0.0594 5.94
1 10.34 1.66 0.1605 16.05
2 21.01 3.50 0.1666 16.66
3 23.68 3.31 0.1398 13.98
4 24.59 3.61 0.1468 14.68

Create New Column Based on Mapping of Current Values to New Values

In our day column, we see the following unique values printed out below using the pandas series unique method.

df_tips['day'].unique()
[Sun, Sat, Thur, Fri]
Categories (4, object): [Sun, Sat, Thur, Fri]

I don't like how the days are shortened names. The name Sun can be mapped to a longer and more colloquial name of Sunday.

In pandas, we can use the series map() method to map our current values, the shortened day names, to longer more colloquial names of days, our new values that we'll create in a new column called day_long_name.

Below is a Python dictionary that assigns the short day names as keys and the longer names to their respective values.

short_day_name_to_long_name = {'Sun': 'Sunday', 'Sat': 'Saturday', 'Thur': 'Thursday', 'Fri': 'Friday'}

Let's assign the map() method to the day column and set the argument arg equal to our short_day_name_to_long_name dictionary. We'll set this equal to be the values in a new column assigned to day_long_name.

df_tips['day_long_name'] = df_tips['day'].map(arg=short_day_name_to_long_name)

Let's preview a few random rows of day and day_long_name.

df_tips[['day', 'day_long_name']].sample(n=5)
day day_long_name
40 Sat Saturday
192 Thur Thursday
89 Thur Thursday
202 Thur Thursday
98 Fri Friday

Create New Column Based on Conditions Across Various Columns

If I were the owner of this restaurant, I'd want to learn more about the context of large groups who ordered a large dollar amount of food. To learn more about this category, I want to create a new column in our DataFrame with these characteristics.

First, let's view the distribution of the column money_received_per_person.

sns.set_context('talk')
plt.figure(figsize=(9, 5))
sns.boxplot(df_tips['money_received_per_person'], color='green')
plt.title("Distribution of Money Received Per Person Values", y=1.015)
plt.xlabel("money received per person [$]", labelpad=14);

png

We can see the median is roughly at 8.50 U.S. dollars.

Below, I assign variables for the 75th and 25th percentile of the money_received_per_person column that we'll use in our analysis to help create the category of big spenders per person.

large_amt_paid_per_person = round(df_tips['money_received_per_person'].quantile(q=0.75), 2)
large_amt_paid_per_person
10.71
small_amt_paid_per_person = round(df_tips['money_received_per_person'].quantile(q=0.25), 2)
small_amt_paid_per_person
6.75

Below, I assign a variable conditions to be a list of conditions that characterize people served at our restaurant by the columns money_received_per_person and size. In our first index of the variable conditions, that condition is met if money_received_per_person is greater than large_amt_paid_per_person and size is greater than or equal to 3.

conditions = [
    (df_tips['money_received_per_person'] > large_amt_paid_per_person) & (df_tips['size'] >= 3),
    (df_tips['money_received_per_person'] > large_amt_paid_per_person) & (df_tips['size'] < 3),
    (df_tips['money_received_per_person'].between(left=small_amt_paid_per_person, right=large_amt_paid_per_person)) & (df_tips['size'] >= 3),
    (df_tips['money_received_per_person'].between(left=small_amt_paid_per_person, right=large_amt_paid_per_person)) & (df_tips['size'] < 3),
    (df_tips['money_received_per_person'] < small_amt_paid_per_person) & (df_tips['size'] >= 3),
    (df_tips['money_received_per_person'] < small_amt_paid_per_person) & (df_tips['size'] < 3)
]

There's a variable called choices that's a list of text strings that classify this category of people that meet this condition. In the example described above, I want to classify that group of people that spend a lot of money per person and are large groups as big spenders & large group.

choices = ['big spenders & large group', 'big spenders & small group', 
           'medium spenders & large group', 'medium spenders & small group',
           'small spenders & large group', 'small spenders & small group']

The numpy select() method returns an array drawn from elements in a choicelist. We can use our conditions above and are outcome choices to classify those groups of people based on money spent per person and group size.

In the select() method below, I set the argument condlist to be our conditions and the argument choicelist to be our choices. I created the variables conditions and choices.

df_tips['party_category'] = np.select(condlist=conditions, choicelist=choices)

For the new column created party_category, I apply the value_counts() method to see a count of each of the unique values in the column.

df_tips['party_category'].value_counts()
medium spenders & small group    82
big spenders & small group       48
medium spenders & large group    40
small spenders & large group     31
small spenders & small group     30
big spenders & large group       13
Name: party_category, dtype: int64

Let's preview the relevant columns used in this analysis.

df_tips[['money_received_per_person', 'size', 'party_category']].sample(n=5)
money_received_per_person size party_category
167 9.05 4 medium spenders & large group
226 6.04 2 small spenders & small group
65 7.74 3 medium spenders & large group
231 6.23 3 small spenders & large group
25 5.04 4 small spenders & large group