Create New Columns Based on Operations¶
Date published: 2018-12-24
Category: Data Analysis
Subcategory: Data Wrangling
Tags: python, pandas
A pivotal part of data analysis is to create new columns based on some operation. This tutorial will cover multiple ways to do that.
Import Modules¶
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
% matplotlib inline
Get Tips Dataset¶
Let's get the tips dataset from the Python seaborn visualization library and assign it to the DataFrame df_tips.
df_tips = sns.load_dataset('tips')
Each row represents a unique meal at a restaurant for a party of people; the dataset contains the following fields:
| column name | column description |
|---|---|
total_bill |
financial amount of meal in U.S. dollars |
tip |
financial amount of the meal's tip in U.S. dollars |
sex |
gender of server |
smoker |
boolean to represent if server smokes or not |
day |
day of week |
time |
meal name (Lunch or Dinner) |
size |
count of people eating meal |
Preview the first 5 rows of df_tips using the pandas series head() method.
df_tips.head()
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
Create New Column Based on Mathematical Operations of Column(s)¶
We can perform mathematical operations on columns such as addition, division and multiplication.
Addition¶
Let's create a new column called total_money_received that's the addition of the total_bill and tip column. This total amount is how much money the restaurant receives for each order.
df_tips['total_money_received'] = df_tips['total_bill'] + df_tips['tip']
Let's see a preview of these three columnns.
df_tips[['total_bill', 'tip', 'total_money_received']].sample(n=5)
| total_bill | tip | total_money_received | |
|---|---|---|---|
| 106 | 20.49 | 4.06 | 24.55 |
| 163 | 13.81 | 2.00 | 15.81 |
| 124 | 12.48 | 2.52 | 15.00 |
| 155 | 29.85 | 5.14 | 34.99 |
| 54 | 25.56 | 4.34 | 29.90 |
Division¶
Let's create a new column called tip_proportion_of_total_bill that's a calculation of tip values divided by their respective total_bill value.
df_tips['tip_proportion_of_total_bill'] = df_tips['tip']/df_tips['total_bill']
Let's preview these three columns.
df_tips[['total_bill', 'tip', 'tip_proportion_of_total_bill']].sample(n=5)
| total_bill | tip | tip_proportion_of_total_bill | |
|---|---|---|---|
| 127 | 14.52 | 2.00 | 0.137741 |
| 25 | 17.81 | 2.34 | 0.131387 |
| 235 | 10.07 | 1.25 | 0.124131 |
| 65 | 20.08 | 3.15 | 0.156873 |
| 12 | 15.42 | 1.57 | 0.101816 |
For aesthetic purposes of the values in the tip_proportion_of_total_bill, I will round all values to four decimal places using the round() method below.
df_tips['tip_proportion_of_total_bill'] = df_tips['tip_proportion_of_total_bill'].round(decimals=4)
Additionally, I want to create a new column called money_received_per_person that's a calculation of total_money_received divided by size. I round values in this new column to two decimal places.
df_tips['money_received_per_person'] = df_tips['total_money_received']/df_tips['size']
df_tips['money_received_per_person'] = df_tips['money_received_per_person'].round(decimals=2)
Let's preview these three columns.
df_tips[['total_money_received', 'size', 'money_received_per_person']].sample(n=5)
| total_money_received | size | money_received_per_person | |
|---|---|---|---|
| 64 | 20.23 | 3 | 6.74 |
| 61 | 15.81 | 2 | 7.90 |
| 141 | 41.00 | 6 | 6.83 |
| 150 | 16.57 | 2 | 8.28 |
| 155 | 34.99 | 5 | 7.00 |
Multiplication¶
I want to create a new column called tip_percentage_of_total_bill that simply takes our column called tip_proportion_of_total_bill and multiplies the value by 100. I round values in this new column to two decimal places.
df_tips['tip_percentage_of_total_bill'] = df_tips['tip_proportion_of_total_bill']*100
df_tips['tip_percentage_of_total_bill'] = df_tips['tip_percentage_of_total_bill'].round(decimals=3)
Let's preview these columns recently used for multiplication.
df_tips[['total_bill', 'tip', 'tip_proportion_of_total_bill', 'tip_percentage_of_total_bill']].head()
| total_bill | tip | tip_proportion_of_total_bill | tip_percentage_of_total_bill | |
|---|---|---|---|---|
| 0 | 16.99 | 1.01 | 0.0594 | 5.94 |
| 1 | 10.34 | 1.66 | 0.1605 | 16.05 |
| 2 | 21.01 | 3.50 | 0.1666 | 16.66 |
| 3 | 23.68 | 3.31 | 0.1398 | 13.98 |
| 4 | 24.59 | 3.61 | 0.1468 | 14.68 |
Create New Column Based on Mapping of Current Values to New Values¶
In our day column, we see the following unique values printed out below using the pandas series unique method.
df_tips['day'].unique()
[Sun, Sat, Thur, Fri] Categories (4, object): [Sun, Sat, Thur, Fri]
I don't like how the days are shortened names. The name Sun can be mapped to a longer and more colloquial name of Sunday.
In pandas, we can use the series map() method to map our current values, the shortened
day names, to longer more colloquial names of days, our new values that we'll create in a new column called day_long_name.
Below is a Python dictionary that assigns the short day names as keys and the longer names to their respective values.
short_day_name_to_long_name = {'Sun': 'Sunday', 'Sat': 'Saturday', 'Thur': 'Thursday', 'Fri': 'Friday'}
Let's assign the map() method to the day column and set the argument arg equal to our short_day_name_to_long_name dictionary. We'll set this equal to be the values in a new column assigned to day_long_name.
df_tips['day_long_name'] = df_tips['day'].map(arg=short_day_name_to_long_name)
Let's preview a few random rows of day and day_long_name.
df_tips[['day', 'day_long_name']].sample(n=5)
| day | day_long_name | |
|---|---|---|
| 40 | Sat | Saturday |
| 192 | Thur | Thursday |
| 89 | Thur | Thursday |
| 202 | Thur | Thursday |
| 98 | Fri | Friday |
Create New Column Based on Conditions Across Various Columns¶
If I were the owner of this restaurant, I'd want to learn more about the context of large groups who ordered a large dollar amount of food. To learn more about this category, I want to create a new column in our DataFrame with these characteristics.
First, let's view the distribution of the column money_received_per_person.
sns.set_context('talk')
plt.figure(figsize=(9, 5))
sns.boxplot(df_tips['money_received_per_person'], color='green')
plt.title("Distribution of Money Received Per Person Values", y=1.015)
plt.xlabel("money received per person [$]", labelpad=14);
We can see the median is roughly at 8.50 U.S. dollars.
Below, I assign variables for the 75th and 25th percentile of the money_received_per_person column that we'll use in our analysis to help create the category of big spenders per person.
large_amt_paid_per_person = round(df_tips['money_received_per_person'].quantile(q=0.75), 2)
large_amt_paid_per_person
10.71
small_amt_paid_per_person = round(df_tips['money_received_per_person'].quantile(q=0.25), 2)
small_amt_paid_per_person
6.75
Below, I assign a variable conditions to be a list of conditions that characterize people served at our restaurant by the columns money_received_per_person and size. In our first index of the variable conditions, that condition is met if money_received_per_person is greater than large_amt_paid_per_person and size is greater than or equal to 3.
conditions = [
(df_tips['money_received_per_person'] > large_amt_paid_per_person) & (df_tips['size'] >= 3),
(df_tips['money_received_per_person'] > large_amt_paid_per_person) & (df_tips['size'] < 3),
(df_tips['money_received_per_person'].between(left=small_amt_paid_per_person, right=large_amt_paid_per_person)) & (df_tips['size'] >= 3),
(df_tips['money_received_per_person'].between(left=small_amt_paid_per_person, right=large_amt_paid_per_person)) & (df_tips['size'] < 3),
(df_tips['money_received_per_person'] < small_amt_paid_per_person) & (df_tips['size'] >= 3),
(df_tips['money_received_per_person'] < small_amt_paid_per_person) & (df_tips['size'] < 3)
]
There's a variable called choices that's a list of text strings that classify this category of people that meet this condition. In the example described above, I want to classify that group of people that spend a lot of money per person and are large groups as big spenders & large group.
choices = ['big spenders & large group', 'big spenders & small group',
'medium spenders & large group', 'medium spenders & small group',
'small spenders & large group', 'small spenders & small group']
The numpy select() method returns an array drawn from elements in a choicelist. We can use our conditions above and are outcome choices to classify those groups of people based on money spent per person and group size.
In the select() method below, I set the argument condlist to be our conditions and the argument choicelist to be our choices. I created the variables conditions and choices.
df_tips['party_category'] = np.select(condlist=conditions, choicelist=choices)
For the new column created party_category, I apply the value_counts() method to see a count of each of the unique values in the column.
df_tips['party_category'].value_counts()
medium spenders & small group 82 big spenders & small group 48 medium spenders & large group 40 small spenders & large group 31 small spenders & small group 30 big spenders & large group 13 Name: party_category, dtype: int64
Let's preview the relevant columns used in this analysis.
df_tips[['money_received_per_person', 'size', 'party_category']].sample(n=5)
| money_received_per_person | size | party_category | |
|---|---|---|---|
| 167 | 9.05 | 4 | medium spenders & large group |
| 226 | 6.04 | 2 | small spenders & small group |
| 65 | 7.74 | 3 | medium spenders & large group |
| 231 | 6.23 | 3 | small spenders & large group |
| 25 | 5.04 | 4 | small spenders & large group |