Self Join¶
Date published: 2019-04-14
Category: Data Analysis
Subcategory: Data Wrangling
Tags: self join, pandas merge, python, pandas
In SQL, a popular type of join is a self join which joins a table to itself. This is helpful for comparing rows to one another, based on their values in columns, in a single table.
In this article, I'll walk through two examples in which self joins can be helpful.
Import Modules¶
import pandas as pd
Example 1: Basic Real Estate Transactions¶
Create Dataset¶
I'll create a small dataset of 5 real estate transactions that include a unique transaction id for each purchase, a close date for each sale, the buyer's name and seller's name.
Notice how Julia was the buyer for transaction id 1 and later a seller for transaction id 2.
data = {'transaction_id': [1, 2, 3, 4, 5],
'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"],
'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
}
df = pd.DataFrame(data)
View entire df.
df
| transaction_id | close_date | buyer_name | seller_name | |
|---|---|---|---|---|
| 0 | 1 | 2012-08-01 | Julia | Lara |
| 1 | 2 | 2012-08-02 | Joe | Julia |
| 2 | 3 | 2012-08-03 | Jake | Barbara |
| 3 | 4 | 2012-08-04 | Jamie | Emily |
| 4 | 5 | 2012-08-04 | Jackie | Mason |
Find People Who Were Both Buyers and Sellers¶
Often times, people buy homes and then later sell that homes. In this dataset, I'm curious, which people both bought and sold a home? We noticed earlier Julia bought a home and later sold one so Julia's name should be the only result.
One method of finding a solution is to do a self join. In pandas, the DataFrame object has a merge() method. Below, for df, for the merge method, I'll set the following arguments:
right=dfso that the firstdflisted in the statement merges with another DataFrame,dfleft_on='buyer_name'is the column to join from the leftdfright_on='seller_nameis the column to join from the rightdf
By default, these arguments are also set in the merge method:
how='inner'so returned results only show records in which the leftdfhas a value inbuyer_nameequivalent to the rightdfwith a value ofseller_name.suffixes=(‘_x’, ‘_y’)so_xis appended to the end of column names from our leftdfif those column names originally match the rightdf._yis appended to the end of column names from our rightdfif those column names originally match the leftdf.
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2
| transaction_id_x | close_date_x | buyer_name_x | seller_name_x | transaction_id_y | close_date_y | buyer_name_y | seller_name_y | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012-08-01 | Julia | Lara | 2 | 2012-08-02 | Joe | Julia |
Our output of df2 shows in a single record, the details of Julia who bought a home and sold a home.
We can find all unique values in the buyer_name_x field to programmatically arrive at our result.
df2['buyer_name_x'].unique()
array(['Julia'], dtype=object)
Example 2: Intermediate Real Estate Transactions¶
Append New Row to Dataset¶
Below, I create a new row for another real estate transaction in which Julia buys a 2nd home.
df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]
View new df with additional row.
df
| transaction_id | close_date | buyer_name | seller_name | |
|---|---|---|---|---|
| 0 | 1 | 2012-08-01 | Julia | Lara |
| 1 | 2 | 2012-08-02 | Joe | Julia |
| 2 | 3 | 2012-08-03 | Jake | Barbara |
| 3 | 4 | 2012-08-04 | Jamie | Emily |
| 4 | 5 | 2012-08-04 | Jackie | Mason |
| 5 | 6 | 2012-08-05 | Julia | Mary |
Find People Who Are Both Buyers and Sellers¶
This is the same ask as with Example 1. However, our dataset is slightly different so a self join will return different results.
I'll use the same code to perform a self join but assign the output to df instead.
df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3
| transaction_id_x | close_date_x | buyer_name_x | seller_name_x | transaction_id_y | close_date_y | buyer_name_y | seller_name_y | |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2012-08-01 | Julia | Lara | 2 | 2012-08-02 | Joe | Julia |
| 1 | 6 | 2012-08-05 | Julia | Mary | 2 | 2012-08-02 | Joe | Julia |
There are two records!
The first record indicates Julia's purchase for transaction_id of 1 and later a sale with transaction_id of 2.
The second record indicates Julia's purchase for transaction_id of 6 and later a sale with transaction_id of 2.
This is the correct output as I wanted all rows of df to be joined with df in which a buyer_name from the left df is equivalent to a seller_name from the right df.
I can find all unique values of the buyer_name_x field to programmatically arrive at our result.
df3['buyer_name_x'].unique()
array(['Julia'], dtype=object)