Self Join
- April 14, 2019
- Key Terms: self join, pandas merge, python, pandas
In SQL, a popular type of join is a self join which joins a table to itself. This is helpful for comparing rows to one another, based on their values in columns, in a single table.
In this article, I'll walk through two examples in which self joins can be helpful.
Import Modules¶
import pandas as pd
Example 1: Basic Real Estate Transactions¶
Create Dataset¶
I'll create a small dataset of 5 real estate transactions that include a unique transaction id for each purchase, a close date for each sale, the buyer's name and seller's name.
Notice how Julia was the buyer for transaction id 1
and later a seller for transaction id 2
.
data = {'transaction_id': [1, 2, 3, 4, 5],
'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"],
'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
}
df = pd.DataFrame(data)
View entire df
.
df
Find People Who Were Both Buyers and Sellers¶
Often times, people buy homes and then later sell that homes. In this dataset, I'm curious, which people both bought and sold a home? We noticed earlier Julia bought a home and later sold one so Julia's name should be the only result.
One method of finding a solution is to do a self join. In pandas, the DataFrame object has a merge()
method. Below, for df
, for the merge
method, I'll set the following arguments:
right=df
so that the firstdf
listed in the statement merges with another DataFrame,df
left_on='buyer_name'
is the column to join from the leftdf
right_on='seller_name
is the column to join from the rightdf
By default, these arguments are also set in the merge
method:
how='inner'
so returned results only show records in which the leftdf
has a value inbuyer_name
equivalent to the rightdf
with a value ofseller_name
.suffixes=(‘_x’, ‘_y’)
so_x
is appended to the end of column names from our leftdf
if those column names originally match the rightdf
._y
is appended to the end of column names from our rightdf
if those column names originally match the leftdf
.
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2
Our output of df2
shows in a single record, the details of Julia who bought a home and sold a home.
We can find all unique values in the buyer_name_x
field to programmatically arrive at our result.
df2['buyer_name_x'].unique()
Example 2: Intermediate Real Estate Transactions¶
Append New Row to Dataset¶
Below, I create a new row for another real estate transaction in which Julia buys a 2nd home.
df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]
View new df
with additional row.
df
Find People Who Are Both Buyers and Sellers¶
This is the same ask as with Example 1. However, our dataset is slightly different so a self join will return different results.
I'll use the same code to perform a self join but assign the output to df
instead.
df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3
There are two records!
The first record indicates Julia's purchase for transaction_id
of 1
and later a sale with transaction_id
of 2
.
The second record indicates Julia's purchase for transaction_id
of 6
and later a sale with transaction_id
of 2
.
This is the correct output as I wanted all rows of df
to be joined with df
in which a buyer_name
from the left df
is equivalent to a seller_name
from the right df
.
I can find all unique values of the buyer_name_x
field to programmatically arrive at our result.
df3['buyer_name_x'].unique()