Self Join¶
Date published: 2019-04-14
Category: Data Analysis
Subcategory: Data Wrangling
Tags: self join, pandas merge, python, pandas
In SQL, a popular type of join is a self join which joins a table to itself. This is helpful for comparing rows to one another, based on their values in columns, in a single table.
In this article, I'll walk through two examples in which self joins can be helpful.
Import Modules¶
import pandas as pd
Example 1: Basic Real Estate Transactions¶
Create Dataset¶
I'll create a small dataset of 5 real estate transactions that include a unique transaction id for each purchase, a close date for each sale, the buyer's name and seller's name.
Notice how Julia was the buyer for transaction id 1
and later a seller for transaction id 2
.
data = {'transaction_id': [1, 2, 3, 4, 5],
'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"],
'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
}
df = pd.DataFrame(data)
View entire df
.
df
transaction_id | close_date | buyer_name | seller_name | |
---|---|---|---|---|
0 | 1 | 2012-08-01 | Julia | Lara |
1 | 2 | 2012-08-02 | Joe | Julia |
2 | 3 | 2012-08-03 | Jake | Barbara |
3 | 4 | 2012-08-04 | Jamie | Emily |
4 | 5 | 2012-08-04 | Jackie | Mason |
Find People Who Were Both Buyers and Sellers¶
Often times, people buy homes and then later sell that homes. In this dataset, I'm curious, which people both bought and sold a home? We noticed earlier Julia bought a home and later sold one so Julia's name should be the only result.
One method of finding a solution is to do a self join. In pandas, the DataFrame object has a merge()
method. Below, for df
, for the merge
method, I'll set the following arguments:
right=df
so that the firstdf
listed in the statement merges with another DataFrame,df
left_on='buyer_name'
is the column to join from the leftdf
right_on='seller_name
is the column to join from the rightdf
By default, these arguments are also set in the merge
method:
how='inner'
so returned results only show records in which the leftdf
has a value inbuyer_name
equivalent to the rightdf
with a value ofseller_name
.suffixes=(‘_x’, ‘_y’)
so_x
is appended to the end of column names from our leftdf
if those column names originally match the rightdf
._y
is appended to the end of column names from our rightdf
if those column names originally match the leftdf
.
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2
transaction_id_x | close_date_x | buyer_name_x | seller_name_x | transaction_id_y | close_date_y | buyer_name_y | seller_name_y | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2012-08-01 | Julia | Lara | 2 | 2012-08-02 | Joe | Julia |
Our output of df2
shows in a single record, the details of Julia who bought a home and sold a home.
We can find all unique values in the buyer_name_x
field to programmatically arrive at our result.
df2['buyer_name_x'].unique()
array(['Julia'], dtype=object)
Example 2: Intermediate Real Estate Transactions¶
Append New Row to Dataset¶
Below, I create a new row for another real estate transaction in which Julia buys a 2nd home.
df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]
View new df
with additional row.
df
transaction_id | close_date | buyer_name | seller_name | |
---|---|---|---|---|
0 | 1 | 2012-08-01 | Julia | Lara |
1 | 2 | 2012-08-02 | Joe | Julia |
2 | 3 | 2012-08-03 | Jake | Barbara |
3 | 4 | 2012-08-04 | Jamie | Emily |
4 | 5 | 2012-08-04 | Jackie | Mason |
5 | 6 | 2012-08-05 | Julia | Mary |
Find People Who Are Both Buyers and Sellers¶
This is the same ask as with Example 1. However, our dataset is slightly different so a self join will return different results.
I'll use the same code to perform a self join but assign the output to df
instead.
df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3
transaction_id_x | close_date_x | buyer_name_x | seller_name_x | transaction_id_y | close_date_y | buyer_name_y | seller_name_y | |
---|---|---|---|---|---|---|---|---|
0 | 1 | 2012-08-01 | Julia | Lara | 2 | 2012-08-02 | Joe | Julia |
1 | 6 | 2012-08-05 | Julia | Mary | 2 | 2012-08-02 | Joe | Julia |
There are two records!
The first record indicates Julia's purchase for transaction_id
of 1
and later a sale with transaction_id
of 2
.
The second record indicates Julia's purchase for transaction_id
of 6
and later a sale with transaction_id
of 2
.
This is the correct output as I wanted all rows of df
to be joined with df
in which a buyer_name
from the left df
is equivalent to a seller_name
from the right df
.
I can find all unique values of the buyer_name_x
field to programmatically arrive at our result.
df3['buyer_name_x'].unique()
array(['Julia'], dtype=object)