Data Analysis Data Wrangling Tutorial

Self Join

In SQL, a popular type of join is a self join which joins a table to itself. This is helpful for comparing rows to one another, based on their values in columns, in a single table.

In this article, I'll walk through two examples in which self joins can be helpful.

Import Modules

import pandas as pd

Example 1: Basic Real Estate Transactions

Create Dataset

I'll create a small dataset of 5 real estate transactions that include a unique transaction id for each purchase, a close date for each sale, the buyer's name and seller's name.

Notice how Julia was the buyer for transaction id 1 and later a seller for transaction id 2.

data = {'transaction_id': [1, 2, 3, 4, 5], 
        'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"], 
        'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
       'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
       }
df = pd.DataFrame(data)

View entire df.

df
transaction_id close_date buyer_name seller_name
0 1 2012-08-01 Julia Lara
1 2 2012-08-02 Joe Julia
2 3 2012-08-03 Jake Barbara
3 4 2012-08-04 Jamie Emily
4 5 2012-08-04 Jackie Mason

Find People Who Were Both Buyers and Sellers

Often times, people buy homes and then later sell that homes. In this dataset, I'm curious, which people both bought and sold a home? We noticed earlier Julia bought a home and later sold one so Julia's name should be the only result.

One method of finding a solution is to do a self join. In pandas, the DataFrame object has a merge() method. Below, for df, for the merge method, I'll set the following arguments:

  • right=df so that the first df listed in the statement merges with another DataFrame, df
  • left_on='buyer_name' is the column to join from the left df
  • right_on='seller_name is the column to join from the right df

By default, these arguments are also set in the merge method:

  • how='inner' so returned results only show records in which the left df has a value in buyer_name equivalent to the right df with a value of seller_name.
  • suffixes=(‘_x’, ‘_y’) so _x is appended to the end of column names from our left df if those column names originally match the right df. _y is appended to the end of column names from our right df if those column names originally match the left df.
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2
transaction_id_x close_date_x buyer_name_x seller_name_x transaction_id_y close_date_y buyer_name_y seller_name_y
0 1 2012-08-01 Julia Lara 2 2012-08-02 Joe Julia

Our output of df2 shows in a single record, the details of Julia who bought a home and sold a home.

We can find all unique values in the buyer_name_x field to programmatically arrive at our result.

df2['buyer_name_x'].unique()
array(['Julia'], dtype=object)

Example 2: Intermediate Real Estate Transactions

Append New Row to Dataset

Below, I create a new row for another real estate transaction in which Julia buys a 2nd home.

df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]

View new df with additional row.

df
transaction_id close_date buyer_name seller_name
0 1 2012-08-01 Julia Lara
1 2 2012-08-02 Joe Julia
2 3 2012-08-03 Jake Barbara
3 4 2012-08-04 Jamie Emily
4 5 2012-08-04 Jackie Mason
5 6 2012-08-05 Julia Mary

Find People Who Are Both Buyers and Sellers

This is the same ask as with Example 1. However, our dataset is slightly different so a self join will return different results.

I'll use the same code to perform a self join but assign the output to df instead.

df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3
transaction_id_x close_date_x buyer_name_x seller_name_x transaction_id_y close_date_y buyer_name_y seller_name_y
0 1 2012-08-01 Julia Lara 2 2012-08-02 Joe Julia
1 6 2012-08-05 Julia Mary 2 2012-08-02 Joe Julia

There are two records!

The first record indicates Julia's purchase for transaction_id of 1 and later a sale with transaction_id of 2.

The second record indicates Julia's purchase for transaction_id of 6 and later a sale with transaction_id of 2.

This is the correct output as I wanted all rows of df to be joined with df in which a buyer_name from the left df is equivalent to a seller_name from the right df.

I can find all unique values of the buyer_name_x field to programmatically arrive at our result.

df3['buyer_name_x'].unique()
array(['Julia'], dtype=object)