Self Join¶

Date published: 2019-04-14

Category: Data Analysis

Subcategory: Data Wrangling

Tags: self join, pandas merge, python, pandas

In SQL, a popular type of join is a self join which joins a table to itself. This is helpful for comparing rows to one another, based on their values in columns, in a single table.

In this article, I'll walk through two examples in which self joins can be helpful.

Import Modules¶

In [16]:

                
                    Copied!
                    
import pandas as pd
import pandas as pd

Example 1: Basic Real Estate Transactions¶

Create Dataset¶

I'll create a small dataset of 5 real estate transactions that include a unique transaction id for each purchase, a close date for each sale, the buyer's name and seller's name.

Notice how Julia was the buyer for transaction id 1 and later a seller for transaction id 2.

In [17]:

                
                    Copied!
                    
                        
                        
                    
                    

            
data = {'transaction_id': [1, 2, 3, 4, 5], 
        'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"], 
        'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
       'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
       }
df = pd.DataFrame(data)
data = {'transaction_id': [1, 2, 3, 4, 5], 
        'close_date': ["2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-04"], 
        'buyer_name': ["Julia", "Joe", "Jake", "Jamie", "Jackie"],
       'seller_name': ["Lara", "Julia", "Barbara", "Emily", "Mason"]
       }
df = pd.DataFrame(data)

View entire df.

In [18]:

Out[18]:

	transaction_id	close_date	buyer_name	seller_name
0	1	2012-08-01	Julia	Lara
1	2	2012-08-02	Joe	Julia
2	3	2012-08-03	Jake	Barbara
3	4	2012-08-04	Jamie	Emily
4	5	2012-08-04	Jackie	Mason

Find People Who Were Both Buyers and Sellers¶

Often times, people buy homes and then later sell that homes. In this dataset, I'm curious, which people both bought and sold a home? We noticed earlier Julia bought a home and later sold one so Julia's name should be the only result.

One method of finding a solution is to do a self join. In pandas, the DataFrame object has a merge() method. Below, for df, for the merge method, I'll set the following arguments:

right=df so that the first df listed in the statement merges with another DataFrame, df
left_on='buyer_name' is the column to join from the left df
right_on='seller_name is the column to join from the right df

By default, these arguments are also set in the merge method:

how='inner' so returned results only show records in which the left df has a value in buyer_name equivalent to the right df with a value of seller_name.
suffixes=(‘_x’, ‘_y’) so _x is appended to the end of column names from our left df if those column names originally match the right df. _y is appended to the end of column names from our right df if those column names originally match the left df.

In [20]:

                
                    Copied!
                    
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2
df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2df2 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df2

Out[20]:

	transaction_id_x	close_date_x	buyer_name_x	seller_name_x	transaction_id_y	close_date_y	buyer_name_y	seller_name_y
0	1	2012-08-01	Julia	Lara	2	2012-08-02	Joe	Julia

Our output of df2 shows in a single record, the details of Julia who bought a home and sold a home.

We can find all unique values in the buyer_name_x field to programmatically arrive at our result.

In [21]:

                
                    Copied!
                    
df2['buyer_name_x'].unique()
df2['buyer_name_x'].unique()

Out[21]:

array(['Julia'], dtype=object)

Example 2: Intermediate Real Estate Transactions¶

Append New Row to Dataset¶

Below, I create a new row for another real estate transaction in which Julia buys a 2nd home.

In [24]:

                
                    Copied!
                    
df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]
df.loc[5] = [6, "2012-08-05", "Julia", "Mary"]

View new df with additional row.

In [26]:

Out[26]:

	transaction_id	close_date	buyer_name	seller_name
0	1	2012-08-01	Julia	Lara
1	2	2012-08-02	Joe	Julia
2	3	2012-08-03	Jake	Barbara
3	4	2012-08-04	Jamie	Emily
4	5	2012-08-04	Jackie	Mason
5	6	2012-08-05	Julia	Mary

Find People Who Are Both Buyers and Sellers¶

This is the same ask as with Example 1. However, our dataset is slightly different so a self join will return different results.

I'll use the same code to perform a self join but assign the output to df instead.

In [28]:

                
                    Copied!
                    
df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3
df3 = df.merge(right=df, left_on='buyer_name', right_on='seller_name')
df3

Out[28]:

	transaction_id_x	close_date_x	buyer_name_x	seller_name_x	transaction_id_y	close_date_y	buyer_name_y	seller_name_y
0	1	2012-08-01	Julia	Lara	2	2012-08-02	Joe	Julia
1	6	2012-08-05	Julia	Mary	2	2012-08-02	Joe	Julia

There are two records!

The first record indicates Julia's purchase for transaction_id of 1 and later a sale with transaction_id of 2.

The second record indicates Julia's purchase for transaction_id of 6 and later a sale with transaction_id of 2.

This is the correct output as I wanted all rows of df to be joined with df in which a buyer_name from the left df is equivalent to a seller_name from the right df.

I can find all unique values of the buyer_name_x field to programmatically arrive at our result.

In [29]:

                
                    Copied!
                    
df3['buyer_name_x'].unique()
df3['buyer_name_x'].unique()

Out[29]:

array(['Julia'], dtype=object)