Pandas rank() Method: Equivalent to ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE() SQL Window Functions¶

Date published: 2019-04-15

Category: Data Analysis

Subcategory: Data Wrangling

Tags: rank, pandas, python

In SQL, popular window functions include: ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE(). These are helpful for creating a new column that's a rank of some other values in a column, perhaps partitioned by one or multiple groups. You can learn about these SQL window functions via Mode's SQL tutorial.

Similarly, using pandas in Python, the rank() method for a series provides similar utility to the SQL window functions listed above.

In this tutorial, I'll cover the rank() method in pandas with an example of real estate transactions data and later quiz scores.

Setup Code¶

Import Module¶

In [3]:

                
                    Copied!
                    
import pandas as pd
import seaborn as sns
import pandas as pd
import seaborn as sns

Gradient Used to Highlight Table Outputs¶

In [4]:

                
                    Copied!
                    
cm = sns.light_palette("lightgreen", as_cmap=True)
cm = sns.light_palette("lightgreen", as_cmap=True)

Example 1: Count of New Sellers Per Day¶

Create Real Estate Transaction Dataset¶

I'll create a small dataset of 8 real estate transactions that include a close date and seller name for each house sale.

In [5]:

                
                    Copied!
                    
data = {'close_date': ["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"],
        'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"]
       }
df = pd.DataFrame(data)
data = {'close_date': ["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"],
        'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"]
       }
df = pd.DataFrame(data)

Convert close_date column to datetime type in Python.

In [6]:

                
                    Copied!
                    
df['close_date'] = pd.to_datetime(df['close_date'])
df['close_date'] = pd.to_datetime(df['close_date'])

View df.

In [7]:

Out[7]:

	close_date	seller_name
0	2012-08-01	Lara
1	2012-08-01	Julia
2	2012-08-01	Julia
3	2012-08-02	Emily
4	2012-08-03	Julia
5	2012-08-04	Lara
6	2012-08-05	Julia
7	2012-08-07	Julia

Find Rank of Home Close Date by Each Seller¶

The rank action can be described as ranking the close dates within each seller group, based on the chronological order of the home sales.

I group by the seller_name column, and apply the rank() method to the close_date colummn. Set argument method to first meaning which will rank house sales by close_date by ascending order. This is equivalent to the ROW_NUMBER() window function in SQL.

In [8]:

                
                    Copied!
                    
df['seller__sale_date_rank'] = df.groupby('seller_name')['close_date'].rank(method='first')
df['seller__sale_date_rank'] = df.groupby('seller_name')['close_date'].rank(method='first')

Convert that new column from float to integer so it's easier to read.

In [9]:

                
                    Copied!
                    
df['seller__sale_date_rank'] = pd.to_numeric(df['seller__sale_date_rank'], downcast='integer')
df['seller__sale_date_rank'] = pd.to_numeric(df['seller__sale_date_rank'], downcast='integer')

To understand rank better, let's just examine Julia's sales. Now we have a rank column to represent the order for each of Julia's close dates. There were two closes on 2012-08-01, a "tie", and ranks are 1 and 2.

In [10]:

                
                    Copied!
                    
df[df['seller_name']=='Julia'].sort_values('seller__sale_date_rank')
df[df['seller_name']=='Julia'].sort_values('seller__sale_date_rank')

Out[10]:

	close_date	seller_name	seller__sale_date_rank
1	2012-08-01	Julia	1
2	2012-08-01	Julia	2
4	2012-08-03	Julia	3
6	2012-08-05	Julia	4
7	2012-08-07	Julia	5

Find Count of New Sellers Per Seller Per Day¶

I can utilize the rankings above to find the count of new sellers by day.

For example, Julia is a new home seller on August 1st because she has a rank of 1 that day. Julia would not be counted as a new home seller on August 3rd because she has a rank of 3 that day.

I can filter df by record in which seller__sale_date_rank is equal to 1. The three records for Lara, Julia and Emily show the close_date for each in which they sold their first home.

In [11]:

                
                    Copied!
                    
df[df['seller__sale_date_rank']==1]
df[df['seller__sale_date_rank']==1]

Out[11]:

	close_date	seller_name	seller__sale_date_rank
0	2012-08-01	Lara	1
1	2012-08-01	Julia	1
3	2012-08-02	Emily	1

I can also calculate the count of seller first-time sales by day too. Let's group the dataframe output above by close_date and calculate the size of each group.

Similar to above, on 2012-08-01, we can see 2 new sellers representing Lara and Julia. On 2012-08-02, we can see 1 new seller representing Emily.

In [12]:

                
                    Copied!
                    
df[df['seller__sale_date_rank']==1].groupby('close_date').size().reset_index().rename(columns={0: 'count_new_sellers_of_homes'})
df[df['seller__sale_date_rank']==1].groupby('close_date').size().reset_index().rename(columns={0: 'count_new_sellers_of_homes'})

Out[12]:

	close_date	count_new_sellers_of_homes
0	2012-08-01	2
1	2012-08-02	1

Example 2: Count of New Sellers By Agency Per Day¶

In example 1, I wanted the count of new sellers by day. In this example 2, I want the count of sellers by agency per day.

Below, I add a new column called agency.

In [13]:

                
                    Copied!
                    
df['agency'] = ["Fifer", "Fifer", "Fifer", "HomeSales", "HomeSales", "Fifer", "Fifer", "Fifer"]
df['agency'] = ["Fifer", "Fifer", "Fifer", "HomeSales", "HomeSales", "Fifer", "Fifer", "Fifer"]

View df.

In [14]:

Out[14]:

	close_date	seller_name	seller__sale_date_rank	agency
0	2012-08-01	Lara	1	Fifer
1	2012-08-01	Julia	1	Fifer
2	2012-08-01	Julia	2	Fifer
3	2012-08-02	Emily	1	HomeSales
4	2012-08-03	Julia	3	HomeSales
5	2012-08-04	Lara	2	Fifer
6	2012-08-05	Julia	4	Fifer
7	2012-08-07	Julia	5	Fifer

I want to create the rank value of home sales by agency per seller ordered by close date. This is valuable so each agency can understand the count of new sellers per day. If you analyzed this data, you could answer the question: "When was Julia's 3rd home sale with the agency Fifer?"

Below, I group by two fields, agency and then seller_name and find a rank value ordered by close_date. I store these rank values in a new column called agency_seller__sale_date_rank.

In [15]:

                
                    Copied!
                    
df['agency_seller__sale_date_rank'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='first')
df['agency_seller__sale_date_rank'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='first')

Convert the new rank column from float to numeric so it's easier to read.

In [16]:

                
                    Copied!
                    
df['agency_seller__sale_date_rank'] = pd.to_numeric(df['agency_seller__sale_date_rank'], downcast='integer')
df['agency_seller__sale_date_rank'] = pd.to_numeric(df['agency_seller__sale_date_rank'], downcast='integer')

View df.

View Julia's home sales with Fifer.

Our earlier question: "When was Julia's 3rd home sale with the agency Fifer?" It was 2012-08-05.

In [17]:

                
                    Copied!
                    
df[(df['seller_name']=='Julia') & (df['agency']=='Fifer')]
df[(df['seller_name']=='Julia') & (df['agency']=='Fifer')]

Out[17]:

	close_date	seller_name	seller__sale_date_rank	agency	agency_seller__sale_date_rank
1	2012-08-01	Julia	1	Fifer	1
2	2012-08-01	Julia	2	Fifer	2
6	2012-08-05	Julia	4	Fifer	3
7	2012-08-07	Julia	5	Fifer	4

Example 3: Pandas Rank `method='min'` Comparison¶

The pandas rank() method has an argument method that can be set to other values than first. One example is to min. This is equivalent to the RANK() window function in SQL.

I'll re-create the original column of seller's rank of sales by close date.

In [18]:

                
                    Copied!
                    
df2 = pd.DataFrame(data)
df2['close_date'] = pd.to_datetime(df2['close_date'])
df2['seller__sale_date_rank_min'] = df2.groupby('seller_name')['close_date'].rank(method='min')
df2['seller__sale_date_rank_min'] = pd.to_numeric(df2['seller__sale_date_rank_min'], downcast='integer')
df2 = pd.DataFrame(data)
df2['close_date'] = pd.to_datetime(df2['close_date'])
df2['seller__sale_date_rank_min'] = df2.groupby('seller_name')['close_date'].rank(method='min')
df2['seller__sale_date_rank_min'] = pd.to_numeric(df2['seller__sale_date_rank_min'], downcast='integer')

View df2.

In [19]:

Out[19]:

	close_date	seller_name	seller__sale_date_rank_min
0	2012-08-01	Lara	1
1	2012-08-01	Julia	1
2	2012-08-01	Julia	1
3	2012-08-02	Emily	1
4	2012-08-03	Julia	3
5	2012-08-04	Lara	2
6	2012-08-05	Julia	4
7	2012-08-07	Julia	5

Notice how with method='min', in the column min_rank_agency_seller_by_close_date, Julia's two home sales on August 1, 2012 are both given a tied rank of 1.

Then, the min rank value skips a value of 2 and Julia's next sale on August 5, 2012 has a value of 3.

In [20]:

                
                    Copied!
                    
df2[df2['seller_name']=='Julia']
df2[df2['seller_name']=='Julia']

Out[20]:

	close_date	seller_name	seller__sale_date_rank_min
1	2012-08-01	Julia	1
2	2012-08-01	Julia	1
4	2012-08-03	Julia	3
6	2012-08-05	Julia	4
7	2012-08-07	Julia	5

The use case for method='first' versus method='min' varies on a case by case basis. It's important to understand your data well to make sure you utilize the correct one.

Example 4: Pandas Rank `method='dense'` Comparison¶

The pandas rank() method has an argument method that can be set to 'dense'. This is different than method set to either 'first' or 'min'.

In [21]:

                
                    Copied!
                    
df3 = pd.DataFrame(data)
df3['close_date'] = pd.to_datetime(df3['close_date'])
df3['seller__sale_date_rank_dense'] = df3.groupby('seller_name')['close_date'].rank(method='dense')
df3['seller__sale_date_rank_dense'] = pd.to_numeric(df3['seller__sale_date_rank_dense'], downcast='integer')
df3 = pd.DataFrame(data)
df3['close_date'] = pd.to_datetime(df3['close_date'])
df3['seller__sale_date_rank_dense'] = df3.groupby('seller_name')['close_date'].rank(method='dense')
df3['seller__sale_date_rank_dense'] = pd.to_numeric(df3['seller__sale_date_rank_dense'], downcast='integer')

View df3.

In [22]:

Out[22]:

	close_date	seller_name	seller__sale_date_rank_dense
0	2012-08-01	Lara	1
1	2012-08-01	Julia	1
2	2012-08-01	Julia	1
3	2012-08-02	Emily	1
4	2012-08-03	Julia	2
5	2012-08-04	Lara	2
6	2012-08-05	Julia	3
7	2012-08-07	Julia	4

Notice how with method='dense', in the column dense_rank_agency_seller_by_close_date, Julia's two home sales on August 1, 2012 are both given a tied rank of 1.

Then, the dense rank value does not skip a value of 2 and Julia's next sale on August 5, 2012 has a value of 2. dense is different than min!

In [23]:

                
                    Copied!
                    
df3[df3['seller_name']=='Julia']
df3[df3['seller_name']=='Julia']

Out[23]:

	close_date	seller_name	seller__sale_date_rank_dense
1	2012-08-01	Julia	1
2	2012-08-01	Julia	1
4	2012-08-03	Julia	2
6	2012-08-05	Julia	3
7	2012-08-07	Julia	4

The use case for 'dense' varies on a case by case basis. A second reminder - please understand your data well and why you'd choose 'dense' over other options.

Example 5: Pandas Rank `pct=True`¶

The pandas series rank() method has another argument called pct that can be set to True or False. By default, it's False. This pct argument computes the percentage rank of data.

To set the argument pct=True is similar to the NTILE(100) window function in SQL.

Create Class Quiz Score Data¶

Below I create fictional data for 11 students in a class that took the same quiz. The max possible score is 100.

I remember in school, teachers told me I was in a certain percentile based on my GPA or score on a quiz. Therefore, I created a dataset I'll remember in relation to percentile rankings!

For ranking students on a curve, I recommend using the method='dense' option because it assigns the same rank to equal quiz scores and does not leave gaps between the ranks; so it's a more accurate representation of students' performance relative to their peers.

In [30]:

                
                    Copied!
                    
class_quiz_data = {'name': ["Jamie", "Joe", "Jake", "Jill", "James", "Dan", "Joy", "Jared", "Jill", "Jaime", "Jay"],
                   'score': [80, 85, 74, 100, 98, 91, 89, 90, 65, 84, 85]
                  }
df_quiz_scores = pd.DataFrame(class_quiz_data)
class_quiz_data = {'name': ["Jamie", "Joe", "Jake", "Jill", "James", "Dan", "Joy", "Jared", "Jill", "Jaime", "Jay"],
                   'score': [80, 85, 74, 100, 98, 91, 89, 90, 65, 84, 85]
                  }
df_quiz_scores = pd.DataFrame(class_quiz_data)

View df_quiz_scores as a dataframe.

In [31]:

                
                    Copied!
                    
df_quiz_scores
df_quiz_scores

Out[31]:

	name	score
0	Jamie	80
1	Joe	85
2	Jake	74
3	Jill	100
4	James	98
5	Dan	91
6	Joy	89
7	Jared	90
8	Jill	65
9	Jaime	84
10	Jay	85

Find the Percent Rank of Each Score in the Class¶

Below I create a new column called score_percent_rank that's a ranking of scores in the score column.

In [32]:

                
                    Copied!
                    
df_quiz_scores['score_percent_rank'] = df_quiz_scores['score'].rank(pct=True, method='dense')
df_quiz_scores['score_percent_rank'] = df_quiz_scores['score'].rank(pct=True, method='dense')

View df_quiz_scores as a dataframe below. Notice how a high quiz score by Dan of 91 has a rank value of 0.81. I think of this as the 81st percentile.

In [33]:

                
                    Copied!
                    
df_quiz_scores
df_quiz_scores

Out[33]:

	name	score	score_percent_rank
0	Jamie	80	0.3
1	Joe	85	0.5
2	Jake	74	0.2
3	Jill	100	1.0
4	James	98	0.9
5	Dan	91	0.8
6	Joy	89	0.6
7	Jared	90	0.7
8	Jill	65	0.1
9	Jaime	84	0.4
10	Jay	85	0.5

To make this rank easier to understand, I will multiply all these values by $100$ and convert the column to an integer data type so it's easier to read.

In [28]:

                
                    Copied!
                    
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percent_rank']*100
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percentile_rank'].astype('int')
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percent_rank']*100
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percentile_rank'].astype('int')

Dan's score of 91 has a score score_percentile_rank of 81.

Jay's score of 85 has a percentile rank of 50. So any score below 85 scored in the "bottom half" on the quiz.

In [29]:

                
                    Copied!
                    
df_quiz_scores
df_quiz_scores

Out[29]:

	name	score	score_percent_rank	score_percentile_rank
0	Jamie	80	0.3	30
1	Joe	85	0.5	50
2	Jake	74	0.2	20
3	Jill	100	1.0	100
4	James	98	0.9	90
5	Dan	91	0.8	80
6	Joy	89	0.6	60
7	Jared	90	0.7	70
8	Jill	65	0.1	10
9	Jaime	84	0.4	40
10	Jay	85	0.5	50

How the dense rank calculation to percentile works¶

First, sort the quiz scores in descending order: [100, 98, 91, 90, 89, 85, 85, 84, 80, 74, 65]
Assign a rank to each score without leaving gaps between the ranks for equal values, and compute the percentile rank by dividing the rank value by the total number of distinct scores:
- 100: rank 1, percentile rank: 1/10 = 0.10
- 98: rank 2, percentile rank: 2/10 = 0.20
- 91: rank 3, percentile rank: 3/10 = 0.30
- 90: rank 4, percentile rank: 4/10 = 0.40
- 89: rank 5, percentile rank: 5/10 = 0.50
- 85: rank 6, percentile rank: 6/10 = 0.60
- 85: rank 6, percentile rank: 6/10 = 0.60
- 84: rank 7, percentile rank: 7/10 = 0.70
- 80: rank 8, percentile rank: 8/10 = 0.80
- 74: rank 9, percentile rank: 9/10 = 0.90
- 65: rank 10, percentile rank: 10/10 = 1.00

In this example, the 'dense' method with pct=True assigns the same percentile rank to equal scores (e.g., Joe and Jay both have a score of 85 and receive a percentile rank of 0.60) and does not leave gaps between the ranks (e.g., Jaime's score of 84 gets a percentile rank of 0.70, right after Joe and Jay).

This approach provides students with their relative standing in terms of percentile, making it easier to understand their performance compared to their peers.

Pandas rank() Method: Equivalent to ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE() SQL Window Functions¶

Setup Code¶

Import Module¶

Gradient Used to Highlight Table Outputs¶

Example 1: Count of New Sellers Per Day¶

Create Real Estate Transaction Dataset¶

Find Rank of Home Close Date by Each Seller¶

Find Count of New Sellers Per Seller Per Day¶

Example 2: Count of New Sellers By Agency Per Day¶

Example 3: Pandas Rank method='min' Comparison¶

Example 4: Pandas Rank method='dense' Comparison¶

Example 5: Pandas Rank pct=True¶

Create Class Quiz Score Data¶

Find the Percent Rank of Each Score in the Class¶

How the dense rank calculation to percentile works¶

Example 3: Pandas Rank `method='min'` Comparison¶

Example 4: Pandas Rank `method='dense'` Comparison¶

Example 5: Pandas Rank `pct=True`¶