Pandas rank() Method: Equivalent to ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE() SQL Window Functions¶
Date published: 2019-04-15
Category: Data Analysis
Subcategory: Data Wrangling
Tags: rank, pandas, python
In SQL, popular window functions include: ROW_NUMBER()
, RANK()
, DENSE_RANK()
and NTILE()
. These are helpful for creating a new column that's a rank of some other values in a column, perhaps partitioned by one or multiple groups. You can learn about these SQL window functions via Mode's SQL tutorial.
Similarly, using pandas in Python, the rank() method for a series provides similar utility to the SQL window functions listed above.
In this tutorial, I'll cover the rank()
method in pandas with an example of real estate transactions data and later quiz scores.
Setup Code¶
Import Module¶
import pandas as pd
import seaborn as sns
Gradient Used to Highlight Table Outputs¶
cm = sns.light_palette("lightgreen", as_cmap=True)
Example 1: Count of New Sellers Per Day¶
Create Real Estate Transaction Dataset¶
I'll create a small dataset of 8 real estate transactions that include a close date and seller name for each house sale.
data = {'close_date': ["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"],
'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"]
}
df = pd.DataFrame(data)
Convert close_date
column to datetime type in Python.
df['close_date'] = pd.to_datetime(df['close_date'])
View df
.
df
close_date | seller_name | |
---|---|---|
0 | 2012-08-01 | Lara |
1 | 2012-08-01 | Julia |
2 | 2012-08-01 | Julia |
3 | 2012-08-02 | Emily |
4 | 2012-08-03 | Julia |
5 | 2012-08-04 | Lara |
6 | 2012-08-05 | Julia |
7 | 2012-08-07 | Julia |
Find Rank of Home Close Date by Each Seller¶
The rank action can be described as ranking the close dates within each seller group, based on the chronological order of the home sales.
I group by the seller_name
column, and apply the rank()
method to the close_date
colummn. Set argument method
to first
meaning which will rank house sales by close_date
by ascending order. This is equivalent to the ROW_NUMBER()
window function in SQL.
df['seller__sale_date_rank'] = df.groupby('seller_name')['close_date'].rank(method='first')
Convert that new column from float to integer so it's easier to read.
df['seller__sale_date_rank'] = pd.to_numeric(df['seller__sale_date_rank'], downcast='integer')
To understand rank better, let's just examine Julia's sales. Now we have a rank column to represent the order for each of Julia's close dates. There were two closes on 2012-08-01
, a "tie", and ranks are 1
and 2
.
df[df['seller_name']=='Julia'].sort_values('seller__sale_date_rank')
close_date | seller_name | seller__sale_date_rank | |
---|---|---|---|
1 | 2012-08-01 | Julia | 1 |
2 | 2012-08-01 | Julia | 2 |
4 | 2012-08-03 | Julia | 3 |
6 | 2012-08-05 | Julia | 4 |
7 | 2012-08-07 | Julia | 5 |
Find Count of New Sellers Per Seller Per Day¶
I can utilize the rankings above to find the count of new sellers by day.
For example, Julia is a new home seller on August 1st because she has a rank of 1
that day. Julia would not be counted as a new home seller on August 3rd because she has a rank of 3
that day.
I can filter df
by record in which seller__sale_date_rank
is equal to 1
. The three records for Lara, Julia and Emily show the close_date
for each in which they sold their first home.
df[df['seller__sale_date_rank']==1]
close_date | seller_name | seller__sale_date_rank | |
---|---|---|---|
0 | 2012-08-01 | Lara | 1 |
1 | 2012-08-01 | Julia | 1 |
3 | 2012-08-02 | Emily | 1 |
I can also calculate the count of seller first-time sales by day too. Let's group the dataframe output above by close_date
and calculate the size of each group.
Similar to above, on 2012-08-01
, we can see 2
new sellers representing Lara and Julia. On 2012-08-02
, we can see 1
new seller representing Emily.
df[df['seller__sale_date_rank']==1].groupby('close_date').size().reset_index().rename(columns={0: 'count_new_sellers_of_homes'})
close_date | count_new_sellers_of_homes | |
---|---|---|
0 | 2012-08-01 | 2 |
1 | 2012-08-02 | 1 |
Example 2: Count of New Sellers By Agency Per Day¶
In example 1, I wanted the count of new sellers by day. In this example 2, I want the count of sellers by agency per day.
Below, I add a new column called agency
.
df['agency'] = ["Fifer", "Fifer", "Fifer", "HomeSales", "HomeSales", "Fifer", "Fifer", "Fifer"]
View df
.
df
close_date | seller_name | seller__sale_date_rank | agency | |
---|---|---|---|---|
0 | 2012-08-01 | Lara | 1 | Fifer |
1 | 2012-08-01 | Julia | 1 | Fifer |
2 | 2012-08-01 | Julia | 2 | Fifer |
3 | 2012-08-02 | Emily | 1 | HomeSales |
4 | 2012-08-03 | Julia | 3 | HomeSales |
5 | 2012-08-04 | Lara | 2 | Fifer |
6 | 2012-08-05 | Julia | 4 | Fifer |
7 | 2012-08-07 | Julia | 5 | Fifer |
I want to create the rank value of home sales by agency per seller ordered by close date. This is valuable so each agency can understand the count of new sellers per day. If you analyzed this data, you could answer the question: "When was Julia's 3rd home sale with the agency Fifer?"
Below, I group by two fields, agency
and then seller_name
and find a rank value ordered by close_date
. I store these rank values in a new column called agency_seller__sale_date_rank
.
df['agency_seller__sale_date_rank'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='first')
Convert the new rank column from float to numeric so it's easier to read.
df['agency_seller__sale_date_rank'] = pd.to_numeric(df['agency_seller__sale_date_rank'], downcast='integer')
View df
.
View Julia's home sales with Fifer.
Our earlier question: "When was Julia's 3rd home sale with the agency Fifer?" It was 2012-08-05
.
df[(df['seller_name']=='Julia') & (df['agency']=='Fifer')]
close_date | seller_name | seller__sale_date_rank | agency | agency_seller__sale_date_rank | |
---|---|---|---|---|---|
1 | 2012-08-01 | Julia | 1 | Fifer | 1 |
2 | 2012-08-01 | Julia | 2 | Fifer | 2 |
6 | 2012-08-05 | Julia | 4 | Fifer | 3 |
7 | 2012-08-07 | Julia | 5 | Fifer | 4 |
Example 3: Pandas Rank method='min'
Comparison¶
The pandas rank()
method has an argument method
that can be set to other values than first
. One example is to min
. This is equivalent to the RANK()
window function in SQL.
I'll re-create the original column of seller's rank of sales by close date.
df2 = pd.DataFrame(data)
df2['close_date'] = pd.to_datetime(df2['close_date'])
df2['seller__sale_date_rank_min'] = df2.groupby('seller_name')['close_date'].rank(method='min')
df2['seller__sale_date_rank_min'] = pd.to_numeric(df2['seller__sale_date_rank_min'], downcast='integer')
View df2
.
df2
close_date | seller_name | seller__sale_date_rank_min | |
---|---|---|---|
0 | 2012-08-01 | Lara | 1 |
1 | 2012-08-01 | Julia | 1 |
2 | 2012-08-01 | Julia | 1 |
3 | 2012-08-02 | Emily | 1 |
4 | 2012-08-03 | Julia | 3 |
5 | 2012-08-04 | Lara | 2 |
6 | 2012-08-05 | Julia | 4 |
7 | 2012-08-07 | Julia | 5 |
Notice how with method='min'
, in the column min_rank_agency_seller_by_close_date
, Julia's two home sales on August 1, 2012 are both given a tied rank of 1.
Then, the min rank value skips a value of 2
and Julia's next sale on August 5, 2012 has a value of 3
.
df2[df2['seller_name']=='Julia']
close_date | seller_name | seller__sale_date_rank_min | |
---|---|---|---|
1 | 2012-08-01 | Julia | 1 |
2 | 2012-08-01 | Julia | 1 |
4 | 2012-08-03 | Julia | 3 |
6 | 2012-08-05 | Julia | 4 |
7 | 2012-08-07 | Julia | 5 |
The use case for method='first'
versus method='min'
varies on a case by case basis. It's important to understand your data well to make sure you utilize the correct one.
Example 4: Pandas Rank method='dense'
Comparison¶
The pandas rank()
method has an argument method
that can be set to 'dense'
. This is different than method
set to either 'first'
or 'min'
.
df3 = pd.DataFrame(data)
df3['close_date'] = pd.to_datetime(df3['close_date'])
df3['seller__sale_date_rank_dense'] = df3.groupby('seller_name')['close_date'].rank(method='dense')
df3['seller__sale_date_rank_dense'] = pd.to_numeric(df3['seller__sale_date_rank_dense'], downcast='integer')
View df3
.
df3
close_date | seller_name | seller__sale_date_rank_dense | |
---|---|---|---|
0 | 2012-08-01 | Lara | 1 |
1 | 2012-08-01 | Julia | 1 |
2 | 2012-08-01 | Julia | 1 |
3 | 2012-08-02 | Emily | 1 |
4 | 2012-08-03 | Julia | 2 |
5 | 2012-08-04 | Lara | 2 |
6 | 2012-08-05 | Julia | 3 |
7 | 2012-08-07 | Julia | 4 |
Notice how with method='dense'
, in the column dense_rank_agency_seller_by_close_date
, Julia's two home sales on August 1, 2012 are both given a tied rank of 1
.
Then, the dense rank value does not skip a value of 2
and Julia's next sale on August 5, 2012 has a value of 2
. dense
is different than min
!
df3[df3['seller_name']=='Julia']
close_date | seller_name | seller__sale_date_rank_dense | |
---|---|---|---|
1 | 2012-08-01 | Julia | 1 |
2 | 2012-08-01 | Julia | 1 |
4 | 2012-08-03 | Julia | 2 |
6 | 2012-08-05 | Julia | 3 |
7 | 2012-08-07 | Julia | 4 |
The use case for 'dense'
varies on a case by case basis. A second reminder - please understand your data well and why you'd choose 'dense'
over other options.
Example 5: Pandas Rank pct=True
¶
The pandas series rank()
method has another argument called pct
that can be set to True
or False
. By default, it's False
. This pct
argument computes the percentage rank of data.
To set the argument pct=True
is similar to the NTILE(100)
window function in SQL.
Create Class Quiz Score Data¶
Below I create fictional data for 11 students in a class that took the same quiz. The max possible score is 100
.
I remember in school, teachers told me I was in a certain percentile based on my GPA or score on a quiz. Therefore, I created a dataset I'll remember in relation to percentile rankings!
For ranking students on a curve, I recommend using the method='dense'
option because it assigns the same rank to equal quiz scores and does not leave gaps between the ranks; so it's a more accurate representation of students' performance relative to their peers.
class_quiz_data = {'name': ["Jamie", "Joe", "Jake", "Jill", "James", "Dan", "Joy", "Jared", "Jill", "Jaime", "Jay"],
'score': [80, 85, 74, 100, 98, 91, 89, 90, 65, 84, 85]
}
df_quiz_scores = pd.DataFrame(class_quiz_data)
View df_quiz_scores
as a dataframe.
df_quiz_scores
name | score | |
---|---|---|
0 | Jamie | 80 |
1 | Joe | 85 |
2 | Jake | 74 |
3 | Jill | 100 |
4 | James | 98 |
5 | Dan | 91 |
6 | Joy | 89 |
7 | Jared | 90 |
8 | Jill | 65 |
9 | Jaime | 84 |
10 | Jay | 85 |
Find the Percent Rank of Each Score in the Class¶
Below I create a new column called score_percent_rank
that's a ranking of scores in the score
column.
df_quiz_scores['score_percent_rank'] = df_quiz_scores['score'].rank(pct=True, method='dense')
View df_quiz_scores
as a dataframe below. Notice how a high quiz score by Dan of 91
has a rank value of 0.81
. I think of this as the 81st percentile.
df_quiz_scores
name | score | score_percent_rank | |
---|---|---|---|
0 | Jamie | 80 | 0.3 |
1 | Joe | 85 | 0.5 |
2 | Jake | 74 | 0.2 |
3 | Jill | 100 | 1.0 |
4 | James | 98 | 0.9 |
5 | Dan | 91 | 0.8 |
6 | Joy | 89 | 0.6 |
7 | Jared | 90 | 0.7 |
8 | Jill | 65 | 0.1 |
9 | Jaime | 84 | 0.4 |
10 | Jay | 85 | 0.5 |
To make this rank easier to understand, I will multiply all these values by $100$ and convert the column to an integer data type so it's easier to read.
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percent_rank']*100
df_quiz_scores['score_percentile_rank'] = df_quiz_scores['score_percentile_rank'].astype('int')
Dan's score
of 91
has a score score_percentile_rank
of 81
.
Jay's score of 85
has a percentile rank of 50
. So any score below 85
scored in the "bottom half" on the quiz.
df_quiz_scores
name | score | score_percent_rank | score_percentile_rank | |
---|---|---|---|---|
0 | Jamie | 80 | 0.3 | 30 |
1 | Joe | 85 | 0.5 | 50 |
2 | Jake | 74 | 0.2 | 20 |
3 | Jill | 100 | 1.0 | 100 |
4 | James | 98 | 0.9 | 90 |
5 | Dan | 91 | 0.8 | 80 |
6 | Joy | 89 | 0.6 | 60 |
7 | Jared | 90 | 0.7 | 70 |
8 | Jill | 65 | 0.1 | 10 |
9 | Jaime | 84 | 0.4 | 40 |
10 | Jay | 85 | 0.5 | 50 |
How the dense rank calculation to percentile works¶
First, sort the quiz scores in descending order:
[100, 98, 91, 90, 89, 85, 85, 84, 80, 74, 65]
Assign a rank to each score without leaving gaps between the ranks for equal values, and compute the percentile rank by dividing the rank value by the total number of distinct scores:
- 100: rank 1, percentile rank: 1/10 = 0.10
- 98: rank 2, percentile rank: 2/10 = 0.20
- 91: rank 3, percentile rank: 3/10 = 0.30
- 90: rank 4, percentile rank: 4/10 = 0.40
- 89: rank 5, percentile rank: 5/10 = 0.50
- 85: rank 6, percentile rank: 6/10 = 0.60
- 85: rank 6, percentile rank: 6/10 = 0.60
- 84: rank 7, percentile rank: 7/10 = 0.70
- 80: rank 8, percentile rank: 8/10 = 0.80
- 74: rank 9, percentile rank: 9/10 = 0.90
- 65: rank 10, percentile rank: 10/10 = 1.00
In this example, the 'dense' method with pct=True
assigns the same percentile rank to equal scores (e.g., Joe and Jay both have a score of 85 and receive a percentile rank of 0.60) and does not leave gaps between the ranks (e.g., Jaime's score of 84 gets a percentile rank of 0.70, right after Joe and Jay).
This approach provides students with their relative standing in terms of percentile, making it easier to understand their performance compared to their peers.