Pandas rank() Method: Equivalent to ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE() SQL Window Functions

In SQL, popular window functions include ROW_NUMBER(), RANK(), DENSE_RANK() and NTILE(). These are helpful for creating a new column that ranks the values of another column, optionally partitioned by one or more groups. You can learn about these SQL window functions via Mode's SQL tutorial.

Similarly, using pandas in Python, the rank() method for a series provides similar utility to the SQL window functions listed above.

In this tutorial, I'll cover the rank() method in pandas with an example of real estate transaction data and, later, quiz scores.

Setup Code

Import Module

import pandas as pd
import seaborn as sns

Gradient Used to Highlight Table Outputs

cm = sns.light_palette("lightgreen", as_cmap=True)

Example 1: Count of New Sellers Per Day

Create Real Estate Transaction Dataset

I'll create a small dataset of 8 real estate transactions that include a close date and seller name for each house sale.

Take note that Julia sold 5 homes with some transactions on separate days. Lara also sold 2 homes on different days.

data = {'close_date': ["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02", "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"],
        'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"]
       }
df = pd.DataFrame(data)

View entire df.

df
close_date seller_name
0 2012-08-01 Lara
1 2012-08-01 Julia
2 2012-08-01 Julia
3 2012-08-02 Emily
4 2012-08-03 Julia
5 2012-08-04 Lara
6 2012-08-05 Julia
7 2012-08-07 Julia

Convert close_date column to datetime type in Python.

df['close_date'] = pd.to_datetime(df['close_date'])

Find Rank of Homes Sold for Each Seller by Close Date

I want to know the rank of each house sale by person ordered by time.

Looking at the table above, Julia sold her first and second houses on August 1st. On August 3rd, Julia sold her third house.

For the df dataframe, I want to create a new column that shows the numerical rank values for each person's home sales by day. To do this, I group by the seller_name column and apply the rank() method to the close_date column. I set the rank() argument method='first' to rank each person's house sales, ordered by date, in the order they appear. The new column with rank values is called rank_seller_by_close_date.

The method='first' for the rank() method for pandas series is equivalent to the ROW_NUMBER() window function in SQL.

df['rank_seller_by_close_date'] = df.groupby('seller_name')['close_date'].rank(method='first')

Below I output df and highlight the rank_seller_by_close_date with shades of green designated by numerical value. The higher the rank value, the darker the green highlighted background. df is ordered by close_date.

df.style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['rank_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date
0 2012-08-01 00:00:00 Lara 1
1 2012-08-01 00:00:00 Julia 1
2 2012-08-01 00:00:00 Julia 2
3 2012-08-02 00:00:00 Emily 1
4 2012-08-03 00:00:00 Julia 3
5 2012-08-04 00:00:00 Lara 2
6 2012-08-05 00:00:00 Julia 4
7 2012-08-07 00:00:00 Julia 5

It's difficult to easily see the order of ranks for each seller by date. Below, I output df sorted by seller_name and then rank_seller_by_close_date.

Much better! It's clear Julia's home sales are ranked from oldest to newest. Even though she made two home sales on August 1st, one is given a rank value of 1 and the next 2.

df.sort_values(by=['seller_name', 'rank_seller_by_close_date']).style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['rank_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date
3 2012-08-02 00:00:00 Emily 1
1 2012-08-01 00:00:00 Julia 1
2 2012-08-01 00:00:00 Julia 2
4 2012-08-03 00:00:00 Julia 3
6 2012-08-05 00:00:00 Julia 4
7 2012-08-07 00:00:00 Julia 5
0 2012-08-01 00:00:00 Lara 1
5 2012-08-04 00:00:00 Lara 2
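The method='first' behavior is easiest to see in isolation. Below is a minimal standalone sketch with made-up dates: even though two values tie, method='first' still assigns distinct ranks in order of appearance, just like SQL's ROW_NUMBER().

```python
import pandas as pd

# Two sales share a date, but method='first' breaks the tie
# by order of appearance - like SQL's ROW_NUMBER().
s = pd.Series(pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02"]))
row_number = s.rank(method="first")
print(row_number.tolist())  # [1.0, 2.0, 3.0]
```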

Find Count of New Sellers Per Day

I can utilize the rankings above to find the count of new sellers by day.

For example, Julia is a new home seller on August 1st because she has a rank of 1 that day. Julia would not be counted as a new home seller on August 3rd because she has a rank of 3 that day.

I can filter df for records in which rank_seller_by_close_date is equal to 1. The three records for Lara, Julia and Emily show the close_date on which each sold their first home.

df[df['rank_seller_by_close_date']==1]
close_date seller_name rank_seller_by_close_date
0 2012-08-01 Lara 1.0
1 2012-08-01 Julia 1.0
3 2012-08-02 Emily 1.0

Let's take the table above and group by close_date to find the count of close_date observations. Below is the output.

Similar to above, on 2012-08-01, we can see 2 new sellers representing Lara and Julia. On 2012-08-02, we can see 1 new seller representing Emily.

df[df['rank_seller_by_close_date']==1].groupby('close_date').size().reset_index().rename(columns={0: 'count_new_sellers_of_homes'})
close_date count_new_sellers_of_homes
0 2012-08-01 2
1 2012-08-02 1
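The same counts can be cross-checked without rank() at all. Below is a hedged sketch that keeps each seller's earliest transaction with drop_duplicates (after sorting by close_date) and then counts sellers per day; it rebuilds the toy data so it runs standalone.

```python
import pandas as pd

df = pd.DataFrame({
    'close_date': pd.to_datetime(["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02",
                                  "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"]),
    'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"],
})

# Keep each seller's earliest transaction, then count sellers per day.
first_sales = (df.sort_values('close_date')
                 .drop_duplicates(subset='seller_name', keep='first'))
new_sellers_per_day = first_sales.groupby('close_date').size()
print(new_sellers_per_day)
```

This matches the rank-based result: 2 new sellers on 2012-08-01 and 1 on 2012-08-02.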

Example 2: Count of New Sellers By Agency Per Day

In example 1, I wanted the count of new sellers by day. In this example, I want the count of new sellers by agency per day.

Below, I add a new column called agency that records the agency for each home sale.

df['agency'] = ["Fifer", "Fifer", "Fifer", "HomeSales", "HomeSales", "Fifer", "Fifer", "Fifer"]

Below I print the output of df.

df
close_date seller_name rank_seller_by_close_date agency
0 2012-08-01 Lara 1.0 Fifer
1 2012-08-01 Julia 1.0 Fifer
2 2012-08-01 Julia 2.0 Fifer
3 2012-08-02 Emily 1.0 HomeSales
4 2012-08-03 Julia 3.0 HomeSales
5 2012-08-04 Lara 2.0 Fifer
6 2012-08-05 Julia 4.0 Fifer
7 2012-08-07 Julia 5.0 Fifer

I want to create rank values of home sales per seller within each agency, ordered by close date. This is valuable because it lets each agency understand its count of new sellers per day.

Below, I group by two fields, agency and then seller_name and find a rank value ordered by close_date. I store these rank values in a new column called first_rank_agency_seller_by_close_date.

df['first_rank_agency_seller_by_close_date'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='first')

Below I output df and highlight the important column first_rank_agency_seller_by_close_date. The order of ranks below is a bit confusing, so I'll re-order the records after this output.

df.style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['first_rank_agency_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date agency first_rank_agency_seller_by_close_date
0 2012-08-01 00:00:00 Lara 1 Fifer 1
1 2012-08-01 00:00:00 Julia 1 Fifer 1
2 2012-08-01 00:00:00 Julia 2 Fifer 2
3 2012-08-02 00:00:00 Emily 1 HomeSales 1
4 2012-08-03 00:00:00 Julia 3 HomeSales 1
5 2012-08-04 00:00:00 Lara 2 Fifer 2
6 2012-08-05 00:00:00 Julia 4 Fifer 3
7 2012-08-07 00:00:00 Julia 5 Fifer 4

The output below is easier to understand because I sort df by agency and then seller_name.

Below, notice how Fifer first helped Julia sell a home on 2012-08-01 and first helped Lara sell a home on 2012-08-01. Both of these rows have a first_rank_agency_seller_by_close_date value of 1.

Then, when Julia sold a home with HomeSales on 2012-08-03, that also has a first_rank_agency_seller_by_close_date value of 1 because that was her first home sold by date with that agency, HomeSales.

df.sort_values(by=['agency', 'seller_name']).style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['first_rank_agency_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date agency first_rank_agency_seller_by_close_date
1 2012-08-01 00:00:00 Julia 1 Fifer 1
2 2012-08-01 00:00:00 Julia 2 Fifer 2
6 2012-08-05 00:00:00 Julia 4 Fifer 3
7 2012-08-07 00:00:00 Julia 5 Fifer 4
0 2012-08-01 00:00:00 Lara 1 Fifer 1
5 2012-08-04 00:00:00 Lara 2 Fifer 2
3 2012-08-02 00:00:00 Emily 1 HomeSales 1
4 2012-08-03 00:00:00 Julia 3 HomeSales 1

I still want the count of new sellers with each agency per day. Below is a filter of df for records in which first_rank_agency_seller_by_close_date equals 1.

df[df['first_rank_agency_seller_by_close_date']==1]
close_date seller_name rank_seller_by_close_date agency first_rank_agency_seller_by_close_date
0 2012-08-01 Lara 1.0 Fifer 1.0
1 2012-08-01 Julia 1.0 Fifer 1.0
3 2012-08-02 Emily 1.0 HomeSales 1.0
4 2012-08-03 Julia 3.0 HomeSales 1.0

To get the final output, the count of new sellers with each agency per day, I take the filtered dataframe from above, group by agency and then close_date, and compute the group size.

Above I see two records of homes sold on 2012-08-01 by Lara and Julia with the agency Fifer. In the output from the group by operation below, I see an equivalent first record for Fifer on 2012-08-01 with count_new_sellers_of_homes equal to 2.

df[df['first_rank_agency_seller_by_close_date']==1].groupby(['agency', 'close_date']).size().reset_index().rename(columns={0: 'count_new_sellers_of_homes'})
agency close_date count_new_sellers_of_homes
0 Fifer 2012-08-01 2
1 HomeSales 2012-08-02 1
2 HomeSales 2012-08-03 1
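As a cross-check, the same per-agency counts can be derived without rank() by taking each (agency, seller) pair's earliest close date. This is a standalone sketch that rebuilds the toy data.

```python
import pandas as pd

df = pd.DataFrame({
    'close_date': pd.to_datetime(["2012-08-01", "2012-08-01", "2012-08-01", "2012-08-02",
                                  "2012-08-03", "2012-08-04", "2012-08-05", "2012-08-07"]),
    'seller_name': ["Lara", "Julia", "Julia", "Emily", "Julia", "Lara", "Julia", "Julia"],
    'agency': ["Fifer", "Fifer", "Fifer", "HomeSales", "HomeSales", "Fifer", "Fifer", "Fifer"],
})

# Earliest sale for each (agency, seller) pair...
first_sale = df.groupby(['agency', 'seller_name'], as_index=False)['close_date'].min()
# ...then count how many sellers debuted with each agency per day.
counts = (first_sale.groupby(['agency', 'close_date']).size()
                    .reset_index(name='count_new_sellers_of_homes'))
print(counts)
```

The result agrees with the rank-based output: Fifer has 2 new sellers on 2012-08-01, and HomeSales has 1 each on 2012-08-02 and 2012-08-03.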

Example 3: Pandas Rank method='min' Comparison

The pandas rank() method has an argument method that can be set to 'min'. This behaves differently from the method='first' setting used above.

The method='min' argument for the rank() method for pandas series is equivalent to the RANK() window function in SQL.

Notice how with method='min', in the column min_rank_agency_seller_by_close_date, Julia's two home sales on August 1, 2012 are both given a tied rank of 1.

Then, the min rank value skips a value of 2 and Julia's next sale on August 5, 2012 has a value of 3.

df['min_rank_agency_seller_by_close_date'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='min')
df.sort_values(by=['agency', 'seller_name']).style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['min_rank_agency_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date agency first_rank_agency_seller_by_close_date min_rank_agency_seller_by_close_date
1 2012-08-01 00:00:00 Julia 1 Fifer 1 1
2 2012-08-01 00:00:00 Julia 2 Fifer 2 1
6 2012-08-05 00:00:00 Julia 4 Fifer 3 3
7 2012-08-07 00:00:00 Julia 5 Fifer 4 4
0 2012-08-01 00:00:00 Lara 1 Fifer 1 1
5 2012-08-04 00:00:00 Lara 2 Fifer 2 2
3 2012-08-02 00:00:00 Emily 1 HomeSales 1 1
4 2012-08-03 00:00:00 Julia 3 HomeSales 1 1

The use case for method='first' versus method='min' varies on a case-by-case basis. It's important to understand your data well so you choose the correct one, though you can of course easily swap the argument's value.

Example 4: Pandas Rank method='dense' Comparison

The pandas rank() method has an argument method that can be set to 'dense'. This is different from method set to either 'first' or 'min'.

Notice how with method='dense', in the column dense_rank_agency_seller_by_close_date, Julia's two home sales on August 1, 2012 are both given a tied rank of 1.

Then, the dense rank value does not skip a value of 2, so Julia's next sale on August 5, 2012 has a value of 2. 'dense' is different from 'min'!

df['dense_rank_agency_seller_by_close_date'] = df.groupby(['agency', 'seller_name'])['close_date'].rank(method='dense')
df.sort_values(by=['agency', 'seller_name']).style.background_gradient(cmap=cm, subset=pd.IndexSlice[:, ['dense_rank_agency_seller_by_close_date']])

close_date seller_name rank_seller_by_close_date agency first_rank_agency_seller_by_close_date min_rank_agency_seller_by_close_date dense_rank_agency_seller_by_close_date
1 2012-08-01 00:00:00 Julia 1 Fifer 1 1 1
2 2012-08-01 00:00:00 Julia 2 Fifer 2 1 1
6 2012-08-05 00:00:00 Julia 4 Fifer 3 3 2
7 2012-08-07 00:00:00 Julia 5 Fifer 4 4 3
0 2012-08-01 00:00:00 Lara 1 Fifer 1 1 1
5 2012-08-04 00:00:00 Lara 2 Fifer 2 2 2
3 2012-08-02 00:00:00 Emily 1 HomeSales 1 1 1
4 2012-08-03 00:00:00 Julia 3 HomeSales 1 1 1

The use case for 'dense' varies on a case-by-case basis. A second reminder: please understand your data well and why you'd choose 'dense' over the other options.
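The three tie-handling strategies covered so far can be compared side by side. Below is a minimal standalone sketch on a tiny series with one tie.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])  # one tie at 20

comparison = pd.DataFrame({
    'value': s,
    'first': s.rank(method='first'),  # ROW_NUMBER(): ties broken by order of appearance
    'min':   s.rank(method='min'),    # RANK(): ties share a rank, next rank is skipped
    'dense': s.rank(method='dense'),  # DENSE_RANK(): ties share a rank, no gaps
})
print(comparison)
```

The tied 20s get ranks 2 and 3 with 'first', both 2 with 'min' (so 30 gets 4), and both 2 with 'dense' (so 30 gets 3).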

Example 5: Pandas Rank pct=True

The pandas series rank() method has another argument called pct that can be set to True or False; by default, it's False. With pct=True, rank() returns each value's rank as a fraction of the number of observations, i.e. a percentile-style rank.

Setting the argument pct=True is similar to the NTILE(100) window function in SQL.
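Before the larger example, here is a minimal standalone sketch of pct=True. By default it combines with method='average', so tied values share the mean of their rank positions, and every rank is divided by the number of observations.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])
# Average ranks are 1, 2.5, 2.5, 4; dividing by the 4 observations
# gives the fractional (percentile-style) ranks.
pct_rank = s.rank(pct=True)
print(pct_rank.tolist())  # [0.25, 0.625, 0.625, 1.0]
```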

Create Class Quiz Score Data

Below I create fictional data for 11 students in a class that took the same quiz. The max possible score is 100.

I remember in school, teachers told me I was in a certain percentile based on my GPA or score on a quiz. Therefore, I created a dataset I'll remember in relation to percentile rankings!

class_quiz_data = {'name': ["Jamie", "Joe", "Jake", "Jill", "James", "Dan", "Joy", "Jared", "Jill", "Jaime", "Jay"],
                   'score': [80, 85, 74, 100, 98, 91, 89, 90, 65, 84, 85]
                  }
df_quiz_scores = pd.DataFrame(class_quiz_data)

View df_quiz_scores as a dataframe.

df_quiz_scores
name score
0 Jamie 80
1 Joe 85
2 Jake 74
3 Jill 100
4 James 98
5 Dan 91
6 Joy 89
7 Jared 90
8 Jill 65
9 Jaime 84
10 Jay 85

Find the Percent Rank of Each Score in the Class

Below I create a new column called score_percent_rank that holds the percent rank of each value in the score column.

df_quiz_scores['score_percent_rank'] = df_quiz_scores['score'].rank(pct=True)

View df_quiz_scores as a dataframe below. Notice how a high quiz score by Dan of 91 has a rank value of 0.81. I think of this as the 81st percentile.

df_quiz_scores
name score score_percent_rank
0 Jamie 80 0.272727
1 Joe 85 0.500000
2 Jake 74 0.181818
3 Jill 100 1.000000
4 James 98 0.909091
5 Dan 91 0.818182
6 Joy 89 0.636364
7 Jared 90 0.727273
8 Jill 65 0.090909
9 Jaime 84 0.363636
10 Jay 85 0.500000

To make this rank easier to understand, I will multiply all these values by 100 and round to the nearest integer value.

df_quiz_scores['score_percentile_rank'] = (df_quiz_scores['score_percent_rank']*100).round()

View df_quiz_scores as a dataframe below. Now, Dan's score of 91 has a score_percentile_rank of 82.

df_quiz_scores
name score score_percent_rank score_percentile_rank
0 Jamie 80 0.272727 27.0
1 Joe 85 0.500000 50.0
2 Jake 74 0.181818 18.0
3 Jill 100 1.000000 100.0
4 James 98 0.909091 91.0
5 Dan 91 0.818182 82.0
6 Joy 89 0.636364 64.0
7 Jared 90 0.727273 73.0
8 Jill 65 0.090909 9.0
9 Jaime 84 0.363636 36.0
10 Jay 85 0.500000 50.0
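For NTILE(n) with a small n, such as quartiles or deciles, pd.qcut is a closer pandas analogue than rank(pct=True). Note this is an approximate sketch: qcut buckets by quantile boundaries, so its tie handling can differ slightly from SQL's strictly equal-sized NTILE buckets.

```python
import pandas as pd

scores = pd.Series([80, 85, 74, 100, 98, 91, 89, 90, 65, 84, 85])

# Roughly NTILE(4): bucket scores into quartiles,
# labeled 1 (lowest quartile) through 4 (highest).
quartile = pd.qcut(scores, 4, labels=False) + 1
print(quartile.tolist())
```

The top score of 100 lands in bucket 4 and the bottom score of 65 in bucket 1.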