Data Analysis Business Metrics Article

Calculations for Popular Business Metrics

On the job, I'm often asked to set up a measurement plan for a new app feature or quantify something that's happened in the past. There are several common calculations for metrics, including:

  • Sum: result of adding numbers
  • Count: total number of occurrences of something
  • Average: a number that represents the central tendency of a set of values, calculated as the sum of the values divided by the count of the values
  • Percentile: a value that represents a threshold for the percentage of data points less than the value
  • Ratio: a comparison of two numbers expressed as a single value
  • Probability: how likely something is to happen
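
As a rough illustration, the calculations in this list can be sketched in plain Python on a small made-up example. The values and variable names below are hypothetical and are not part of the session data used later in this post.

```python
from statistics import mean, median

# Hypothetical values: seconds spent on a page in five made-up visits
seconds_on_page = [10, 20, 30, 40, 100]
# Hypothetical flags: whether each of those visits ended in a click
clicked = [True, False, True, True, False]

total = sum(seconds_on_page)                    # sum
visits = len(seconds_on_page)                   # count
average = mean(seconds_on_page)                 # average = sum / count
fiftieth_percentile = median(seconds_on_page)   # percentile (the 50th)
clicks_per_visit = sum(clicked) / visits        # ratio of clicks to visits

# With no other information, the observed share of clicks is a simple
# estimate of the probability that a visit ends in a click.
estimated_click_probability = clicks_per_visit

print(total, visits, average, fiftieth_percentile, clicks_per_visit)
```

Note that the single large value (100) pulls the average above the 50th percentile, which is one reason to look at both.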

In this post, I'll walk through a scenario in which I am asked to calculate the success of a website homepage for a note-taking app. Success is a vague word and there are lots of ways we can characterize it. For each metric that uses a calculation from the list above, I'll explain how that metric relates to the overall business goal: get as many weekly active users as possible.

The homepage is just meant to educate people about the note-taking app and encourage them to sign up. There's a lot of content on the homepage about the features of the app and there's a single signup button to create an account and use the note-taking app.

A caveat: this is a very simple business example and this post is meant to introduce popular calculations for metrics. For a real-world project, I'd encourage you to understand your business problems, business goals and the pros and cons of each potential metric you choose.

Setup Code and Data

Import Modules

from datetime import datetime, timedelta
from random import choice, randint, seed
import numpy as np
import pandas as pd

Generate Data

I'll generate relevant session data we can use to compute all metrics above.

Below I create 17 sessions in which each session is simplified to a visitor landing on the homepage only and then either clicking the signup button or leaving the site.

For each session, there's a:

  • cookie id to represent a unique value for a visitor based on the visitor's browser
  • timestamp for when the visitor lands on the homepage
  • timestamp for when the visitor clicked the signup button, or a NaT ("not a time") value if the visitor never clicked the signup button

seed(6)
# start at the top of the current hour; each session start is offset from here
session_start_time = datetime.today().replace(microsecond=0,second=0,minute=0)
session_start_times = []
click_signup_button_times = []

cookie_ids_list = list(range(1, 15)) + list(range(1, 4))

for session in cookie_ids_list:

    random_seconds_between_session_start_times = randint(a=1, b=20)
    session_start_time = session_start_time + timedelta(seconds=random_seconds_between_session_start_times)

    random_seconds_to_signup = randint(a=4, b=100)
    click_signup_button_time = session_start_time + timedelta(seconds=random_seconds_to_signup)

    # randomly choose either the signup button time or pd.NaT to represent no button click
    click_signup_button_time = choice([click_signup_button_time, pd.NaT])

    session_start_times.append(session_start_time)
    click_signup_button_times.append(click_signup_button_time)

Create a Pandas dataframe from cookie_ids_list, session_start_times and click_signup_button_times.

data = {'cookie_id': cookie_ids_list,
        'session_start_time': session_start_times,
        'click_signup_button_time':  click_signup_button_times
       }
df_sessions = pd.DataFrame(data)

View entire dataset.

df_sessions
cookie_id session_start_time click_signup_button_time
0 1 2019-08-25 15:00:19 NaT
1 2 2019-08-25 15:00:28 2019-08-25 15:00:36
2 3 2019-08-25 15:00:33 NaT
3 4 2019-08-25 15:00:45 2019-08-25 15:01:29
4 5 2019-08-25 15:00:54 2019-08-25 15:02:00
5 6 2019-08-25 15:01:08 2019-08-25 15:02:20
6 7 2019-08-25 15:01:15 NaT
7 8 2019-08-25 15:01:35 2019-08-25 15:03:06
8 9 2019-08-25 15:01:49 2019-08-25 15:02:35
9 10 2019-08-25 15:02:01 NaT
10 11 2019-08-25 15:02:16 2019-08-25 15:03:49
11 12 2019-08-25 15:02:23 NaT
12 13 2019-08-25 15:02:27 2019-08-25 15:02:36
13 14 2019-08-25 15:02:39 2019-08-25 15:03:45
14 1 2019-08-25 15:02:56 2019-08-25 15:04:13
15 2 2019-08-25 15:03:08 NaT
16 3 2019-08-25 15:03:18 2019-08-25 15:04:07

Sum Metric

The sum is the result of adding numbers. Given the dataset, I don't think there's a single sum metric that's relevant to the business's goals. However, the sum is a calculation necessary to know the average time a visitor spends on the page before clicking the signup button. We'll revisit this calculation later in the Average Metric section.
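
To make the link between the sum and the average concrete, here is a small plain-Python sketch. It assumes the 11 durations below, which are read by hand from the sessions with a signup click in the dataset output above and converted to seconds. The variable names are illustrative only.

```python
from datetime import timedelta

# Seconds between landing on the homepage and clicking signup, read off the
# dataset output (sessions without a click are left out)
seconds_until_signup = [8, 44, 66, 72, 91, 46, 93, 9, 66, 77, 49]

total_seconds = sum(seconds_until_signup)      # the sum
signup_sessions = len(seconds_until_signup)    # the count
average_seconds = total_seconds / signup_sessions

print(timedelta(seconds=total_seconds))  # total time across signup sessions
print(round(average_seconds, 2))         # average seconds until signup
```

The sum on its own (a bit over ten minutes) says little about the homepage, but dividing it by the count of signup sessions gives the average of roughly 56 seconds.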

Count Metric

A count is the total number of occurrences of something. A relevant count metric for our business goals is the count of sessions. My hypothesis is that more sessions on the homepage will result in more signups, which will likely result in more weekly active users. Therefore, the business wants more sessions per day.

I can use the Python len() function to count the number of sessions in the dataset.

count_sessions = len(df_sessions)
count_sessions
17

There were 17 sessions.

Another relevant count metric is the count of unique visitors to the site in a day. I'll assume each cookie on a visitor's browser was set properly and corresponds to exactly one user.

I can count the unique values in the cookie_id field by using the Pandas series nunique() method.

df_sessions['cookie_id'].nunique()
14

There were 14 distinct visitors to the homepage.
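
The gap between 17 sessions and 14 unique visitors comes from repeat visits. A minimal sketch with the standard library's Counter, assuming the same cookie ids as cookie_ids_list in the setup code (the other names are illustrative), shows which visitors came back:

```python
from collections import Counter

# Same cookie ids as cookie_ids_list in the setup code
cookie_ids = list(range(1, 15)) + list(range(1, 4))

sessions_per_cookie = Counter(cookie_ids)   # cookie id -> number of sessions
n_sessions = len(cookie_ids)
n_unique_visitors = len(sessions_per_cookie)

# Visitors with more than one session
repeat_visitors = sorted(
    cookie_id for cookie_id, n in sessions_per_cookie.items() if n > 1
)

print(n_sessions, n_unique_visitors, repeat_visitors)
```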

Average Metric

An average is the "central" value of a set of numbers. A relevant average metric that aligns with the business goals is the average time it takes somebody to sign up after landing on the homepage. This average metric gives us a baseline for how much consideration it takes someone to sign up.

I need to create a new column that's the time duration for each user to sign up after landing on the homepage. I will subtract session_start_time from click_signup_button_time.

df_sessions['time_duration_until_signup'] = df_sessions['click_signup_button_time'] - df_sessions['session_start_time']

Preview the first few sessions.

df_sessions.head()
cookie_id session_start_time click_signup_button_time time_duration_until_signup
0 1 2019-08-25 15:00:19 NaT NaT
1 2 2019-08-25 15:00:28 2019-08-25 15:00:36 00:00:08
2 3 2019-08-25 15:00:33 NaT NaT
3 4 2019-08-25 15:00:45 2019-08-25 15:01:29 00:00:44
4 5 2019-08-25 15:00:54 2019-08-25 15:02:00 00:01:06

I can use the Pandas mean() method to calculate the average of the time_duration_until_signup values.

df_sessions['time_duration_until_signup'].mean()
Timedelta('0 days 00:00:56.454545')

For users that do sign up, the average time it takes someone to click the signup button after landing on the homepage is approximately 56 seconds.

Percentile

A percentile represents a threshold in which there's a percentage of data points less than that value. A frequently used percentile value is the 50th percentile which represents the median. This means there are approximately 50% of data points smaller than the median value.

The median can be especially helpful for understanding the central tendency of a set of values when the values are skewed. You can learn more in my skewness article.

I can calculate the median time_duration_until_signup value using the median method.

df_sessions['time_duration_until_signup'].median()
Timedelta('0 days 00:01:06')

The median is approximately 1 minute and 6 seconds. In approximately half of sessions with signups, visitors spent less than 1 minute and 6 seconds.
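
Other percentiles follow the same idea. As a hedged sketch, the quartiles (25th, 50th and 75th percentiles) can be computed in plain Python with statistics.quantiles. The durations are again read by hand from the dataset output above, in seconds, and the choice of method="inclusive" is an assumption made for this illustration.

```python
from statistics import median, quantiles

# Seconds until signup for the 11 sessions with a signup click
seconds_until_signup = [8, 44, 66, 72, 91, 46, 93, 9, 66, 77, 49]

# method="inclusive" interpolates linearly between the sorted data points
q1, q2, q3 = quantiles(seconds_until_signup, n=4, method="inclusive")

print(q1, q2, q3)  # 25th, 50th and 75th percentiles in seconds
```

The middle value matches the median of 1 minute and 6 seconds. In pandas, the analogous call on the time_duration_until_signup column would be the series quantile() method.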

Ratio

A ratio is a comparison of two numbers expressed as a single value. One ratio that's critical to the business's goal is the proportion of homepage visits that end with a click on the signup button. This rate is commonly called the clickthrough rate. The higher this rate, the more signups and likely more weekly active users down the road.

Let's first count the number of sessions that have a click of the signup button. I'll count the number of non-null values in the click_signup_button_time field using the Pandas series notnull() method.

signup_button_clicks = len(df_sessions[df_sessions['click_signup_button_time'].notnull()])
signup_button_clicks
11
ratio_clicks_to_session = round(signup_button_clicks/count_sessions, 2)
ratio_clicks_to_session
0.65

A ratio of 0.65 means about 65% of sessions resulted in a signup button click. The higher this ratio, the greater the share of sessions we convert to signups, which would help expand our pool of potential weekly active users.

Probability

The cookie_id should identify each visitor to the site based on their browser. One interesting observation is that the visitors with cookie_id 1 and 3 each visited the homepage and didn't sign up, but later revisited the homepage and did sign up.

I'm curious about the click through probability. This is a metric to define if a unique visitor signed up. The calculation is the count of unique visitors (represented by the cookie_id) who clicked the button divided by the count of unique visitors.

df_sessions.head()
cookie_id session_start_time click_signup_button_time time_duration_until_signup
0 1 2019-08-25 15:00:19 NaT NaT
1 2 2019-08-25 15:00:28 2019-08-25 15:00:36 00:00:08
2 3 2019-08-25 15:00:33 NaT NaT
3 4 2019-08-25 15:00:45 2019-08-25 15:01:29 00:00:44
4 5 2019-08-25 15:00:54 2019-08-25 15:02:00 00:01:06
count_unique_visitors_who_signed_up = df_sessions[df_sessions['click_signup_button_time'].notnull()]['cookie_id'].nunique()
count_unique_visitors_who_signed_up
11
count_unique_visitors = df_sessions['cookie_id'].nunique()
count_unique_visitors
14
click_through_probability = round(count_unique_visitors_who_signed_up/count_unique_visitors, 2)
click_through_probability
0.79

The click through probability is 0.79 which means 79% of unique visitors to the homepage created an account. The higher this probability, the greater % of visitors we can convert to signup and this would help increase our pool of more potential weekly active users.
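
To see how the session-level clickthrough rate and the visitor-level click through probability differ, here is a plain-Python sketch of both. It assumes the (cookie_id, clicked) pairs below, which are read by hand from the 17 rows of the dataset output above. The names are illustrative.

```python
# (cookie_id, clicked_signup) for each of the 17 sessions, in dataset order
sessions = [
    (1, False), (2, True), (3, False), (4, True), (5, True), (6, True),
    (7, False), (8, True), (9, True), (10, False), (11, True), (12, False),
    (13, True), (14, True), (1, True), (2, False), (3, True),
]

# Session level: every session counts once
sessions_with_click = sum(clicked for _, clicked in sessions)
click_through_rate = round(sessions_with_click / len(sessions), 2)

# Visitor level: every cookie id counts once, however many sessions it has
visitors = {cookie_id for cookie_id, _ in sessions}
visitors_who_clicked = {cookie_id for cookie_id, clicked in sessions if clicked}
click_through_probability = round(len(visitors_who_clicked) / len(visitors), 2)

print(click_through_rate, click_through_probability)
```

Both numerators happen to be 11 here because each repeat visitor clicked in exactly one of their sessions. Only the denominator changes, from 17 sessions to 14 unique visitors.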