Popular Summary Business Metrics
Date published: 2019-08-26
Category: Data Analysis
Subcategory: Business Metrics
Tags: business metrics, python, pandas
On the job, I'm often asked to set up a measurement plan for a new app feature or quantify something that's happened in the past. There are several common calculations for metrics, including:
- Sum: result of adding numbers
- Count: total number of occurrences of something
- Average: a number that describes the central tendency of a set of values, calculated as the sum of the values divided by the count of values
- Percentile: a value below which a given percentage of data points fall
- Ratio: a comparison of two numbers expressed as a single value
- Probability: how likely something is to happen
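As a rough illustration of these six calculations, here's a minimal sketch on a small made-up set of values. The numbers and variable names are hypothetical and aren't part of the dataset used later in this post.

```python
import pandas as pd

# Hypothetical values for illustration only, e.g. seconds spent on a page
values = pd.Series([10, 20, 30, 40, 100])

total = values.sum()              # Sum: 200
count = values.count()            # Count: 5
average = total / count           # Average: 40.0
median = values.quantile(0.5)     # Percentile (50th): 30.0

# Ratio: how many values are over 25 compared to how many are not
ratio_over_to_under = (values > 25).sum() / (values <= 25).sum()   # 3 / 2 = 1.5

# Probability (empirical estimate): chance a randomly picked value is over 25
probability_over_25 = (values > 25).mean()                         # 3 / 5 = 0.6
```

Note how the average is far from the median here because of the single large value of 100.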
In this post, I'll walk through a scenario in which I'm asked to measure the success of the homepage for a note-taking app. Success is a vague word, and there are lots of ways we can characterize it. For each metric, which uses a calculation from the list above, I'll explain how it relates to the overall business goal: get as many weekly active users as possible.
The homepage is just meant to educate people about the note-taking app and encourage them to sign up. There's a lot of content on the homepage about the features of the app and there's a single signup button to create an account and use the note-taking app.
A caveat: this is a very simple business example and this post is meant to introduce popular summary business metrics. For a real-world project, I'd encourage you to understand your business problems, your business goals and the pros and cons of each potential metric you choose. Here's a great article from First Round Capital that goes into more detail on evaluating whether a given type of metric is a practical indicator of your business' success.
Setup Code and Data
Import Modules
from datetime import datetime, timedelta
from random import choice, randint, seed
import numpy as np
import pandas as pd
Generate Data
I'll generate relevant session data we can use to compute all metrics above.
Below I create 17 sessions in which each session is simplified to a visitor landing on the homepage only and then either clicking the signup button or leaving the site.
For each session, there's a:
- cookie id to represent a unique value for a visitor based on the visitor's browser
- timestamp for when the visitor lands on the homepage
- timestamp for when the visitor clicked the signup button, or a NaT (not a time) value meaning the visitor never clicked the signup button
seed(6)
# start from the top of the current hour
session_start_time = datetime.today().replace(microsecond=0, second=0, minute=0)
session_start_times = []
click_signup_button_times = []
# 14 unique visitors, the first 3 of whom come back for a second session (17 sessions)
cookie_ids_list = list(range(1, 15)) + list(range(1, 4))
for session in cookie_ids_list:
    # each session starts 1 to 20 seconds after the previous one
    random_seconds_between_session_start_times = randint(a=1, b=20)
    session_start_time = session_start_time + timedelta(seconds=random_seconds_between_session_start_times)
    random_seconds_to_signup = randint(a=4, b=100)
    click_signup_button_time = session_start_time + timedelta(seconds=random_seconds_to_signup)
    # randomly choose either the signup button time or np.nan to be no button click
    click_signup_button_time = choice([click_signup_button_time, np.nan])
    session_start_times.append(session_start_time)
    click_signup_button_times.append(click_signup_button_time)
Create a Pandas dataframe from cookie_ids_list, session_start_times and click_signup_button_times.
data = {
    'cookie_id': cookie_ids_list,
    'session_start_time': session_start_times,
    'click_signup_button_time': click_signup_button_times,
}
df_sessions = pd.DataFrame(data)
View entire dataset.
df_sessions
index | cookie_id | session_start_time | click_signup_button_time |
---|---|---|---|
0 | 1 | 2019-08-25 15:00:19 | NaT |
1 | 2 | 2019-08-25 15:00:28 | 2019-08-25 15:00:36 |
2 | 3 | 2019-08-25 15:00:33 | NaT |
3 | 4 | 2019-08-25 15:00:45 | 2019-08-25 15:01:29 |
4 | 5 | 2019-08-25 15:00:54 | 2019-08-25 15:02:00 |
5 | 6 | 2019-08-25 15:01:08 | 2019-08-25 15:02:20 |
6 | 7 | 2019-08-25 15:01:15 | NaT |
7 | 8 | 2019-08-25 15:01:35 | 2019-08-25 15:03:06 |
8 | 9 | 2019-08-25 15:01:49 | 2019-08-25 15:02:35 |
9 | 10 | 2019-08-25 15:02:01 | NaT |
10 | 11 | 2019-08-25 15:02:16 | 2019-08-25 15:03:49 |
11 | 12 | 2019-08-25 15:02:23 | NaT |
12 | 13 | 2019-08-25 15:02:27 | 2019-08-25 15:02:36 |
13 | 14 | 2019-08-25 15:02:39 | 2019-08-25 15:03:45 |
14 | 1 | 2019-08-25 15:02:56 | 2019-08-25 15:04:13 |
15 | 2 | 2019-08-25 15:03:08 | NaT |
16 | 3 | 2019-08-25 15:03:18 | 2019-08-25 15:04:07 |
Sum Metric
The sum is the result of adding things. Given the dataset, I don't think there's a single sum metric that's relevant to the business' goals. However, the sum is a calculation necessary to know the average time a visitor spends on the page before clicking the signup button. We'll revisit this calculation later in the average section.
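To make the link between the sum and the average concrete, here's a minimal sketch. It assumes the durations below, which I read off the dataset table above as the number of seconds between landing and clicking the signup button for the 11 sessions with a click. The variable names are just for illustration.

```python
import pandas as pd

# Seconds from session start to signup click, one value per session with a click
seconds_until_signup = pd.Series([8, 44, 66, 72, 91, 46, 93, 9, 66, 77, 49])

total_seconds = seconds_until_signup.sum()     # sum of all durations
count_signups = seconds_until_signup.count()   # number of sessions with a click
average_seconds = total_seconds / count_signups

print(total_seconds, count_signups, round(average_seconds, 2))
# prints: 621 11 56.45
```

The sum on its own (621 seconds) isn't very meaningful, but dividing it by the count gives the average of roughly 56 seconds.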
Count Metric
A count is the total number of occurrences of something. A relevant count metric for our business goals is the count of sessions. My hypothesis is that more sessions on the homepage will result in more signups, which will likely result in more weekly active users. Therefore, the business wants more sessions per day.
I can use the Python len() function to count the number of sessions in the dataset.
count_sessions = len(df_sessions)
count_sessions
17
There were 17 sessions.
Another relevant count metric is the count of unique visitors to the site in a day. I'll assume each cookie was set properly on a visitor's browser and corresponds to exactly one user.
I can count the unique values in the cookie_id field by using the Pandas Series nunique() method.
df_sessions['cookie_id'].nunique()
14
There were 14 distinct visitors to the homepage.
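The gap between 17 sessions and 14 distinct visitors comes from repeat visits. As a small sketch of how to see that, the Pandas Series value_counts() method counts sessions per cookie_id. The Series below rebuilds the cookie ids from the data generation step so the snippet runs on its own.

```python
import pandas as pd

# cookie_id for each session, mirroring cookie_ids_list from the data generation step
cookie_ids = pd.Series(list(range(1, 15)) + list(range(1, 4)), name="cookie_id")

# number of sessions for each visitor
sessions_per_visitor = cookie_ids.value_counts()

# visitors with more than one session
repeat_visitors = sorted(sessions_per_visitor[sessions_per_visitor > 1].index.tolist())
repeat_visitors  # [1, 2, 3]
```

Three visitors with two sessions each account for the three extra sessions.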
Average Metric
An average is the "central" value from a set of numbers. A relevant average metric that aligns with the business goals is the average time it takes somebody to sign up after first landing on the homepage. This average metric gives us a baseline for how much consideration it takes someone to sign up.
I need to create a new column that's the time duration for each user to sign up after landing on the homepage. I will subtract session_start_time from click_signup_button_time.
df_sessions['time_duration_until_signup'] = df_sessions['click_signup_button_time'] - df_sessions['session_start_time']
Preview the first few sessions.
df_sessions.head()
index | cookie_id | session_start_time | click_signup_button_time | time_duration_until_signup |
---|---|---|---|---|
0 | 1 | 2019-08-25 15:00:19 | NaT | NaT |
1 | 2 | 2019-08-25 15:00:28 | 2019-08-25 15:00:36 | 00:00:08 |
2 | 3 | 2019-08-25 15:00:33 | NaT | NaT |
3 | 4 | 2019-08-25 15:00:45 | 2019-08-25 15:01:29 | 00:00:44 |
4 | 5 | 2019-08-25 15:00:54 | 2019-08-25 15:02:00 | 00:01:06 |
I can use the Pandas mean() method to calculate the average of the time_duration_until_signup values. By default, mean() skips the NaT values, so only sessions with a signup are included.
df_sessions['time_duration_until_signup'].mean()
Timedelta('0 days 00:00:56.454545')
For users that do sign up, the average time it takes someone to click the signup button after landing on the homepage is approximately 56 seconds.
Percentile
A percentile is a threshold below which a given percentage of data points fall. A frequently used percentile is the 50th percentile, which represents the median. This means approximately 50% of data points are smaller than the median value.
The median can be especially helpful for understanding the central tendency of a set of values when the values are skewed. You can learn more in my skewness article.
I can calculate the median time_duration_until_signup value using the median() method.
df_sessions['time_duration_until_signup'].median()
Timedelta('0 days 00:01:06')
The median is approximately 1 minute and 6 seconds. In approximately half of sessions with signups, visitors spent less than 1 minute and 6 seconds.
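The median is just one percentile. As a minimal sketch of other percentiles, the Pandas Series quantile() method accepts a list of fractions. To keep the snippet self-contained, I assume the signup durations in whole seconds as read off the dataset table, rather than the Timedelta column itself.

```python
import pandas as pd

# Seconds from session start to signup click for the 11 sessions with a click
seconds_until_signup = pd.Series([8, 44, 66, 72, 91, 46, 93, 9, 66, 77, 49])

# 25th, 50th (median) and 75th percentiles, using the default linear interpolation
percentiles = seconds_until_signup.quantile([0.25, 0.5, 0.75])
percentiles
# 0.25    45.0
# 0.50    66.0
# 0.75    74.5
```

The 50th percentile of 66 seconds matches the median of 1 minute and 6 seconds above.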
Ratio
A ratio is a comparison of two numbers expressed as a single value. One ratio that's critical to the business' goal is the proportion of homepage visits that end with a click on the signup button. This rate is commonly called the clickthrough rate. The higher this rate, the more signups and likely more weekly active users down the road.
Let's first count the number of sessions that have a click of the signup button. I'll count the number of non-null values in the click_signup_button_time field using the Pandas Series notnull() method.
signup_button_clicks = len(df_sessions[df_sessions['click_signup_button_time'].notnull()])
signup_button_clicks
11
ratio_clicks_to_session = round(signup_button_clicks/count_sessions, 2)
ratio_clicks_to_session
0.65
65% of sessions resulted in a signup button click. The higher this ratio, the greater the percentage of sessions we convert to signups, which would help expand our pool of potential weekly active users.
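As an alternative sketch of the same ratio, the mean of a boolean Series is the share of True values, so the clickthrough rate can be computed in one step. The clicked flags below are an assumption on my part, transcribed from the dataset table in row order. On the real dataframe, the equivalent would be df_sessions['click_signup_button_time'].notna().mean().

```python
import pandas as pd

# True where the session had a signup button click, in the same row order as the table
clicked = pd.Series([
    False, True, False, True, True, True, False, True, True,
    False, True, False, True, True, True, False, True,
])

# mean of booleans = count of True values / count of all values
clickthrough_rate = clicked.mean()
round(clickthrough_rate, 2)  # 0.65
```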
Probability
The cookie_id should identify each visitor to the site based on their browser. One interesting observation is that the visitors with cookie_id 1 and 3 each visited the homepage and didn't sign up, but later revisited the homepage and did sign up.
I'm curious about the click-through probability. This metric describes how likely a unique visitor is to sign up. The calculation is the count of unique visitors (represented by the cookie_id) who clicked the button divided by the count of unique visitors.
df_sessions.head()
index | cookie_id | session_start_time | click_signup_button_time | time_duration_until_signup |
---|---|---|---|---|
0 | 1 | 2019-08-25 15:00:19 | NaT | NaT |
1 | 2 | 2019-08-25 15:00:28 | 2019-08-25 15:00:36 | 00:00:08 |
2 | 3 | 2019-08-25 15:00:33 | NaT | NaT |
3 | 4 | 2019-08-25 15:00:45 | 2019-08-25 15:01:29 | 00:00:44 |
4 | 5 | 2019-08-25 15:00:54 | 2019-08-25 15:02:00 | 00:01:06 |
count_unique_visitors_who_signed_up = df_sessions[df_sessions['click_signup_button_time'].notnull()]['cookie_id'].nunique()
count_unique_visitors_who_signed_up
11
count_unique_visitors = df_sessions['cookie_id'].nunique()
count_unique_visitors
14
click_through_probability = round(count_unique_visitors_who_signed_up/count_unique_visitors, 2)
click_through_probability
0.79
The click-through probability is 0.79, which means 79% of unique visitors to the homepage created an account. The higher this probability, the greater the percentage of visitors we convert to signups, which would help increase our pool of potential weekly active users.
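Another way to reach the same number is to collapse sessions down to one row per visitor first. Here's a minimal sketch, assuming a small dataframe I rebuilt from the dataset table with a hypothetical clicked column (True when the session had a signup button click).

```python
import pandas as pd

# One row per session; "clicked" is a hypothetical helper column for this sketch
sessions = pd.DataFrame({
    "cookie_id": list(range(1, 15)) + list(range(1, 4)),
    "clicked": [
        False, True, False, True, True, True, False, True, True,
        False, True, False, True, True, True, False, True,
    ],
})

# A visitor counts as signed up if any of their sessions had a click
signed_up_by_visitor = sessions.groupby("cookie_id")["clicked"].any()

# Share of unique visitors who signed up
click_through_probability = signed_up_by_visitor.mean()
round(click_through_probability, 2)  # 0.79
```

Grouping by cookie_id makes the unit of analysis explicit: visitors 1, 2 and 3 each have one session with a click and one without, yet each counts once as a visitor who signed up.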