value_counts() Method: Count Unique Occurrences of Values in a Column¶
Date published: 2018-11-25
Category: Data Analysis
Subcategory: Data Wrangling
Tags: python, pandas
In pandas, for a column in a DataFrame, we can use the value_counts() method to easily count the unique occurences of values.
There's additional interesting analyis we can do with value_counts()
too. We'll try them out using the titanic dataset.
Import Module¶
import pandas as pd
import seaborn as sns
Get Titanic Dataset¶
We'll use the titanic dataset included in the seaborn library.
df = sns.load_dataset('titanic')
Below is a preview of the first few rows of the dataset.
Each row includes details of a person who boarded the famous Titanic cruise ship.
In this tutorial, we're just going to utilize the sex
and fare
columns. The sex
column classifies the person's gender as male or female. The fare
column indicates the dollar amount each person paid to board the Titanic.
df.head()
survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S | First | woman | False | C | Southampton | yes | False |
4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S | Third | man | True | NaN | Southampton | no | True |
Fun with value_counts()
¶
Here is the simple use of value_counts()
we call on the sex
column that returns us the count of occurences of each of the unique values in this column.
df['sex'].value_counts()
male 577 female 314 Name: sex, dtype: int64
Now, we want to do the same operation, but this time sort our outputted values in the sex
column, male and female, so that values that start with the letter a appear at the top and values that start with letter z appear at the bottom. This is considered ascending order.
f is before m in the alphabet so we see female before male.
In our value_counts
method, we'll set the argument ascending
to True
.
df['sex'].value_counts(ascending=True)
female 314 male 577 Name: sex, dtype: int64
Often times, we want to know what percentage of the whole is for each value that appears in the column. For example, if we took the two counts above, 577 and 314 and we sum them up, we'd get 891. So, what percentage of people on the titanic were male. The calculation is 577/891 x 100 = 64.75%.
To calculate this in pandas with the value_counts()
method, set the argument normalize
to True
.
df['sex'].value_counts(normalize=True)
male 0.647587 female 0.352413 Name: sex, dtype: float64
Before we try a new value_counts()
argument, let's take a look at some basic descriptive statistics of the fare
column. To accomplish this, we'll call the describe()
method on the column.
There's 891 values of fare data, a mean of 32 and a standard deviation of 49 which indicates a fairly wide spread of data.
df['fare'].describe()
count 891.000000 mean 32.204208 std 49.693429 min 0.000000 25% 7.910400 50% 14.454200 75% 31.000000 max 512.329200 Name: fare, dtype: float64
Another interesting feature of the value_counts()
method is that it can be used to bin continuous data into discrete intervals. We set the argument bins
to an integer representing the number of bins to create.
For each bin, the range of fare amounts in dollar values is the same. One contains fares from 73.19 to 146.38 which is a range of 73.19. Another bin contains fares from 146.38 to 73.19 which is also a range of 73.19. See how the ranges are same! However, inside each range of fare values can contain a different count of the number of tickets bought by passengers of the Titanic.
We can see most people paid under 73.19 for their ticket.
df['fare'].value_counts(bins=7)
(-0.513, 73.19] 789 (73.19, 146.38] 71 (146.38, 219.57] 15 (219.57, 292.76] 13 (439.139, 512.329] 3 (365.949, 439.139] 0 (292.76, 365.949] 0 Name: fare, dtype: int64