Categorical Data

Date published: 2018-11-22

Category: Data Analysis

Subcategory: Data Wrangling

Tags: categorical data, python, pandas

What is Categorical Data

A categorical field takes its value from a limited, fixed set of possible values. Some examples of fields and values are:

Field	Potential Values
Blood type	O negative, O positive, A negative, B negative
Customer responses on satisfaction of a product	happy, content, sad
Eye color	green, blue, brown

There are two common types of categorical data: nominal and ordinal.

Nominal categorical data has values with no inherent order such as the eye color example above.

Ordinal categorical data contains values with an intended order. One example is the customer responses above. The values have an inherent order: happy is a more positive measurement than content, which in turn is more positive than sad. In my list of potential values, I ordered the responses from most favorable toward the product to least favorable.

Categorical Data in Pandas

In most respects, a categorical column in pandas behaves much like a column of strings or numerical values. With ordinal (ordered) categorical data, however, a few small differences affect my typical workflow: sorting, and calculating the minimum and maximum values in a column.
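As a minimal sketch of how the nominal and ordinal distinction shows up in pandas, the snippet below builds one unordered and one ordered categorical. The variable names eye_colors and satisfaction are illustrative and not part of the original notebook.

```python
import pandas as pd

# Nominal: eye colors have no inherent rank, so the categorical stays unordered.
eye_colors = pd.Categorical(["green", "blue", "brown", "blue"])

# Ordinal: satisfaction levels are ranked. The rank is given by the
# categories argument and switched on with ordered=True.
satisfaction = pd.Categorical(
    ["happy", "sad", "content"],
    categories=["happy", "content", "sad"],
    ordered=True,
)

print(eye_colors.ordered)
print(satisfaction.ordered)
```

The ordered flag is what later determines whether sorting and min()/max() follow a declared rank.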

Import Modules

In [2]:

import pandas as pd

Create Survey Responses Data

Create a Python list of survey responses that are either happy, content, or sad.

In [3]:

responses = ["happy", "happy", "content", "content", "content", "content", "happy", "content", "sad", "sad", "sad", "sad", "sad", "sad"]

Create a pandas Categorical from these responses. Setting the ordered argument to True declares that the categories are ranked, and the categories argument supplies that rank: happy, then content, then sad.

In [13]:

survey_responses = pd.Categorical(responses, categories=["happy", "content", "sad"], ordered=True)
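To see what the categories and ordered arguments produce, it can help to inspect the categorical's categories and codes attributes. This is a small sketch on a shortened, made-up list of responses; the name survey is illustrative.

```python
import pandas as pd

# Shortened, hypothetical sample of responses for illustration.
sample = ["happy", "content", "sad", "content"]
survey = pd.Categorical(sample, categories=["happy", "content", "sad"], ordered=True)

# The declared labels, in rank order.
print(survey.categories)

# Each value is stored as the integer position of its label in categories:
# happy -> 0, content -> 1, sad -> 2.
print(survey.codes)
```

Sorting and comparisons on an ordered categorical work on these integer positions rather than on the label text.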

View the data type of survey_responses.

In [14]:

type(survey_responses)

Out[14]:

pandas.core.categorical.Categorical

Create a pandas DataFrame with one column called response with the survey_responses data structure.

In [15]:

df_survey_responses = pd.DataFrame({"response": survey_responses})
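If the responses already sit in a DataFrame as plain strings, an alternative is to convert the column in place with astype and a CategoricalDtype. This is a sketch of that route, not the approach used in this notebook; df and response_type are illustrative names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical DataFrame whose response column starts out as plain strings.
df = pd.DataFrame({"response": ["happy", "sad", "content", "sad"]})

# Describe the ordered categories once, then cast the column to that dtype.
response_type = CategoricalDtype(categories=["happy", "content", "sad"], ordered=True)
df["response"] = df["response"].astype(response_type)

print(df["response"].dtype)
```

Any value not listed in categories would become NaN after the cast, so it is worth checking the category list against the data first.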

Analyze Survey Responses Data

Preview the first 5 rows of df_survey_responses.

In [16]:

df_survey_responses.head()

Out[16]:

	response
0	happy
1	happy
2	content
3	content
4	content

Descriptive Statistics

Use the describe() method on a pandas DataFrame to get summary statistics for its columns; you can also call the method directly on a Series. We'll call it on the DataFrame below.

count shows the number of responses
unique shows the number of unique categorical values
top shows the most frequently occurring categorical value
freq shows the count of that most frequent categorical value

In [17]:

df_survey_responses.describe()

Out[17]:

	response
count	14
unique	3
top	sad
freq	6

Sorting

Sort the response column in ascending order. The rows follow the declared category order rather than alphabetical order, so happy appears at the top and sad at the bottom.

In [18]:

df_survey_responses.sort_values(by='response').head(10)

Out[18]:

	response
0	happy
1	happy
6	happy
2	content
3	content
4	content
5	content
7	content
8	sad
9	sad
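To make the difference from ordinary string sorting concrete, here is a small sketch that sorts the same three values once as plain strings and once as an ordered categorical. The names raw, as_strings, and as_ordinal are illustrative.

```python
import pandas as pd

raw = ["sad", "happy", "content"]

# Plain strings sort alphabetically.
as_strings = pd.Series(raw)

# The ordered categorical sorts by the declared rank instead.
as_ordinal = pd.Series(
    pd.Categorical(raw, categories=["happy", "content", "sad"], ordered=True)
)

print(as_strings.sort_values().tolist())  # ['content', 'happy', 'sad']
print(as_ordinal.sort_values().tolist())  # ['happy', 'content', 'sad']
```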

Count of Unique Occurrences of Survey Responses

Call the value_counts() method on the response column to count the occurrences of each categorical response. Notice that sad was mentioned the most and happy the least.

In [19]:

df_survey_responses['response'].value_counts()

Out[19]:

sad        6
content    5
happy      3
Name: response, dtype: int64
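By default value_counts() orders its result by frequency. If the counts are easier to read in the declared category order, one option is to sort the result by its index. A minimal sketch, rebuilding the same 14 responses under illustrative names:

```python
import pandas as pd

# Same counts as the survey data: 3 happy, 5 content, 6 sad.
values = ["happy"] * 3 + ["content"] * 5 + ["sad"] * 6
response = pd.Series(
    pd.Categorical(values, categories=["happy", "content", "sad"], ordered=True)
)

# The index of the counts is categorical, so sort_index() follows the
# declared category order rather than alphabetical order.
counts_by_rank = response.value_counts().sort_index()
print(counts_by_rank)
```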

Calculate the Minimum Value in the `response` Column

The result of a pandas Series min() method may be different from what you expect. For an ordered categorical column, min() returns the category that comes first in the declared order, not the least frequent one. We're returned happy because it is first in the order happy, content, sad. It also happens to be the least frequent value here, with only 3 responses, but that is a coincidence.

In [20]:

df_survey_responses['response'].min()

Out[20]:

'happy'
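A quick way to check how min() and max() behave on an ordered categorical is to try them on data where frequency and category order disagree. In this sketch, with made-up responses, content is the rarest value and happy the most common.

```python
import pandas as pd

# Hypothetical responses: happy x3, content x1, sad x2.
values = ["happy", "happy", "happy", "content", "sad", "sad"]
ranked = pd.Series(
    pd.Categorical(values, categories=["happy", "content", "sad"], ordered=True)
)

print(ranked.value_counts())  # frequencies, for comparison

# min() and max() follow the declared category order, not the counts.
print(ranked.min())  # happy
print(ranked.max())  # sad
```

Here min() returns happy even though it is the most frequent value, which shows that the result depends only on the declared order.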

Calculate the Maximum Value in the `response` Column

Call the max() method on the response column and we're returned sad, the last category in the declared order. It happens to be the most frequent value here as well, but max() does not look at frequency.

In [21]:

df_survey_responses['response'].max()

Out[21]:

'sad'
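If the goal is the most or least frequently occurring response rather than the maximum or minimum in category order, one option is to work from the counts. A sketch using the same 14 responses under illustrative names:

```python
import pandas as pd

values = ["happy"] * 3 + ["content"] * 5 + ["sad"] * 6
response = pd.Series(
    pd.Categorical(values, categories=["happy", "content", "sad"], ordered=True)
)

counts = response.value_counts()

most_frequent = counts.idxmax()   # label with the highest count
least_frequent = counts.idxmin()  # label with the lowest count

print(most_frequent, least_frequent)

# mode() is another way to get the most frequent value; it returns a Series
# because several values can tie for the top spot.
print(response.mode()[0])
```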

You can learn more about the differences in working with categorical data in pandas from the official documentation page.