Introduction
In my previous article, I wrote about pandas data types: what they are and how to convert data to the appropriate type. This article will focus on the pandas categorical data type and some of the benefits and drawbacks of using it.
Pandas Category Data Type
To refresh your memory, here is a summary table of the various pandas data types (aka dtypes).
Pandas dtype | Python type | NumPy type | Usage |
---|---|---|---|
object | str | string_, unicode_ | Text |
int64 | int | int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64 | Integer numbers |
float64 | float | float_, float16, float32, float64 | Floating point numbers |
bool | bool | bool_ | True/False values |
datetime64 | NA | datetime64[ns] | Date and time values |
timedelta64[ns] | NA | NA | Differences between two datetimes |
category | NA | NA | Finite list of text values |
This article will focus on categorical data. As a quick refresher, categorical data is data which takes on a finite number of possible values. For example, if we were talking about a physical product like a t-shirt, it could have categorical variables such as:
- Size (X-Small, Small, Medium, Large, X-Large)
- Color (Red, Black, White)
- Style (Short sleeve, long sleeve)
- Material (Cotton, Polyester)
Attributes such as cost, price, quantity are typically integers or floats.
The key takeaway is that whether or not a variable is categorical depends on its application. Since we only have 3 colors of shirts, color is a good categorical variable. However, “color” could represent thousands of values in other situations so it would not be a good choice.
There is no hard and fast rule for how many values a categorical variable should have. You should apply your domain knowledge to make that determination on your own data sets. In this article, we will look at one approach for identifying categorical values.
The category data type in pandas is a hybrid data type. It looks and behaves like a string in many instances but internally is represented by an array of integers. This allows the data to be sorted in a custom order and to be stored more efficiently.
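To make the "array of integers" idea concrete, here is a minimal sketch using a made-up series of shirt colors (the values are illustrative and not from the data set used later). The `.cat` accessor exposes both the stored labels and the integer codes that point at them.

```python
import pandas as pd

# Hypothetical t-shirt colors, invented for illustration only.
colors = pd.Series(["Red", "Black", "White", "Red", "Red", "Black"], dtype="category")

# Each distinct label is stored once in the categories index...
print(colors.cat.categories.tolist())  # ['Black', 'Red', 'White']

# ...and every row holds a small integer code that refers to a label.
print(colors.cat.codes.tolist())       # [1, 0, 2, 1, 1, 0]
print(colors.cat.codes.dtype)          # int8
```

With only a handful of labels, pandas can use a one-byte integer per row, which is where the memory savings come from.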
At the end of the day, why do we care about using categorical values? There are 3 main reasons:
- We can define a custom sort order which can improve summarizing and reporting the data. In the example above, “X-Small” < “Small” < “Medium” < “Large” < “X-Large”. Alphabetical sorting would not be able to reproduce that order.
- Some of the python visualization libraries can interpret the categorical data type to apply appropriate statistical models or plot types.
- Categorical data uses less memory which can lead to performance improvements.
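The custom sort order in the first point can be sketched with the t-shirt sizes from earlier. The series below is invented for illustration; `CategoricalDtype` with `ordered=True` is the same mechanism used later in this article.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sizes in arbitrary order.
sizes = pd.Series(["Large", "X-Small", "Medium", "Small", "X-Large"])

# Plain strings sort alphabetically, which is not the order we want.
alphabetical = sizes.sort_values().tolist()
print(alphabetical)  # ['Large', 'Medium', 'Small', 'X-Large', 'X-Small']

# An ordered categorical sorts by the order in which the categories are listed.
size_type = CategoricalDtype(
    categories=["X-Small", "Small", "Medium", "Large", "X-Large"], ordered=True
)
logical = sizes.astype(size_type).sort_values().tolist()
print(logical)  # ['X-Small', 'Small', 'Medium', 'Large', 'X-Large']
```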
While categorical data is very handy in pandas, it is not necessary for every type of analysis. In fact, there can be some edge cases where defining a column of data as categorical then manipulating the dataframe can lead to some surprising results. Care must be taken to understand the data set and the necessary analysis before converting columns to categorical data types.
Data Preparation
One of the main use cases for categorical data types is more efficient memory usage. In order to demonstrate, we will use a large data set from the US Centers for Medicare and Medicaid Services. This data set includes a 500MB+ csv file that has information about research payments to doctors and hospitals in fiscal year 2017.
First, set up imports and read in all the data:

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    df_raw = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', low_memory=False)

I have included the low_memory=False parameter in order to suppress this warning:

    interactiveshell.py:2728: DtypeWarning: Columns (..) have mixed types. Specify dtype option on import or set low_memory=False.
      interactivity=interactivity, compiler=compiler, result=result)

Feel free to read more about this parameter in the pandas read_csv documentation.
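One interesting thing about this data set is that it has 176 columns but many of them are mostly empty. I found a Stack Overflow solution to quickly drop the sparse columns, keeping only those where at least 90% of the rows contain a value. I thought this might be handy for others as well.

    # thresh is the minimum number of non-null values a column needs in order to be kept.
    # how= is left out because recent pandas versions reject how and thresh together.
    drop_thresh = int(df_raw.shape[0] * .9)
    df = df_raw.dropna(thresh=drop_thresh, axis='columns').copy()
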
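Because the meaning of `thresh` is easy to get backwards, here is a small sketch on a toy frame. The column names and values are invented for illustration. `thresh` counts the non-null values a column must have to survive, so a column that is 90% populated is kept and one that is 90% empty is dropped.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; names and values are made up for this example.
toy = pd.DataFrame({
    "full": range(10),                         # 10 non-null values
    "mostly_full": [1.0] * 9 + [np.nan],       # 9 non-null values
    "mostly_empty": [np.nan] * 9 + [1.0],      # 1 non-null value
})

# Require at least 90% of the rows to be populated.
min_non_null = int(len(toy) * 0.9)
trimmed = toy.dropna(thresh=min_non_null, axis="columns")

print(trimmed.columns.tolist())  # ['full', 'mostly_full']
```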
Let’s take a look at the size of these various dataframes. Here is the original dataset:

    df_raw.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 607865 entries, 0 to 607864
    Columns: 176 entries, Change_Type to Context_of_Research
    dtypes: float64(34), int64(3), object(139)
    memory usage: 816.2+ MB

The 500MB csv file fills about 816MB of memory. This seems large but even a low-end laptop has several gigabytes of RAM so we are nowhere near the need for specialized processing tools.
Here is the data set we will use for the rest of the article:

    df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 607865 entries, 0 to 607864
    Data columns (total 33 columns):
    Change_Type                 607865 non-null object
    Covered_Recipient_Type      607865 non-null object
    .....
    Payment_Publication_Date    607865 non-null object
    dtypes: float64(2), int64(3), object(28)
    memory usage: 153.0+ MB

Now that we only have 33 columns, taking 153MB of memory, let’s take a look at which columns might be good candidates for a categorical data type.
In order to make this a little easier, I wrote a small helper snippet that builds a dataframe showing the number of unique values in each column.

    unique_counts = pd.DataFrame.from_records(
        [(col, df[col].nunique()) for col in df.columns],
        columns=['Column_Name', 'Num_Unique']).sort_values(by=['Num_Unique'])

| | Column_Name | Num_Unique |
---|---|---|
0 | Change_Type | 1 |
27 | Delay_in_Publication_Indicator | 1 |
31 | Program_Year | 1 |
32 | Payment_Publication_Date | 1 |
29 | Dispute_Status_for_Publication | 2 |
26 | Preclinical_Research_Indicator | 2 |
22 | Related_Product_Indicator | 2 |
25 | Form_of_Payment_or_Transfer_of_Value | 3 |
1 | Covered_Recipient_Type | 4 |
14 | Principal_Investigator_1_Country | 4 |
15 | Principal_Investigator_1_Primary_Type | 6 |
6 | Recipient_Country | 9 |
21 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 20 |
4 | Recipient_State | 53 |
12 | Principal_Investigator_1_State | 54 |
17 | Principal_Investigator_1_License_State_code1 | 54 |
16 | Principal_Investigator_1_Specialty | 243 |
24 | Date_of_Payment | 365 |
18 | Submitting_Applicable_Manufacturer_or_Applicab… | 478 |
19 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 551 |
20 | Applicable_Manufacturer_or_Applicable_GPO_Maki… | 557 |
11 | Principal_Investigator_1_City | 4101 |
3 | Recipient_City | 4277 |
8 | Principal_Investigator_1_First_Name | 8300 |
5 | Recipient_Zip_Code | 12826 |
28 | Name_of_Study | 13015 |
13 | Principal_Investigator_1_Zip_Code | 13733 |
9 | Principal_Investigator_1_Last_Name | 21420 |
10 | Principal_Investigator_1_Business_Street_Addre… | 29026 |
7 | Principal_Investigator_1_Profile_ID | 29696 |
2 | Recipient_Primary_Business_Street_Address_Line1 | 38254 |
23 | Total_Amount_of_Payment_USDollars | 141959 |
30 | Record_ID | 607865 |
This table highlights a couple of items that will help determine which values should be categorical. First, there is a big jump in unique values once we get above 557 unique values. This should be a useful threshold for this data set.
In addition, the date fields should not be converted to categorical.
The simplest way to convert a column to a categorical type is to use astype('category'). We can use a loop to convert all the columns we care about:

    cols_to_exclude = ['Program_Year', 'Date_of_Payment', 'Payment_Publication_Date']
    for col in df.columns:
        if df[col].nunique() < 600 and col not in cols_to_exclude:
            df[col] = df[col].astype('category')

If we use df.info() to look at the memory usage, we have taken the 153 MB dataframe down to 82.4 MB. This is pretty impressive. We have cut the memory usage almost in half just by converting to categorical values for the majority of our columns.
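If you want to check the savings for a single column before converting a whole dataframe, `memory_usage(deep=True)` reports the bytes used, including the string contents. The sketch below uses a synthetic column (four labels repeated 25,000 times), not the CMS file, so the exact numbers will differ from those above.

```python
import pandas as pd

# Synthetic column: 4 distinct labels repeated many times (illustrative data only).
labels = [
    "Covered Recipient Physician",
    "Covered Recipient Teaching Hospital",
    "Non-covered Recipient Entity",
    "Non-covered Recipient Individual",
]
as_object = pd.Series(labels * 25_000, dtype="object")
as_category = as_object.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The fewer distinct values a column has relative to its length, the larger the gap tends to be. A column where nearly every value is unique (such as Record_ID) would gain little or nothing.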
There is one other feature we can use with categorical data: defining a custom order. To illustrate, let’s do a quick summary of the total payments made by covered recipient type:
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
---|---|
Covered Recipient Physician | 7.912815e+07 |
Covered Recipient Teaching Hospital | 1.040372e+09 |
Non-covered Recipient Entity | 3.536595e+09 |
Non-covered Recipient Individual | 2.832901e+06 |
If we want to change the order of the Covered_Recipient_Type, we need to define a custom CategoricalDtype:

    cats_to_order = ["Non-covered Recipient Entity", "Covered Recipient Teaching Hospital",
                     "Covered Recipient Physician", "Non-covered Recipient Individual"]
    covered_type = CategoricalDtype(categories=cats_to_order, ordered=True)

Then, explicitly reorder the category:
df['Covered_Recipient_Type'] = df['Covered_Recipient_Type'].cat.reorder_categories(cats_to_order, ordered=True)
Now, we can see the sort order in effect with the groupby:
df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()
Covered_Recipient_Type | Total_Amount_of_Payment_USDollars |
---|---|
Non-covered Recipient Entity | 3.536595e+09 |
Covered Recipient Teaching Hospital | 1.040372e+09 |
Covered Recipient Physician | 7.912815e+07 |
Non-covered Recipient Individual | 2.832901e+06 |
If you have this same type of data file that you will be processing repeatedly, you can specify this conversion when reading the csv by passing a dictionary of column names and types via the dtype parameter.
df_raw_2 = pd.read_csv('OP_DTL_RSRCH_PGYR2017_P06292018.csv', dtype={'Covered_Recipient_Type':covered_type})
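To try the `dtype` approach without downloading the full file, here is a self-contained sketch that reads a tiny in-memory CSV. The rows and the `recipient_type` name are invented for illustration; only the column names mirror the real data set.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

# A few made-up rows standing in for the real csv file.
csv_text = io.StringIO(
    "Covered_Recipient_Type,Total_Amount_of_Payment_USDollars\n"
    "Covered Recipient Physician,100.0\n"
    "Non-covered Recipient Entity,250.5\n"
    "Covered Recipient Physician,75.25\n"
)

# Ordered category definition applied while the file is parsed.
recipient_type = CategoricalDtype(
    categories=["Non-covered Recipient Entity", "Covered Recipient Physician"],
    ordered=True,
)
sample = pd.read_csv(csv_text, dtype={"Covered_Recipient_Type": recipient_type})

print(sample.dtypes)
print(sample["Covered_Recipient_Type"].cat.categories.tolist())
```

The column arrives already typed and ordered, so no separate conversion step is needed after loading.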
Performance
We’ve shown that the size of the dataframe is reduced by converting values to categorical data types. Does this impact other areas of performance? The answer is yes.
Here is an example of a groupby operation on the categorical vs. object data types. First, perform the analysis on the original input dataframe.

    %%timeit
    df_raw.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

40.3 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now, on the dataframe with categorical data:

    %%timeit
    df.groupby('Covered_Recipient_Type')['Total_Amount_of_Payment_USDollars'].sum().to_frame()

4.51 ms ± 96.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In this case we sped up the code by roughly 9x, going from 40.3 ms to 4.51 ms. You can imagine that on much larger data sets, the speedup could be even greater.
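Outside of a notebook, a similar comparison can be sketched with the standard library `timeit` module. The data here is randomly generated and the column names are hypothetical, so the timings will not match the numbers above and will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data: 200,000 rows with 4 distinct group labels (illustrative only).
rng = np.random.default_rng(0)
n = 200_000
frame = pd.DataFrame({
    "recipient": rng.choice(["Physician", "Hospital", "Entity", "Individual"], size=n),
    "amount": rng.random(n),
})
frame["recipient"] = frame["recipient"].astype(object)
frame_cat = frame.assign(recipient=frame["recipient"].astype("category"))


def total_by_recipient(data):
    """Sum the amount column for each recipient group."""
    return data.groupby("recipient", observed=True)["amount"].sum()


# Run each version a few times and report the total elapsed seconds.
t_object = timeit.timeit(lambda: total_by_recipient(frame), number=5)
t_category = timeit.timeit(lambda: total_by_recipient(frame_cat), number=5)

print(f"object:   {t_object:.4f} s for 5 runs")
print(f"category: {t_category:.4f} s for 5 runs")
```

Both versions return the same totals; only the time spent grouping differs.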
Watch Outs
Photo credit: Frans Van Heerden
Categorical data seems pretty nifty. It saves memory and speeds up code, so why not use it everywhere? Well, Donald Knuth is correct when he warns about premature optimization:
The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.
In the examples above, the code is faster but it really does not matter when it is used for quick summary actions that are run infrequently. In addition, all the work to figure out and convert to categorical data is probably not worth it for this data set and this simple analysis.
In addition, categorical data can yield some surprising behaviors in real world usage. The examples below will illustrate a couple of issues.
Let’s build a simple dataframe with one ordered categorical variable that represents the status of the customer. This trivial example will highlight some potential subtle errors when dealing with categorical values. It is worth noting that this example shows how to use astype() to convert to the ordered category in one step instead of the two step process used earlier.

    import pandas as pd
    from pandas.api.types import CategoricalDtype

    sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
               {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
               {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
    df_1 = pd.DataFrame(sales_1)
    status_type = CategoricalDtype(categories=['Silver', 'Gold'], ordered=True)
    df_1['Status'] = df_1['Status'].astype(status_type)

This yields a simple dataframe that looks like this:
| | Feb | Jan | Mar | Status | account |
---|---|---|---|---|---|
0 | 200 | 150 | 140 | Gold | Jones LLC |
1 | 210 | 200 | 215 | Gold | Alpha Co |
2 | 90 | 50 | 95 | Silver | Blue Inc |
We can inspect the categorical column in more detail:

    df_1['Status']

    0      Gold
    1      Gold
    2    Silver
    Name: Status, dtype: category
    Categories (2, object): [Silver < Gold]

All looks good. We see the data is all there and that Gold is greater than Silver.
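Because the category is ordered, comparisons and aggregations such as `max()` follow the declared order rather than alphabetical order. A minimal, self-contained sketch using the same two status values:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

status_type = CategoricalDtype(categories=["Silver", "Gold"], ordered=True)
status = pd.Series(["Gold", "Gold", "Silver"]).astype(status_type)

# Element-wise comparison against a value that is one of the categories.
print((status > "Silver").tolist())    # [True, True, False]

# max() and sort_values() respect the Silver < Gold ordering.
print(status.max())                    # Gold
print(status.sort_values().tolist())   # ['Silver', 'Gold', 'Gold']
```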
Now, let’s bring in another dataframe and apply the same category to the status column:

    sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
               {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
    df_2 = pd.DataFrame(sales_2)
    df_2['Status'] = df_2['Status'].astype(status_type)

| | Feb | Jan | Mar | Status | account |
---|---|---|---|---|---|
0 | 100 | 100 | 70 | Silver | Smith Co |
1 | 65 | 310 | 80 | NaN | Bingo |
Hmm. Something happened to our status. If we just look at the column in more detail:

    df_2['Status']

    0    Silver
    1       NaN
    Name: Status, dtype: category
    Categories (2, object): [Silver < Gold]

We can see that since we did not define “Bronze” as a valid status, we end up with a NaN value. Pandas does this for a perfectly good reason. It assumes that you have defined all of the valid categories and in this case, “Bronze” is not valid. You can just imagine how confusing this issue could be to troubleshoot if you were not looking out for it.
This scenario is relatively easy to see but what would you do if you had hundreds of values and the data was not cleaned and normalized properly?
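One possible safeguard, sketched here with made-up values, is to compare the raw values against the category definition before converting, and to count how many rows turned into NaN afterwards. The variable names (`raw_status`, `unexpected`, `lost`) are illustrative and not part of any pandas API.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

status_type = CategoricalDtype(categories=["Silver", "Gold"], ordered=True)

# Hypothetical raw column with one unknown label and one badly normalized label.
raw_status = pd.Series(["Silver", "Bronze", "Gold", "bronze "])

# Values present in the data but missing from the category definition.
unexpected = sorted(set(raw_status.dropna()) - set(status_type.categories))
print(unexpected)  # ['Bronze', 'bronze ']

# After converting, count the rows that became NaN because of the conversion.
converted = raw_status.astype(status_type)
lost = int(converted.isna().sum() - raw_status.isna().sum())
print(lost)  # 2
```

A check like this turns a silent NaN into something you can log or fail on before the data moves further down the pipeline.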
Here’s another tricky example where you can “lose” the category object:

    sales_1 = [{'account': 'Jones LLC', 'Status': 'Gold', 'Jan': 150, 'Feb': 200, 'Mar': 140},
               {'account': 'Alpha Co', 'Status': 'Gold', 'Jan': 200, 'Feb': 210, 'Mar': 215},
               {'account': 'Blue Inc', 'Status': 'Silver', 'Jan': 50, 'Feb': 90, 'Mar': 95}]
    df_1 = pd.DataFrame(sales_1)

    # Define an unordered category
    df_1['Status'] = df_1['Status'].astype('category')

    sales_2 = [{'account': 'Smith Co', 'Status': 'Silver', 'Jan': 100, 'Feb': 100, 'Mar': 70},
               {'account': 'Bingo', 'Status': 'Bronze', 'Jan': 310, 'Feb': 65, 'Mar': 80}]
    df_2 = pd.DataFrame(sales_2)
    df_2['Status'] = df_2['Status'].astype('category')

    # Combine the two dataframes into 1
    df_combined = pd.concat([df_1, df_2])

| | Feb | Jan | Mar | Status | account |
---|---|---|---|---|---|
0 | 200 | 150 | 140 | Gold | Jones LLC |
1 | 210 | 200 | 215 | Gold | Alpha Co |
2 | 90 | 50 | 95 | Silver | Blue Inc |
0 | 100 | 100 | 70 | Silver | Smith Co |
1 | 65 | 310 | 80 | Bronze | Bingo |
Everything looks ok but upon further inspection, we’ve lost our category data type:

    df_combined['Status']

    0      Gold
    1      Gold
    2    Silver
    0    Silver
    1    Bronze
    Name: Status, dtype: object

In this case, the data is still there but the type has been converted to an object. Once again, this is pandas’ attempt to combine the data without throwing errors but not making assumptions. If you want to convert to a category data type now, you can use astype('category').
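As an alternative to converting back after the fact, pandas provides `union_categoricals` in `pandas.api.types`, which merges the category sets of several categoricals. The sketch below applies it to two standalone status columns that mirror the example; treat it as one option under these assumptions, not the only way to handle this.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two unordered categorical columns with different category sets.
status_1 = pd.Series(["Gold", "Gold", "Silver"], dtype="category")
status_2 = pd.Series(["Silver", "Bronze"], dtype="category")

# Plain concat does not keep the category dtype because the category sets differ.
plain = pd.concat([status_1, status_2], ignore_index=True)
print(plain.dtype)

# union_categoricals merges the category sets and keeps the result categorical.
combined = pd.Series(union_categoricals([status_1, status_2]))
print(combined.dtype)                      # category
print(sorted(combined.cat.categories))     # ['Bronze', 'Gold', 'Silver']
```

Note that this works here because both columns are unordered; ordered categoricals need matching categories and order to be combined this way.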
General Guidelines
Now that you know about these gotchas, you can watch out for them. But I will give a few guidelines for how I recommend using categorical data types:
- Do not assume you need to convert all categorical data to the pandas category data type.
- If the data set starts to approach an appreciable percentage of your usable memory, then consider using categorical data types.
- If you have very significant performance concerns with operations that are executed frequently, look at using categorical data.
- If you are using categorical data, add some checks to make sure the data is clean and complete before converting to the pandas category type. Additionally, check for NaN values after combining or converting dataframes.
I hope this article was helpful. Categorical data types in pandas can be very useful. However, there are a few issues that you need to keep an eye out for so that you do not get tripped up in subsequent processing. Feel free to add any additional tips or questions in the comments section below.
Changes
- 6-Dec-2020: Fix typo in groupby example