Representing Categorical Data in Machine Learning

Mathanraj Sharma
3 min read · May 3, 2019


Machine learning algorithms are built on mathematical functions (to be more precise, compositions of functions). These algorithms try to find relationships among data points in order to generalize well. Because they are mathematical functions, they cannot directly handle categorical data such as strings or characters. Let's dig in and discuss some of the techniques used to represent categorical data, and how to choose the right one for your needs.

Label Encoding

This is the easiest way to handle categorical data. It simply assigns a unique numeric label to each distinct value in a given data column.

For example, assume we have a column in our data frame with four distinct values: ['New York', 'London', 'Toronto', 'Delhi']. Labeling assigns a different numeric value to each of them: [0, 1, 2, 3].

Before labeling: ['New York', 'New York', 'Delhi', 'Toronto', 'New York', 'London', 'Toronto', …]

After labeling: [0, 0, 3, 2, 0, 1, 2, …]
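As a concrete sketch, label encoding can be done with scikit-learn's `LabelEncoder`. Note that it assigns labels in alphabetical order of the distinct values, so the exact numbers differ from the hand-picked mapping above:

```python
# Label encoding with scikit-learn; the sample values mirror the text above.
from sklearn.preprocessing import LabelEncoder

cities = ['New York', 'New York', 'Delhi', 'Toronto',
          'New York', 'London', 'Toronto']

encoder = LabelEncoder()
labels = encoder.fit_transform(cities)

# classes_ holds the distinct values in sorted order:
# Delhi -> 0, London -> 1, New York -> 2, Toronto -> 3
print(list(encoder.classes_))
print(list(labels))  # [2, 2, 0, 3, 2, 1, 3]
```

`inverse_transform` reverses the mapping, which is handy when you need the original category names back after prediction.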

Dummy Encoding

Dummy encoding creates a separate binary column for each distinct value in your categorical column, with the respective value as the column name. In a regression setting, one of the levels is typically dropped and serves as the reference (baseline) category: its effect is absorbed into the regression intercept, and the coefficients of the remaining dummy columns are interpreted relative to that baseline.

(For more, see: https://gerardnico.com/data_mining/dummy)

Before dummy encoding: ['New York', 'New York', 'Delhi', 'Toronto', 'New York', 'London', 'Toronto', …]

After Dummy Encoding:
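A minimal sketch of dummy encoding using pandas' `get_dummies`. Passing `drop_first=True` drops the first level (here 'Delhi', since columns are created in sorted order), making it the baseline category whose effect is absorbed into the intercept:

```python
import pandas as pd

df = pd.DataFrame({'city': ['New York', 'New York', 'Delhi', 'Toronto',
                            'New York', 'London', 'Toronto']})

# One binary column per remaining level; 'Delhi' becomes the baseline.
dummies = pd.get_dummies(df['city'], prefix='city', drop_first=True)
print(dummies)
```

A 'Delhi' row is encoded as all zeros, while every other row has exactly one column set. Without `drop_first`, you get one column per level, which is closer to a plain one-hot layout.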

One-Hot Encoding

One-hot encoding is another well-recognized technique for representing categorical data. Rather than adding multiple named columns the way dummy encoding does, it replaces each value with a binary vector whose length equals the number of distinct values.

Before OHE: ['New York', 'New York', 'Delhi', 'Toronto', 'New York', 'London', 'Toronto', …]

After OHE:

Look at the end result: in any given row's vector, exactly one index is 1 and all the others are 0. This is why it is called one-hot encoding.
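One way to produce this representation is scikit-learn's `OneHotEncoder` (a minimal sketch; note that the keyword for dense output was renamed from `sparse` to `sparse_output` in scikit-learn 1.2, so both spellings are tried here):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# OneHotEncoder expects a 2-D array: one column of categorical values.
cities = np.array(['New York', 'New York', 'Delhi', 'Toronto',
                   'New York', 'London', 'Toronto']).reshape(-1, 1)

try:
    encoder = OneHotEncoder(sparse_output=False)  # scikit-learn >= 1.2
except TypeError:
    encoder = OneHotEncoder(sparse=False)         # older versions

onehot = encoder.fit_transform(cities)
# Each row is a binary vector with exactly one 1; column order follows
# the sorted category list (Delhi, London, New York, Toronto).
print(onehot)
```

Keeping the fitted encoder around matters in practice: you must apply the same `transform` (same category-to-index mapping) to training and test data.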

Hopefully the differences among the three encodings are clear by now. But the question remains: when should you use which encoding?

It depends heavily on which algorithm you are going to use. Many clustering algorithms and deep neural networks learn a single weight per feature, or compute distances (such as Euclidean distance) between data points. In such cases, label encoding does not perform well. The underlying data is nominal, so the numeric labels carry no meaningful magnitude or order, and the model may be misled into false assumptions based on them (for example, treating 'Delhi' = 3 as somehow "greater than" 'London' = 1). It is better to go with dummy encoding or one-hot encoding.

One-hot encoding is a good representation for neural network learning (tree-based methods such as random forests, which can work with categorical splits directly, are an exception). But, as the examples above show, dummy and one-hot encodings add more columns to your data set, which consumes more space and computing power during training. For small data sets, however, they work well.

In conclusion, choosing a suitable encoding for categorical data is a trade-off among the size of your dataset, memory, computing time, and the learning algorithm you are going to use.

Hope this article gives you a better understanding of the different ways to represent categorical data. Applaud and share if you found it worthwhile.
