One-Hot Encoding

Representing categorical data as numbers

Numerical algorithms need numbers as inputs.

But how do we feed categorical data into numerical algorithms?

Binary Data

For binary data, a natural numerical representation is to use 1 and 0.

For example, suppose our algorithm works on data about generic items, and one variable records whether an item is a gift. We can use 1 if the item is a gift and 0 otherwise:

$$\begin{array}{c|c} & \text{Value} \\
\hline \text{Gift} & 1 \\
\hline \text{No Gift} & 0 \end{array}$$
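As a minimal sketch of this binary encoding, the snippet below follows the convention stated in the text (1 for a gift, 0 otherwise). The item records and the `is_gift` field name are hypothetical, chosen only for illustration.

```python
# Hypothetical records; the "is_gift" field name is an assumption for illustration.
items = [
    {"name": "book", "is_gift": True},
    {"name": "lamp", "is_gift": False},
    {"name": "mug", "is_gift": True},
]


def encode_binary(flag: bool) -> int:
    """Map a two-valued variable to 1 (gift) or 0 (no gift)."""
    return 1 if flag else 0


# Build the numerical column that an algorithm could consume.
gift_column = [encode_binary(item["is_gift"]) for item in items]
print(gift_column)  # [1, 0, 1]
```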

Multi-Class Data

What if there are more classes?

Suppose we have a variable that represents the color of an item, and the possible colors are red, blue, and green.

We could represent the colors with numbers: 0 for red, 1 for blue, and 2 for green:

$$\begin{array}{c|c} & \text{Value} \\
\hline \text{Red} & 0 \\
\hline \text{Blue} & 1 \\
\hline \text{Green} & 2 \end{array}$$


Considering three items of different colors, we would have:

$$\begin{array}{c|c} & \text{Color} \\
\hline \text{Item 1} & 1 \\
\hline \text{Item 2} & 0 \\
\hline \text{Item 3} & 2 \end{array}$$


That is, Item 1 is blue, Item 2 is red, and Item 3 is green.

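But this is not a good choice, because integer codes implicitly impose numerical properties that the colors do not have, such as an ordering (“green is greater than red”) or the idea that red is closer to blue than to green.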
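To make the problem concrete, here is a small sketch using the integer codes from the text (0 for red, 1 for blue, 2 for green). The helper name `code_distance` is an assumption introduced for illustration. It shows that the codes carry an order and distances that say nothing about the colors themselves.

```python
# Integer codes from the text: 0 for red, 1 for blue, 2 for green.
color_codes = {"red": 0, "blue": 1, "green": 2}


def code_distance(a: str, b: str) -> int:
    """Absolute difference between the integer codes of two colors."""
    return abs(color_codes[a] - color_codes[b])


# The encoding claims an order between colors...
print(color_codes["green"] > color_codes["red"])  # True

# ...and makes red look closer to blue than to green.
print(code_distance("red", "blue"))   # 1
print(code_distance("red", "green"))  # 2
```

An algorithm that compares or averages these values would treat those artificial relations as real information.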

A good way to represent a multi-class variable numerically is One-Hot Encoding.
We use as many variables as there are possible classes, assigning 1 to the variable for the class the item belongs to and 0 to all the others:

$$\begin{array}{c|c|c|c} & \text{Red} & \text{Blue} & \text{Green} \\
\hline \text{Item 1} & 0 & 1 & 0 \\
\hline \text{Item 2} & 1 & 0 & 0\\
\hline \text{Item 3} & 0 & 0 & 1 \end{array}$$


In this way we have one column for each possible class, and each item has a single 1, in the column of the class it belongs to, and 0 in all the other columns.
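The encoding described here can be sketched in a few lines of plain Python. This is only an illustration: the function names `one_hot` and `hamming` are assumptions, and in practice a library routine would usually be used instead.

```python
# Column order matches the table: Red, Blue, Green.
classes = ["red", "blue", "green"]


def one_hot(value: str, classes: list) -> list:
    """Return a list with 1 in the position of `value` and 0 elsewhere."""
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    return [1 if value == c else 0 for c in classes]


def hamming(u: list, v: list) -> int:
    """Count the positions in which two vectors differ."""
    return sum(a != b for a, b in zip(u, v))


# Item 1 is blue, Item 2 is red, Item 3 is green.
item_colors = ["blue", "red", "green"]
encoded = [one_hot(color, classes) for color in item_colors]

for row in encoded:
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 1]
```

Unlike the integer codes, any two different classes here differ in exactly two positions (for example, `hamming(encoded[0], encoded[1])` is 2), so no color is "greater than" or "closer to" another.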

Now we can easily input these data into an algorithm that works with numbers.

Fabrizio Cacicia
Artificial Intelligence and Robotics student

Passionate about Computer Vision and Artificial Intelligence, experience in mobile development
