Numerical algorithms need numbers as inputs.
But how do we feed categorical data into numerical algorithms?
Binary Data
Binary data is the easy case: the two possible values can be represented as 1 and 0.
For example, suppose our algorithm works on data about generic items, and one variable records whether an item is a gift. We can use 1 if the item is a gift and 0 otherwise:
$$\begin{array}{c|c} & \text{Value} \\
\hline \text{Gift} & 1 \\
\hline \text{No Gift} & 0 \end{array}$$
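As a minimal sketch of this mapping in Python, the snippet below encodes a yes/no flag as 1 or 0. The item list, the `is_gift` key, and the `encode_binary` helper are hypothetical names chosen for illustration only.

```python
# Hypothetical sample data: each item records whether it is a gift.
items = [
    {"name": "book", "is_gift": True},
    {"name": "lamp", "is_gift": False},
    {"name": "mug", "is_gift": True},
]


def encode_binary(flag: bool) -> int:
    """Map a yes/no value to 1 (yes) or 0 (no)."""
    return 1 if flag else 0


# Encode the "being a gift" variable for every item.
gift_values = [encode_binary(item["is_gift"]) for item in items]
print(gift_values)  # [1, 0, 1]
```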
Multi-Class Data
What if there are more classes?
Suppose we have a variable that represents the color of an item, and the possible colors are red, blue, and green.
We could represent the colors with numbers: 0 for red, 1 for blue, 2 for green:
$$\begin{array}{c|c} & \text{Value} \\
\hline \text{Red} & 0 \\
\hline \text{Blue} & 1 \\
\hline \text{Green} & 2 \end{array}$$
Considering three items of different colors, we would have:
$$\begin{array}{c|c} & \text{Color} \\
\hline \text{Item 1} & 1 \\
\hline \text{Item 2} & 0 \\
\hline \text{Item 3} & 2 \end{array}$$
This means that Item 1 is blue, Item 2 is red, and Item 3 is green.
But this is not a good choice, because it implicitly imposes numerical properties that the colors do not have, such as an ordering (“green is more than red”) or a distance (green ends up twice as far from red as blue is).
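To make the problem concrete, here is a small Python sketch of the integer encoding. It uses the same codes as the color table above; the variable names are illustrative assumptions.

```python
# Integer codes for the colors: red=0, blue=1, green=2.
color_codes = {"red": 0, "blue": 1, "green": 2}

# Item 1 is blue, Item 2 is red, Item 3 is green.
item_colors = ["blue", "red", "green"]
encoded = [color_codes[color] for color in item_colors]
print(encoded)  # [1, 0, 2]

# The codes carry an order that the colors themselves do not have.
print(color_codes["green"] > color_codes["red"])  # True

# They also carry distances: red looks closer to blue than to green.
dist_red_blue = abs(color_codes["red"] - color_codes["blue"])
dist_red_green = abs(color_codes["red"] - color_codes["green"])
print(dist_red_blue, dist_red_green)  # 1 2
```

An algorithm that compares or subtracts these values would treat that order and those distances as real information, even though they only come from the arbitrary choice of codes.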
A good way to represent a multi-class variable numerically is one-hot encoding.
We use as many variables as there are possible classes, assigning 1 to the variable for the class the item belongs to and 0 to all the others:
$$\begin{array}{c|c|c|c} & \text{Red} & \text{Blue} & \text{Green} \\
\hline \text{Item 1} & 0 & 1 & 0 \\
\hline \text{Item 2} & 1 & 0 & 0\\
\hline \text{Item 3} & 0 & 0 & 1 \end{array}$$
This way, there is one column per possible class, and each item has exactly one 1, in the column of the class it belongs to, and 0 in all the other columns.
Now we can easily feed this data into an algorithm that works with numbers.
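One-hot encoding can be sketched in a few lines of plain Python. The `one_hot` and `squared_distance` helpers below are hypothetical names written for this example, not part of any library, and the column order (red, blue, green) is assumed to match the table above.

```python
from typing import List

# Column order of the encoding: one column per possible class.
classes = ["red", "blue", "green"]


def one_hot(value: str, classes: List[str]) -> List[int]:
    """Return a vector with 1 in the position of `value` and 0 elsewhere."""
    if value not in classes:
        raise ValueError(f"unknown class: {value!r}")
    return [1 if value == c else 0 for c in classes]


def squared_distance(a: List[int], b: List[int]) -> int:
    """Squared Euclidean distance between two encoded vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Item 1 is blue, Item 2 is red, Item 3 is green.
item_colors = ["blue", "red", "green"]
encoded = [one_hot(color, classes) for color in item_colors]
for row in encoded:
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 1]

# Every pair of different colors is now equally far apart.
print(squared_distance(encoded[0], encoded[1]))  # 2
print(squared_distance(encoded[1], encoded[2]))  # 2
print(squared_distance(encoded[0], encoded[2]))  # 2
```

Unlike the integer codes, no color is "more" than another and all pairs are at the same distance. In practice, libraries such as pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) offer ready-made versions of this transformation.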