Some problems in Machine Learning can be answered with binary outputs like “yes/no”, “1/0”, or “true/false”. Others require choosing among multiple classes: for example, determining the color of an object or which animal appears in a picture.
A Multi-Class Classification example
Suppose we want to classify which animal appears in a picture, choosing among a set of possible animals.
What we need is a model that gives us, for each possible type, the probability that the animal is of that type. For example, with 3 categories (dog, cat, and bird) we want to know the probability that the animal is a dog, a cat, or a bird.
Now, suppose we have a linear model that, based on some inputs like the number of teeth, the number of feathers, the presence of hair, etc., evaluates a linear equation on these input values and outputs a score (a number) for each class.
We need to translate these scores into probabilities.
First, we know that probabilities need to sum to 1.
If it is not obvious why “probabilities need to sum to 1”, a brief example should help.
Let’s start by considering the probability of receiving a gift (or not) on a specific day. To predict it, we could use information such as whether it is our birthday, whether it is Christmas, whether we have been generous, etc.
Suppose it is Christmas and we love our parents: we could predict an 80% probability of receiving a gift. Almost everyone can see that the probability of not receiving a gift on that same day is then 20%. If we sum 80% + 20% we obtain 100%, which can be put into words by saying that “among all the possibilities of receiving a gift or not (the 100%), we have an 80% chance of receiving it and a 20% chance of not receiving it”.
Another way to write these numbers is to represent 100% as 1.00, and consequently 80% as 0.80 and 20% as 0.20. Now we can verify for the first time that the probabilities sum to 1: $0.80+0.20 = 1.00$
So, suppose that our linear model receives the inputs for an animal to categorize and gives us the following scores:
$$\begin{cases} Dog: 2 \\
Cat: 1 \\
Bird: 0 \end{cases}$$
We need to convert these numbers so that they sum to 1. We can easily do this by turning each score into a fraction over the total sum of the scores:
$$\begin{cases} Dog: \frac{2}{2+1+0}=\frac{2}{3} \\
Cat: \frac{1}{2+1+0}=\frac{1}{3} \\
Bird: \frac{0}{2+1+0}=\frac{0}{3} \end{cases}$$
They clearly sum to 1:
$$\frac{2}{3}+\frac{1}{3}+\frac{0}{3}=\frac{3}{3}=1$$
And once converted into fractions, we can express them as decimals:
$$\begin{cases} Dog:\frac{2}{3}=0.67 \\
Cat:\frac{1}{3}=0.33 \\
Bird:\frac{0}{3}=0 \end{cases}$$
(obviously, they also sum to 1 in decimal form: $0.67+0.33+0=1$)
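To make the procedure concrete, here is a minimal Python sketch of this naive normalization; the dictionary of scores and the variable names are just for illustration, not part of any library:

```python
# Naive normalization: divide each score by the total sum of the scores.
scores = {"dog": 2, "cat": 1, "bird": 0}

total = sum(scores.values())                                   # 2 + 1 + 0 = 3
probabilities = {a: s / total for a, s in scores.items()}

print(probabilities)                # {'dog': 0.666..., 'cat': 0.333..., 'bird': 0.0}
print(sum(probabilities.values())) # 1.0
```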
Another property that we need is that a higher score from the linear model must translate into a higher probability. And we can see that this property is also respected here.
But this solution doesn’t actually work in general. What if the linear equation returns a negative score? How can we convert a negative score into a probability? And what if the scores are 1, 0, -1? Their total sum is 0, so computing the probabilities would mean dividing by 0.
We need a function that turns all negative numbers into positive ones while preserving the relationship between higher scores and higher probabilities, i.e. we need a function that is always positive and strictly increasing. One might think of using the Absolute Value function, which certainly converts negative numbers into positive ones, but it does not preserve the second property.
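To see both failure modes concretely, here is a short sketch using the problematic scores mentioned above:

```python
# Failure 1: with scores 1, 0, -1 the total sum is 0,
# so the naive normalization would divide by zero.
scores = [1, 0, -1]
total = sum(scores)  # 0
# probabilities = [s / total for s in scores]  # ZeroDivisionError

# Failure 2: the Absolute Value makes every score positive,
# but it breaks the ordering between scores and probabilities:
# a score of -3 should mean a lower probability than a score of 1,
# yet abs(-3) = 3 ends up with a larger weight than abs(1) = 1.
print([abs(s) for s in [-3, 1]])  # [3, 1]
```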
The Softmax Function
A function that has both properties is the Exponential function, which is always positive and strictly increasing.
Now that we know we can use the Exponential function, let’s repeat the same reasoning as before, but instead of using the scores directly we’ll use their exponentials.
$$\begin{cases} Dog: 2 \rightarrow e^2 \\
Cat: 1 \rightarrow e^1 \\
Bird: 0 \rightarrow e^0 \end{cases}$$
So:
$$\begin{cases} Dog:\frac{e^2}{e^2+e^1+e^0}=0.67 \\
Cat:\frac{e^1}{e^2+e^1+e^0}=0.24 \\
Bird:\frac{e^0}{e^2+e^1+e^0}=0.09 \end{cases}$$
We can check that $0.67 + 0.24 + 0.09 = 1$ and that $P(Dog)>P(Cat)>P(Bird)$, so both properties are respected.
What we used to convert the scores into probabilities is called the Softmax function.
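Here is a minimal Python sketch of this computation using only the standard library; it reproduces the values above (the class names are the ones from our example):

```python
import math

# Softmax: exponentiate each score, then normalize by the total.
scores = {"dog": 2, "cat": 1, "bird": 0}

exp_scores = {a: math.exp(s) for a, s in scores.items()}       # e^2, e^1, e^0
total = sum(exp_scores.values())                               # ~11.107
probabilities = {a: e / total for a, e in exp_scores.items()}

for animal, p in probabilities.items():
    print(f"{animal}: {p:.2f}")  # dog: 0.67, cat: 0.24, bird: 0.09
```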
To summarize
We have the problem of classifying an item $i$ into one of $n$ classes.
We have a linear model that, for that item, gives us a score for each of the $n$ classes:
$$\text{scores} = z_1, \dots, z_n$$
We can obtain the probability that the item $i$ belongs to a certain class $c$ by applying the Softmax function:
$$P(c)=\frac{e^{z_c}}{e^{z_1}+\dots+e^{z_n}}=\frac{e^{z_c}}{\sum_{j=1}^{n}e^{z_j}}\ \text{for}\ c=1, \dots, n$$
$$\begin{cases} P(c=1):\frac{e^{z_1}}{e^{z_1}+\dots+e^{z_n}}=\frac{e^{z_1}}{\sum_{j=1}^{n}e^{z_j}} \\
\qquad \vdots \\
P(c=n):\frac{e^{z_n}}{e^{z_1}+\dots+e^{z_n}}=\frac{e^{z_n}}{\sum_{j=1}^{n}e^{z_j}} \end{cases}$$
The last expression is the actual output of the Softmax function, i.e. the probability of belonging to each class.
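As a final recap, here is a minimal, self-contained Python sketch of the general formula. It adds one detail beyond the text above: the common max-subtraction trick (subtracting the largest score before exponentiating), which does not change the output, since the common factor cancels in the ratio, but avoids numerical overflow for large scores.

```python
import math

def softmax(z):
    """Map scores z_1, ..., z_n to probabilities that sum to 1."""
    # Subtracting the maximum score leaves the result unchanged
    # (the common factor e^{-max} cancels in the ratio) but keeps
    # the exponentials small, avoiding overflow for large scores.
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2, 1, 0]))        # ~[0.665, 0.245, 0.090]
print(softmax([1, 0, -1]))       # negative scores are no longer a problem
print(sum(softmax([1, 0, -1])))  # 1.0 (up to floating-point rounding)
```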