Entropy, cross-entropy, and KL divergence
February 19, 2020
Let’s say we are trying to create a model that distinguishes between cats, dogs, horses, and birds - these are the states that form our state space. Furthermore, let’s say we have labels for each example, which gives us a “ground-truth” probability distribution over the states. For example, if we have a vector where each element is a state, and the first element represents a bird, then the state-probability distribution for a bird can be represented as the vector [1, 0, 0, 0]. This is named “ground-truth” since one state holds all (or nearly all) of the certainty, and predicted distributions are compared against it. If we name the “ground-truth” probability distribution vectors $P$, then we can write entropy as

$$\mathbf{H}(P) = -\sum_{i \in \text{states}} P_i \cdot \log(P_i)$$
However, a cleaner notation will help down the line, so with the information content written as $I(P_i) = -\log(P_i)$, the same equation can be written in the following way

$$\mathbf{H}(P) = \sum_{i \in \text{states}} I(P_i) \cdot P_i$$
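As a concrete sketch of this sum (the NumPy helper and the use of the natural log are my own choices; nothing above fixes a log base), the entropy of a one-hot ground-truth vector works out to zero:

```python
import numpy as np

def entropy(p):
    """H(P) = sum_i I(P_i) * P_i, with I(P_i) = -log(P_i).

    Terms with P_i == 0 contribute nothing (0 * log 0 -> 0 by convention).
    """
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(np.sum(-np.log(p[nonzero]) * p[nonzero]))

# Ground-truth (one-hot) vector for the "bird" state from the example above.
P_bird = [1, 0, 0, 0]
print(entropy(P_bird))  # 0.0 -- complete certainty carries no entropy
```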
Let’s add to our example by saying we have an untrained model that attempts to predict the state-probability distributions $P$; call its predictions $Q$. The entropy of $Q$ can be calculated in the same manner,

$$\mathbf{H}(Q) = \sum_{i \in \text{states}} I(Q_i) \cdot Q_i$$
What we would like to do is find a way to compare these two distributions in order to use learning (i.e. gradient descent or otherwise) to update our model to output vectors closer to $P$. Notice that $\mathbf{H}(Q)$ weights each term by $Q_i$, which is the predicted probability of state $i$. This is not favorable since we want to weigh the information of the prediction, $I(Q_i)$, relative to the ground-truth probability vectors. The entropy of the predictions $Q$ with respect to $P$ is given the name Cross-Entropy and written as

$$\mathbf{H}(P \| Q) = \sum_{i \in \text{states}} I(Q_i) \cdot P_i$$
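A minimal sketch of this sum, reusing the same NumPy conventions (the helper name, the `eps` guard against $\log(0)$, and the example prediction vector are all mine):

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """H(P || Q) = sum_i I(Q_i) * P_i: the prediction's information
    content weighted by the ground-truth probabilities."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(-np.log(q + eps) * p))

P_bird = [1, 0, 0, 0]          # ground truth: this example is a bird
Q_bird = [0.4, 0.3, 0.2, 0.1]  # an untrained model's prediction (made up)
print(cross_entropy(P_bird, Q_bird))  # ~0.916, i.e. -log(0.4)
```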
The entropy in $\mathbf{H}(P)$ is at its lowest since our ground-truth state probability vectors have maximum certainty, and so we can take the entropy difference between the cross-entropy of the predictions, $\mathbf{H}(P \| Q)$, and $\mathbf{H}(P)$ as

$$\mathbf{H}(P \| Q) - \mathbf{H}(P) = \sum_{i \in \text{states}} \left( I(Q_i) - I(P_i) \right) \cdot P_i$$
The given name for this entropy expression is the Kullback-Leibler Divergence, $\mathbf{KL}(P \| Q) = \mathbf{H}(P \| Q) - \mathbf{H}(P)$, and it comes from the names of the mathematicians who pioneered this idea.
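Continuing the sketch (and reusing the `entropy` and `cross_entropy` helpers defined above), the KL divergence is just the difference of the two quantities:

```python
def kl_divergence(p, q):
    """KL(P || Q) = H(P || Q) - H(P): the extra information carried by
    the predictions beyond the ground truth's own entropy."""
    return cross_entropy(p, q) - entropy(p)

P_bird = [1, 0, 0, 0]          # one-hot ground truth, so H(P) = 0
Q_bird = [0.4, 0.3, 0.2, 0.1]  # made-up prediction
print(kl_divergence(P_bird, Q_bird))  # ~0.916, equal to the cross-entropy here
```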
A few properties of KL are the following:
- $\mathbf{KL}(P \| Q) = \mathbf{H}(P \| Q) - \mathbf{H}(P) \geq 0$, since $\mathbf{H}(P)$ is proposed as the smallest amount of entropy of that system assuming $P$ to be the ground truth.
- $\mathbf{KL}(P \| Q) \neq \mathbf{KL}(Q \| P)$, since $\sum_{i \in \text{states}} I(Q_i) \cdot P_i \neq \sum_{i \in \text{states}} I(P_i) \cdot Q_i$ whenever $I(P_i) \neq I(Q_i)$. Conceptually, the information from the predicted values $Q$ relative to the ground-truth vectors $P$ is not the same as the information from the vectors $P$ (which contain the least entropy to begin with) relative to the predicted vectors $Q$; the latter almost makes no sense. A short numerical check follows this list.
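A quick numerical check of both properties, reusing the sketch functions above. The two distributions below are purely illustrative, and $P$ is kept soft (not one-hot) so that $\mathbf{KL}(Q \| P)$ stays finite:

```python
P = [0.7, 0.1, 0.1, 0.1]  # illustrative "ground truth", deliberately not one-hot
Q = [0.4, 0.3, 0.2, 0.1]  # illustrative prediction

print(kl_divergence(P, Q))  # ~0.21, non-negative as expected
print(kl_divergence(Q, P))  # ~0.24, a different value: KL is not symmetric
```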
Notes:
- Vectors labeled as ground truth have a single element equal to 1. Each element signifies a probability, and so the elements sum to 1. These single-nonzero-element vectors are colloquially named “One-Hot-Encoded” vectors.
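As a small illustration of this note (the state ordering and the helper name are my own assumptions), building such a one-hot vector might look like:

```python
import numpy as np

states = ["bird", "cat", "dog", "horse"]  # assumed ordering, bird first as in the example

def one_hot(label):
    """Return a ground-truth vector with a single 1 at the label's index."""
    vec = np.zeros(len(states))
    vec[states.index(label)] = 1.0
    return vec

print(one_hot("bird"))  # [1. 0. 0. 0.]
```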