Foundations of Linear Algebra: Kernel Trick

I. Introduction:

Main Idea:

The Kernel Trick is a way to compute inner products in a higher dimensional space without needing to explicitly map the data into that space. In essence, a kernel lets you find the dot product between points in a different space without ever finding the function that maps them into that space.

The kernel function can be thought of as the inner product between points after they have been mapped from the original space into a higher dimensional space.
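As a quick sanity check of this idea, here is a minimal sketch (assuming NumPy and one possible quadratic feature map \( \phi \) chosen purely for illustration) showing that the kernel \( k(x, y) = (x^T y)^2 \) returns exactly the inner product \( \langle \phi(x), \phi(y) \rangle \) without ever constructing the mapped vectors:

```python
import numpy as np

# One possible explicit feature map for 2-D inputs (chosen for illustration):
# phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]
def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

# The matching kernel never builds phi: k(x, y) = (x . y)^2
def k(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(y)))  # inner product computed in the mapped space
print(k(x, y))                 # same value, computed entirely in the original space
```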

The Kernel Trick (or Kernel Method) is often associated with Support Vector Machines (SVMs), which have long been known to be incredibly effective at classification tasks.

II. Theory Pt.1:

Definition:

The Kernel Trick finds the inner product of a different space without needing to find the function (or transformation) that spans that space. Let's say we have an input \( \mathbf{X} \in \mathbb{R}^D \). If we wanted to map this input into a different dimensional space, we could find a transformation \( \phi(\mathbf{X}) : \mathbb{R}^D \rightarrow \mathbb{R}^M \) that carries \( \mathbf{X} \) into that space. The challenge here is: what if we don't know the transformation \( \phi \)?
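To make \( \phi \) concrete, here is one small worked example (a quadratic map with \( D = 2 \) and \( M = 3 \); this particular choice is just for illustration):

$$ \phi\left(\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\right) = \begin{bmatrix} x_1^2 \\ \sqrt{2}\, x_1 x_2 \\ x_2^2 \end{bmatrix}, \qquad \langle \phi(\mathbf{x}), \phi(\mathbf{y}) \rangle = x_1^2 y_1^2 + 2 x_1 x_2 y_1 y_2 + x_2^2 y_2^2 = (\mathbf{x}^T \mathbf{y})^2 $$

Here the inner product in \( \mathbb{R}^3 \) can be computed as \( (\mathbf{x}^T \mathbf{y})^2 \) directly in \( \mathbb{R}^2 \), which is exactly what the kernel will exploit.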

A Look Back at Regression:

Let's begin our journey by stating the usual method of regression. If we have an input \( \mathbf{X} \), with weights \( \mathbf{W} \), predictions \( \hat{\mathbf{Y}} \) and labels \( \mathbf{Y} \), we can write,

\begin{align} \hat{\mathbf{Y}} &= \mathbf{X}\mathbf{W} \\ \hat{\mathbf{Y}} &= \mathbf{\phi(X)} \; \mathbf{W} \end{align}
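In code, the only difference between the two models is whether the weights multiply \( \mathbf{X} \) itself or the mapped features \( \phi(\mathbf{X}) \). A minimal sketch (NumPy, reusing the illustrative quadratic map; all names here are my own):

```python
import numpy as np

def phi(X):
    """Map each row of X (N x 2) to quadratic features (N x 3)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))       # inputs, one row per sample
W_lin = rng.normal(size=2)        # weights for the linear model
W_phi = rng.normal(size=3)        # weights for the feature-mapped model

Y_hat_linear = X @ W_lin          # Y_hat = X W
Y_hat_mapped = phi(X) @ W_phi     # Y_hat = phi(X) W
```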

To find the optimal parameters \( \mathbf{W}^* \) that fit \( \mathbf{Y} \), we minimize an objective function (or loss function). The objective function here is the residual (ordinary least squares) error between the label \( \mathbf{Y} \) and the prediction \( \hat{\mathbf{Y}} \), plus a regularization term. To find the minimum of this loss, we look for the point at which its gradient is zero. This can be written as follows:

\begin{align} J(W) &= (\mathbf{Y} - \phi \; \mathbf{W})^2 \; + \; \frac{\lambda}{2} \vert\vert {\mathbf{W}} \vert\vert^2 \end{align} \begin{equation} \frac{\partial{}}{\partial{\mathbf{W}}} J(\mathbf{W}) = 0 \end{equation} \begin{equation} \mathbf{W}^* = \underset{\mathbf{W}}{\mathrm{argmin}} \; J(\mathbf{W}) \end{equation}

Solving for \( \mathbf{W}^* \), we find the solution below:

\begin{align} \mathbf{W^*} &= (\mathbf{X^T}\mathbf{X} + \frac{\lambda}{2} \mathbf{I})^{-1}\mathbf{X}^T\mathbf{Y} \\ &= (\mathbf{\phi}^T\mathbf{\phi} + \frac{\lambda}{2} \mathbf{I})^{-1}\mathbf{\phi}^T\mathbf{Y} \end{align}
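This closed form is a one-liner in code. A sketch assuming NumPy (the \( \frac{\lambda}{2} \) factor matches the objective above; `ridge_weights` is just a name I chose):

```python
import numpy as np

def ridge_weights(Phi, Y, lam):
    """Closed-form W* = (Phi^T Phi + (lam/2) I)^{-1} Phi^T Y for the objective above."""
    M = Phi.shape[1]
    A = Phi.T @ Phi + (lam / 2.0) * np.eye(M)
    # Solve A W = Phi^T Y rather than explicitly inverting A (more stable).
    return np.linalg.solve(A, Phi.T @ Y)

# Toy usage with a random feature matrix standing in for phi(X).
rng = np.random.default_rng(0)
Phi = rng.normal(size=(20, 3))
Y = rng.normal(size=20)
W_star = ridge_weights(Phi, Y, lam=0.1)
```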

Problem:

One major issue with this solution is that it assumes we know the transformation \( \mathbf{\phi} \) that maps the data into the space in which we want to measure distances. To overcome this problem, we MUST make our solution independent of this transformation.

To do this, we look back at the objective function \( J(W) \) once again and rewrite the exact closed-form solution in a different way, writing \( \mathbf{\phi}_i \) for the \( i \)-th row of \( \mathbf{\phi} \) (that is, \( \phi(x_i)^T \)):

\begin{align} J(W) &= (\mathbf{Y} - \phi \; \mathbf{W})^2 \; + \; \frac{\lambda}{2} \vert\vert {\mathbf{W}} \vert\vert^2 \end{align} \begin{align} \mathbf{W^*} &= (\mathbf{\phi}^T\mathbf{\phi} + \frac{\lambda}{2} \mathbf{I})^{-1}\mathbf{\phi}^T\mathbf{Y} \end{align} \begin{align} J(W) &= \sum_{i=1}^N (\mathbf{Y}_i - \mathbf{\phi}_i \; \mathbf{W})^2 \; + \; \frac{\lambda}{2} \vert\vert {\mathbf{W}} \vert\vert^2 \end{align} \begin{align} \mathbf{W^*} &= \frac{2}{\lambda} \sum_{i=1}^N (\mathbf{Y}_i - \mathbf{\phi}_i \mathbf{W}^*) \: \mathbf{\phi}_i^T \\ &= \sum_{i=1}^N \alpha_i \: \mathbf{\phi}_i^T \\ &= \mathbf{\phi}^T \; \mathbf{\alpha} \end{align}

To make things more organized, let's just have:

\begin{align} \alpha_i &= \frac{2}{\lambda} (\mathbf{Y}_i - \mathbf{\phi}_i \mathbf{W}) \end{align}

Now that we have the optimal weights \( \mathbf{W} = \mathbf{\phi}^T \mathbf{\alpha} \), we can substitute this value back into our original objective function \( J(W) \), which becomes \( J(\alpha) \), differentiate with respect to \( \alpha \), and find the minimum. Again, to find the minimum of the objective, take the gradient and set it to zero.

\begin{align} J(W) &= (\mathbf{Y} - \phi \; \mathbf{W})^2 \; + \; \frac{\lambda}{2} \vert\vert {\mathbf{W}} \vert\vert^2\\ &= (\mathbf{Y} - \mathbf{\phi} \mathbf{W})^T (\mathbf{Y} - \mathbf{\phi} \mathbf{W}) + \frac{\lambda}{2} \mathbf{W}^T \mathbf{W} \\ &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T \mathbf{\phi}\mathbf{W} - \mathbf{W}^T\mathbf{\phi}^T\mathbf{Y} + \mathbf{W}^T \mathbf{\phi}^T \mathbf{\phi} \mathbf{W} + \frac{\lambda}{2} \mathbf{W}^T \mathbf{W}\\ &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T \mathbf{\phi}(\mathbf{\phi}^T\mathbf{\alpha}) - (\mathbf{\phi}^T\mathbf{\alpha})^T\mathbf{\phi}^T\mathbf{Y} + (\mathbf{\phi}^T\mathbf{\alpha})^T \mathbf{\phi}^T \mathbf{\phi} (\mathbf{\phi}^T\mathbf{\alpha}) + \frac{\lambda}{2} (\mathbf{\phi}^T\mathbf{\alpha})^T (\mathbf{\phi}^T\mathbf{\alpha})\\ &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T \mathbf{\phi}\mathbf{\phi}^T\mathbf{\alpha} - \mathbf{\alpha}^T\mathbf{\phi}\mathbf{\phi}^T\mathbf{Y} + \mathbf{\alpha}^T\mathbf{\phi} \mathbf{\phi}^T \mathbf{\phi} \mathbf{\phi}^T\mathbf{\alpha} + \frac{\lambda}{2} \mathbf{\alpha}^T\mathbf{\phi} \mathbf{\phi}^T\mathbf{\alpha} \end{align}

Notice that wherever \( \phi \) appears now, it appears only in the combination \( \mathbf{\phi} \mathbf{\phi}^T \). This expression is called the Kernel Matrix, or Gram Matrix, and is typically denoted \( K = \mathbf{\phi} \mathbf{\phi}^T \). This matrix has two important properties:

  • It is symmetric, so \( K = K^T \)
  • It is positive semi-definite, so \( \alpha^T K \alpha \geq 0 \) for any vector \( \alpha \)

These two properties are exactly the conditions required by Mercer's Theorem:

Mercer's Theorem:

A symmetric function \( K(x,y) \) can be expressed as an inner product $$ K(x,y) = \langle \phi(x), \phi(y) \rangle$$ for some \( \phi \) if and only if \( K(x,y) \) is positive semi-definite.

In other words, we can use a kernel function \( k \) that computes the inner product between two points in the space of \( \phi \) without needing to know \( \phi \) itself.

\begin{align} \phi \phi^T &= \begin{bmatrix} \langle \phi(x_1), \phi(x_1) \rangle & \dots & \langle \phi(x_1), \phi(x_n) \rangle \\ \vdots & \ddots & \vdots \\ \langle \phi(x_n), \phi(x_1) \rangle & \dots & \langle \phi(x_n), \phi(x_n) \rangle \end{bmatrix} \\ &= \begin{bmatrix} k(x_1,x_1) & \dots & k(x_1,x_n) \\ \vdots & \ddots & \vdots \\ k(x_n,x_1) & \dots & k(x_n,x_n) \end{bmatrix} \\ \end{align}
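A short sketch (NumPy, with the same illustrative quadratic map and kernel as before) that builds \( K \) both ways, element-wise from \( k(x_i, x_j) \) and explicitly as \( \phi \phi^T \), and checks the two properties listed above:

```python
import numpy as np

def phi(x):
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

def k(x, y):
    return np.dot(x, y) ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))

# Gram matrix built element-wise from the kernel function ...
K = np.array([[k(xi, xj) for xj in X] for xi in X])
# ... matches phi(X) phi(X)^T built from the explicit map.
Phi = np.array([phi(x) for x in X])
assert np.allclose(K, Phi @ Phi.T)

assert np.allclose(K, K.T)                      # symmetric
assert np.linalg.eigvalsh(K).min() >= -1e-8     # positive semi-definite (up to round-off)
```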

Since kernel functions only reflect the inner product in the feature space, they cannot tell us much about the space \( \phi \) maps into. Kernel functions are therefore named after the characteristics of the function itself rather than those of \( \phi \). However, the shape of the data can suggest that a certain kernel is a better fit (e.g. a polynomial decision boundary may be better captured by a Polynomial Kernel). Some common kernels:

\begin{align} \text{Linear: } K(X_i, X_j) &= X_i^T X_j \\ \text{Polynomial: } K(X_i, X_j) &= (X_i^T X_j + r)^d \\ \text{Gaussian: } K(X_i, X_j) &= \exp{\bigg(-\frac{\vert \vert X_i - X_j \vert \vert^2}{2 \sigma^2} \bigg)} \\ \text{Laplace: } K(X_i, X_j) &= \exp{\big(-\alpha \vert \vert X_i - X_j \vert \vert \big)} \end{align}
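These are straightforward to implement. A sketch of the four kernels above (NumPy; the default parameter values are arbitrary choices of mine):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, r=1.0, d=3):
    return (np.dot(xi, xj) + r) ** d

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma**2))

def laplace_kernel(xi, xj, alpha=1.0):
    return np.exp(-alpha * np.linalg.norm(xi - xj))
```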

Note:

Here we write \( \mathbf{\phi(X)} \) as just \( \mathbf{\phi} \) in order to keep the operations more readable.

III. Theory Pt.2:

Putting everything together:

The final step now is to take the objective function \( J(\alpha) \), written in terms of our Kernel matrix, and find its minimum by differentiating with respect to \( \alpha \):

\begin{align} J(\alpha) &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T K\mathbf{\alpha} - \mathbf{\alpha}^T K \mathbf{Y} + \mathbf{\alpha}^T K^2 \mathbf{\alpha} + \frac{\lambda}{2} \mathbf{\alpha}^T K \mathbf{\alpha} \\ &= \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T K\mathbf{\alpha} - \mathbf{Y}^T K \mathbf{\alpha} + \mathbf{\alpha}^T K^2 \mathbf{\alpha} + \frac{\lambda}{2} \mathbf{\alpha}^T K \mathbf{\alpha} \\ &= \mathbf{Y}^T\mathbf{Y} - 2 \mathbf{Y}^T K\mathbf{\alpha} + \mathbf{\alpha}^T K^2 \mathbf{\alpha} + \frac{\lambda}{2} \mathbf{\alpha}^T K \mathbf{\alpha} \end{align} Setting the gradient to zero and cancelling a common factor of \( K \) (assuming \( K \) is invertible), \begin{align} \frac{\partial}{\partial{\alpha}} J(\alpha) &= - 2 \mathbf{Y}^T K + 2 \mathbf{\alpha}^T K^2 + \lambda \mathbf{\alpha}^T K = 0 \\ - 2 \mathbf{Y}^T + 2 \mathbf{\alpha}^T K + \lambda \mathbf{\alpha}^T &= 0 \end{align} \begin{align} 2 \mathbf{\alpha}^T K + \lambda \mathbf{\alpha}^T &= 2 \mathbf{Y}^T \\ \mathbf{\alpha}^T (2K + \lambda \mathbf{I}) &= 2 \mathbf{Y}^T \\ \mathbf{\alpha}^T (K + \frac{\lambda}{2} \mathbf{I}) &= \mathbf{Y}^T \end{align} \begin{align} \mathbf{\alpha}^T &= \mathbf{Y}^T (K + \frac{\lambda}{2} \mathbf{I})^{-1} \\ \mathbf{\alpha^*} &= (K + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{Y} \end{align}
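In code, the dual solution is a single linear solve against the Gram matrix. A minimal kernel-ridge "fit" sketch (NumPy; the Gaussian kernel and the name `fit_alpha` are my own choices, and the \( \frac{\lambda}{2} \) matches the derivation above):

```python
import numpy as np

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma**2))

def fit_alpha(X, Y, lam, kernel=gaussian_kernel):
    """alpha* = (K + (lam/2) I)^{-1} Y, with K built from pairwise kernel evaluations."""
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return np.linalg.solve(K + (lam / 2.0) * np.eye(len(X)), Y)
```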

Finally, remembering that \( \mathbf{W} = \mathbf{\phi^T} \mathbf{\alpha} \), we have a value for \( \mathbf{W} \) we can use during prediction,

\begin{align} \mathbf{W} &= (\mathbf{\phi^T}\mathbf{\phi} + \frac{\lambda}{2} \mathbf{I})^{-1}\mathbf{\phi}^T\mathbf{Y} \end{align} \begin{align} \mathbf{W} &= \mathbf{\phi}^T \mathbf{\alpha} \\ &= \mathbf{\phi}^T (K + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{Y} \\ &= \mathbf{\phi}^T (\mathbf{\phi}\mathbf{\phi}^T + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{Y} \\ &= \mathbf{\phi}^T (K + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{Y} \end{align}
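The fact that the two expressions for \( \mathbf{W} \) agree is an instance of the matrix "push-through" identity \( \mathbf{\phi}^T(\mathbf{\phi}\mathbf{\phi}^T + \frac{\lambda}{2}\mathbf{I})^{-1} = (\mathbf{\phi}^T\mathbf{\phi} + \frac{\lambda}{2}\mathbf{I})^{-1}\mathbf{\phi}^T \), and it is easy to confirm numerically. A quick sketch assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(8, 3))   # stand-in for phi(X): N = 8 samples, M = 3 features
Y = rng.normal(size=8)
lam = 0.5

# Primal form: (Phi^T Phi + (lam/2) I)^{-1} Phi^T Y  (an M x M solve)
W_primal = np.linalg.solve(Phi.T @ Phi + (lam / 2) * np.eye(3), Phi.T @ Y)
# Dual form:   Phi^T (Phi Phi^T + (lam/2) I)^{-1} Y  (an N x N solve)
W_dual = Phi.T @ np.linalg.solve(Phi @ Phi.T + (lam / 2) * np.eye(8), Y)

assert np.allclose(W_primal, W_dual)   # same W, computed two different ways
```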

For prediction on a new, unseen input \( x \), we can rewrite everything in terms of kernel evaluations against the training points,

\begin{align} \hat{\mathbf{Y}}(x) &= \mathbf{\phi}(x)^T \; \mathbf{W} \\ &= \mathbf{\phi}(x)^T \mathbf{\phi^T} (K + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{Y} \\ &= \mathbf{Y}^T (K + \frac{\lambda}{2} \mathbf{I})^{-1} \mathbf{\phi} \, \mathbf{\phi}(x) \\ &= \mathbf{Y}^T (K + \frac{\lambda}{2} \mathbf{I})^{-1} K_x \end{align}

Here, \( K_x \) is simply the vector of kernel evaluations between every training point \( x_i \) seen previously and the new, unseen input \( x_{\text{new}} \). (In the second-to-last step we used the fact that the prediction is a scalar, so it equals its own transpose, together with the symmetry of \( K + \frac{\lambda}{2} \mathbf{I} \).)

\begin{align} K_x = \mathbf{\phi} \, \phi(x_{\text{new}}) &= \begin{bmatrix} k(x_1,x_{\text{new}}) \\ \vdots \\ k(x_n,x_{\text{new}}) \end{bmatrix} \end{align}
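Prediction therefore never touches \( \phi \) at all: we build \( K_x \) from kernel evaluations against the training points and combine it with \( \alpha^* \). A sketch continuing the earlier `fit_alpha` example (these names are mine, not a standard API):

```python
import numpy as np

def predict(x_new, X_train, alpha, kernel):
    """y_hat(x_new) = K_x^T alpha, where K_x[i] = k(x_i, x_new)."""
    K_x = np.array([kernel(xi, x_new) for xi in X_train])
    return K_x @ alpha

# Usage (assuming fit_alpha and gaussian_kernel from the earlier sketch):
#   alpha = fit_alpha(X_train, Y_train, lam=0.1, kernel=gaussian_kernel)
#   y_hat = predict(x_new, X_train, alpha, gaussian_kernel)
```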

Note:

Here we write \( \mathbf{\phi(X)} \) as just \( \mathbf{\phi} \) in order to keep the operations more readable.