Foundations of Probability: Generative Adversarial Networks (GANs)

I. Introduction:

Main Idea:

The general idea of a Generative Adversarial Network (GAN) can be described as two neural networks pitted against each other. The two networks play different roles: one is a generator and the other is a discriminator. In more detail:

Two neural networks, where a generator network G attempts to estimate the data distribution X and generate samples from it, while a discriminator network D attempts to estimate whether a given sample came from G or from the original data X.

Within the history of deep learning, the most prominent successes have stemmed from discriminative networks. According to Goodfellow, this is because generative networks have had "difficulty of approximating many intractable probabilistic computations that arise in maximum likelihood estimation and related strategies, and due to difficulty of leveraging the benefits of piecewise linear units in the generative context."

To mitigate these issues, Goodfellow proposes an "Adversarial Nets Framework", in which a "generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution."

This setup is reminiscent of game theory: when the two networks reach a Nash equilibrium, the generator G produces perfect samples from the training data distribution, and the discriminator D is completely uncertain whether a given sample is real or fake (e.g., D assigns a generated sample a 50% probability of being real).

II. Theory Pt.1:

Discriminator:

Given the output of the discriminator D, we can use a binary cross-entropy loss, as usual for binary classifiers.

\begin{equation} J_D = - [t \log(D(x)) + (1-t) \log(1-D(x))] \end{equation}

We can simplify the loss function because we know which inputs were generated, \( \hat{x} \), and which came from the original data distribution, \( x \).

\begin{equation} J_D = - [\log(D(x)) + \log(1-D(\hat{x}))] \end{equation}
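As a sanity check, the labeled form of the loss can be computed directly. A minimal sketch, assuming the discriminator outputs a probability-of-real in (0, 1); the function name and toy probabilities are illustrative, not from the paper:

```python
import math

def discriminator_loss(d_real, d_fake):
    # J_D = -[log D(x) + log(1 - D(x_hat))], i.e. the binary
    # cross-entropy with t = 1 for real inputs and t = 0 for fakes.
    return -(math.log(d_real) + math.log(1.0 - d_fake))

# A confident, correct discriminator incurs a small loss; a fooled one
# incurs a large loss.
assert discriminator_loss(0.9, 0.1) < discriminator_loss(0.1, 0.9)
```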

We can rewrite this in terms of expectation values as,

\begin{align} J_D &= - [t \log(D(x)) + (1-t) \log(1-D(x))] \\ &= -\mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] - \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(\hat{x}))} \big ] \end{align}

We know that the generated samples come from the generator, \( \hat{x} = G(z) \), so we have,

\begin{align} J_D &= -\mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] - \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(\hat{x}))} \big ] \\ &= -\mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] - \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}

Finally, since the discriminator is trying to tell real samples from fake ones, it seeks to reduce its error; minimizing \( J_D \) over D is equivalent to maximizing the sum of expectations.

\begin{align} \max_{D} \; \mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] + \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}
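In practice the two expectations are estimated from minibatches. A hedged sketch of the Monte Carlo estimate; the function name and toy numbers are illustrative:

```python
import math

def d_value(d_on_real, d_on_fake):
    # Minibatch estimate of E[log D(x)] + E[log(1 - D(G(z)))], the
    # quantity the discriminator ascends. Inputs are the discriminator's
    # probability-of-real outputs on real and generated samples.
    real_term = sum(math.log(p) for p in d_on_real) / len(d_on_real)
    fake_term = sum(math.log(1 - p) for p in d_on_fake) / len(d_on_fake)
    return real_term + fake_term

# A sharper discriminator attains a higher value:
assert d_value([0.9, 0.8], [0.1, 0.2]) > d_value([0.6, 0.5], [0.4, 0.5])
```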

Generator:

We know from the description of a GAN that the generator G is attempting to fool the discriminator into believing the generated samples are real, so its loss is the negation of the discriminator's.

\begin{equation} J_G = - J_D \end{equation}

Expanding \( -J_D \) using the expression derived above, and noting that G seeks to minimize its own loss, we have,

\begin{align} J_G &= \min_{G} -J_D \\ &= \min_{G} \mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] + \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}
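The generator's side can be sketched the same way; only the fake-sample term depends on G, since the real-sample expectation involves no generator parameters. The function name and toy probability-of-real outputs below are illustrative:

```python
import math

def g_loss(d_on_fake):
    # Minibatch estimate of E[log(1 - D(G(z)))], the term of J_G that
    # actually depends on the generator.
    return sum(math.log(1 - p) for p in d_on_fake) / len(d_on_fake)

# When G fools D (D(G(z)) near 1), the generator's loss is driven down:
assert g_loss([0.9, 0.95]) < g_loss([0.1, 0.05])
```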

Discriminator-Generator:

Together, these form a minimax loss function.

\begin{align} \min_{G} \max_{D} V(D, G) &= \min_{G} \max_{D} \; \mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] + \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}
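For a fixed generator, Goodfellow's paper derives the optimal discriminator \( D^*(x) = \frac{p(x)}{p(x) + p_g(x)} \). At the Nash equilibrium the generator matches the data distribution (\( p_g = p \)), so \( D^* = 1/2 \) everywhere and the value of the game collapses to \( \log\frac{1}{2} + \log\frac{1}{2} = -\log 4 \). A small numeric check; the helper name and toy density values are illustrative:

```python
import math

def d_star(p_x, pg_x):
    # Optimal discriminator for fixed G: D*(x) = p(x) / (p(x) + p_g(x)).
    return p_x / (p_x + pg_x)

# When p_g = p, the optimal discriminator is maximally uncertain ...
assert d_star(0.3, 0.3) == 0.5
# ... and the value of the minimax game is -log 4.
equilibrium_value = math.log(0.5) + math.log(1 - d_star(0.3, 0.3))
assert abs(equilibrium_value + math.log(4)) < 1e-12
```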

Recall:

Binary Cross-Entropy decomposes into entropy plus KL divergence, \begin{align} H(p,q) &= H(p) + D_{KL}(p \vert \vert q) \\ &= \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{1}{p(x)}} \big ] + \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{p(x)}{q(x)}} \big ] \\ &= - [t \log(q(x)) + (1-t) \log(1-q(x))] \end{align}

We can simplify and use J to signify this loss function. \begin{align} J(q) &= \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{1}{p(x)}} \big ] + \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{p(x)}{q(x)}} \big ] \\ &= \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{1}{p(x)} \frac{p(x)}{q(x)}} \big ] \\ &= \mathbb{E}_{X \sim p(x)} \big [ \log{\frac{1}{q(x)}} \big ] \\ &= - \mathbb{E}_{X \sim p(x)} \big [ \log{q(x)} \big ] \\ &= - \sum_x p(x) \log(q(x)) \\ &= \!\begin{aligned}[t] & - [p(x=\mathit{label_1}) \log(q(x=\mathit{label_1})) + \\ & p(x=\mathit{label_2}) \log(q(x=\mathit{label_2}))] \end{aligned} \\ \end{align} Since the probability of a label given the observed data is 1, and we are dealing with binary labels, we only need to check whether a sample carries the first label; otherwise it must carry the second. We'll call the value of the first label t, for 'target'. \begin{align} J(q) &= - [t \log(q(x)) + (1-t) \log(1-q(x))] \end{align}
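The decomposition above can be verified numerically for a pair of Bernoulli distributions. A small sketch; the function names and parameter values are illustrative:

```python
import math

# Entropy, KL divergence, and cross-entropy for Bernoulli distributions
# with probability-of-success parameters t and q.
def entropy(t):
    return -(t * math.log(t) + (1 - t) * math.log(1 - t))

def kl(t, q):
    return t * math.log(t / q) + (1 - t) * math.log((1 - t) / (1 - q))

def cross_entropy(t, q):
    return -(t * math.log(q) + (1 - t) * math.log(1 - q))

# Check H(p, q) = H(p) + D_KL(p || q):
t, q = 0.7, 0.4
assert abs(cross_entropy(t, q) - (entropy(t) + kl(t, q))) < 1e-12
```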

Note:

Another way to interpret \( D(x) \) and \( D(\hat{x}) \) is:

  • \( D(x) \) is the probability the discriminator assigns to a real image being real, \( P(\text{real} \mid \text{real image}) \).
  • \( D(\hat{x}) \) is the probability the discriminator assigns to a fake image being real, \( P(\text{real} \mid \text{fake image}) \); equivalently, \( 1 - D(\hat{x}) \) is the probability of correctly identifying the fake.


III. Theory Pt.2:

Problem:

You may have noticed that differentiating the generator's objective with respect to the generator's parameters \( \theta_G \) behaves oddly.

\begin{align} \frac{\partial{J_G}}{\partial{\theta_G}} &= \frac{\partial}{\partial{\theta_G}} \mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] + \frac{\partial}{\partial{\theta_G}} \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \\ &= 0 + \frac{\partial}{\partial{\theta_G}} \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}

Since the parameter update is proportional to this gradient, learning is very slow during the early epochs: if D becomes much better than G, then \( D(G(z)) \rightarrow 0 \). In that region the slope of \( \log(1-D(G(z))) \) is at its smallest, so G receives almost no gradient and has little chance to learn.
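This saturation is easy to see numerically if we parameterize the fake-sample term by the discriminator's pre-sigmoid score, so that \( D(G(z)) = \sigma(s) \); the gradient of \( \log(1-\sigma(s)) \) with respect to s is \( -\sigma(s) \), which vanishes exactly when D confidently rejects the fake. A toy sketch; the names are illustrative:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def saturating_grad(s):
    # d/ds of log(1 - sigmoid(s)) simplifies to -sigmoid(s).
    return -sigmoid(s)

# D(G(z)) ~ 0 (fake easily rejected): almost no learning signal.
assert abs(saturating_grad(-10.0)) < 1e-4
# D uncertain (D(G(z)) = 1/2): healthy learning signal.
assert abs(saturating_grad(0.0)) == 0.5
```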

Solution:

We can use a different cost function that preserves the behavior of the original. Let's recall the loss function for G again.

\begin{align} J_G &= \min_{G} \mathbb{E}_{X \sim p(x)} \big [ \log{D(x)} \big ] + \mathbb{E}_{Z \sim p(z)} \big [ \log{(1-D(G(z)))} \big ] \end{align}

  • The first term is the expectation of D predicting that an input image is real given a real image. It has no dependence on the parameters of G, so differentiating with respect to them yields zero; this term drops out, as we have seen.
  • The second term asks G to minimize the expectation of D predicting that an input image is fake given a fake image generated by G. Here is where we can rephrase while keeping the essence intact: maximize the expectation of D predicting that an input image is real given a fake image generated by G.

This solution is often named the Non-Saturating Heuristic. Non-saturating, in that the gradient does not saturate, i.e., flatten out toward zero, where the generator performs worst; this is good, because we want the gradient to be large when the error is large. Heuristic, because this is not a theoretically derived solution but one justified empirically.

Notice also that this solution abandons the game-theoretic analysis, since we no longer have a zero-sum game: the losses of D and G no longer sum to zero, \( J_D + J_G \neq 0 \).

\begin{align} J_G &= \max_{G} \mathbb{E}_{Z \sim p(z)} \big [ \log{(D(G(z)))} \big ] \\ &= \min_{G} -\mathbb{E}_{Z \sim p(z)} \big [ \log{(D(G(z)))} \big ] \end{align}
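Comparing the two gradients at the same operating point makes the benefit concrete. With \( D(G(z)) = \sigma(s) \), the non-saturating objective \( -\log\sigma(s) \) has derivative \( -(1-\sigma(s)) \) with respect to s, which approaches \(-1\) precisely where the original loss goes flat. A toy sketch; the names are illustrative:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def nonsaturating_grad(s):
    # d/ds of -log(sigmoid(s)) simplifies to -(1 - sigmoid(s)).
    return -(1.0 - sigmoid(s))

def saturating_grad(s):
    # d/ds of log(1 - sigmoid(s)) simplifies to -sigmoid(s), for contrast.
    return -sigmoid(s)

s = -10.0  # an easily rejected fake: D(G(z)) near 0
assert abs(nonsaturating_grad(s)) > 0.99   # strong push on bad fakes
assert abs(saturating_grad(s)) < 1e-4      # original loss: nearly flat
```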

IV. Analysis:

Model Types:

One of the more prevalent GAN variants is the Deep Convolutional GAN (DCGAN). Its name stems from the fact that its architecture draws on the All-Convolutional Network, a simplified design relative to classic CNNs such as LeNet; rather than alternating the convolutional and pooling layers of LeNet, the All-Convolutional Network uses only convolutional layers, relying on strided convolutions (stride greater than 1) to reduce spatial dimensionality at each layer.
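The dimensionality arithmetic behind replacing pooling with strided convolutions follows the standard output-size formula. A small sketch; the function name and layer sizes are illustrative:

```python
def conv_out(size, kernel, stride, pad=0):
    # Spatial size of a convolution output (no dilation):
    # out = floor((in + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

# A stride-2 convolution halves the spatial size, doing the
# downsampling job a pooling layer would otherwise do:
assert conv_out(64, kernel=4, stride=2, pad=1) == 32
# A stride-1 convolution with matching padding preserves it:
assert conv_out(64, kernel=3, stride=1, pad=1) == 64
```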