Techniques improving the performance of Generative Adversarial Networks (GANs)

Kavita Anant
12 min read · Dec 1, 2020


A summary of some of the ground-breaking techniques that have contributed most to improving the performance of GANs.

In recent years, GANs have gained a lot of attention owing to their wide range of applications. They have astonished everyone with their ability to generate fake images that look extremely real, which has generated a lot of curiosity among researchers. While we explore the possible applications of GANs, it is also interesting to look into ways to improve their performance. The primary objective of this article is to understand the various techniques that have been applied to GANs since their inception in 2014 to boost their performance.

So let’s start with the most obvious question:

What is a GAN?

Basic Concept of GANs¹⁰

Generative Adversarial Networks, or GANs, are a class of neural network architectures that learn to generate data, such as images, by playing a zero-sum game. A GAN is based on the interaction of two major blocks: a generator and a discriminator. The generator maps random noise to fake images, and the discriminator is fed a mix of real images (data) and these fake images and has to distinguish between them. Essentially, a GAN involves adversarial training of the generator and the discriminator, where the goal of the generator is to maximize the probability of the discriminator making a mistake in identifying a fake image. The model has converged when neither the generator nor the discriminator can reduce its loss any further by changing only its own parameters.

The concept was introduced by Ian Goodfellow and his team in 2014 in their paper “Generative Adversarial Nets”.
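
For reference, the original paper writes this adversarial game as a minimax problem over a value function V(D, G), where D(x) is the discriminator's estimated probability that x is real, G(z) is the generator's output for noise z, p_data is the data distribution and p_z is the noise prior:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```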

Today, GANs have become so powerful that they can generate live portraits from still images:

Live portraits from still images

I. Improved Techniques for Training GANs

GAN block diagram: Nash Equilibrium

GANs are designed to reach a Nash equilibrium⁵, a point at which no player can reduce its own cost by changing only its own parameters. One of the major problems in GANs is achieving this convergence. Algorithms that are feasible for the GAN game, with its non-convex cost functions, continuous parameters and extremely high-dimensional parameter space, are still not known.

‘Improved Techniques for Training GANs’ proposes the following architectural features and training procedures for the GAN framework:

  1. Feature matching
  2. Minibatch discrimination
  3. Historical averaging
  4. One-sided label smoothing
  5. Virtual batch normalization

These techniques are heuristically motivated to encourage convergence.

1. Feature matching:

Instead of training the generator to directly fool the discriminator, feature matching trains it to generate data that matches the statistics of the real data. The discriminator is used only to specify which statistics are worth matching. Feature matching has been shown empirically to be effective in situations where a regular GAN becomes unstable. The following is the objective function of feature matching. Here f(x) denotes the activations on an intermediate layer of the discriminator for an input x, and G(z) is the data generated by the generator from noise z.

Objective function for Feature Matching¹
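
As a rough illustration, the feature matching objective from the paper, ||E f(x) − E f(G(z))||² (squared L2 norm), can be estimated on a batch as below. This is a minimal NumPy sketch, not the authors' code: the function name is hypothetical, and the feature arrays are assumed to come from an intermediate discriminator layer.

```python
import numpy as np

def feature_matching_loss(real_features, fake_features):
    """Squared L2 distance between batch-mean discriminator features.

    real_features: (batch, dim) array holding f(x) for real samples.
    fake_features: (batch, dim) array holding f(G(z)) for generated samples.
    """
    real_mean = real_features.mean(axis=0)  # batch estimate of E[f(x)]
    fake_mean = fake_features.mean(axis=0)  # batch estimate of E[f(G(z))]
    return float(np.sum((real_mean - fake_mean) ** 2))

# Toy features standing in for discriminator activations (illustrative values).
real = np.array([[1.0, 2.0], [3.0, 4.0]])
fake = np.zeros((2, 2))
print(feature_matching_loss(real, fake))  # 13.0
```

The generator is trained to minimize this quantity, while the discriminator is still trained in the usual way.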

2. Minibatch discrimination:

Since the discriminator processes each example independently, there is no coordination between its gradients, and therefore no mechanism to tell the generator that its outputs need to become more dissimilar to each other. The discriminator is unaware that the generator is producing the same sample every time, so the generator can collapse to emitting outputs from a single point.

Thus, if the discriminator looks at multiple examples in combination rather than individually, the collapse of the generator can be avoided. This is the concept of minibatch discrimination. Basically, it models the closeness between the samples in a minibatch, so that the discriminator still classifies single images but uses the rest of the minibatch as side information.

Minibatch discrimination: Features f(xi) from sample xi are multiplied through a tensor T, and cross-sample distance is computed.
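
The computation described in the caption can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function name and tensor sizes are made up, the closeness between two samples is taken to be the negative exponential of the L1 distance between their projected rows, and the sum over the batch includes each sample's own term.

```python
import numpy as np

def minibatch_features(f, T):
    """Compute per-sample closeness features from a whole minibatch.

    f: (n, A) intermediate discriminator features, one row per sample.
    T: (A, B, C) learned tensor (random here, for illustration only).
    Returns an (n, B) array o, where o[i, b] sums the closeness of
    sample i to every sample j in the batch (including j == i).
    """
    M = np.tensordot(f, T, axes=([1], [0]))                     # (n, B, C)
    # L1 distance between row b of M_i and row b of M_j for every pair (i, j).
    l1 = np.abs(M[:, None, :, :] - M[None, :, :, :]).sum(axis=3)  # (n, n, B)
    closeness = np.exp(-l1)                                     # (n, n, B)
    return closeness.sum(axis=1)                                # (n, B)

rng = np.random.default_rng(0)
T = rng.standard_normal((3, 5, 2))

# A collapsed batch: every sample has identical features.
collapsed = np.ones((4, 3))
o = minibatch_features(collapsed, T)

# The output is concatenated with f and passed to the next discriminator layer.
augmented = np.concatenate([collapsed, o], axis=1)
print(augmented.shape)
```

For a collapsed batch every pairwise distance is zero, so each closeness feature equals the batch size. That is exactly the kind of signal the discriminator can use to detect that the generator has stopped producing varied samples.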

According to the authors, minibatch discrimination generates visually appealing samples very quickly, and in this regard it is superior to feature matching.

3. Historical averaging:

As the name suggests, in this technique each player’s cost includes a term that penalizes the distance between the current parameters and their historical average. The historical average of the parameters can be updated in an online fashion, so the learning rule scales well to long time series. This approach was able to find equilibria of low-dimensional, continuous non-convex games. The historical averaging term is given by:

Historical Averaging term¹

Here θ[i] is the value of the parameters at past time i.
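
A minimal sketch of how the term ||θ − (1/t) Σ θ[i]||² could be tracked online, assuming the parameters are flattened into a single NumPy vector. The class and method names are hypothetical.

```python
import numpy as np

class HistoricalAverage:
    """Keeps a running mean of past parameter vectors without storing them."""

    def __init__(self, dim):
        self.mean = np.zeros(dim)
        self.count = 0

    def update(self, theta):
        # Incremental mean: new_mean = old_mean + (theta - old_mean) / t
        self.count += 1
        self.mean += (theta - self.mean) / self.count

    def penalty(self, theta):
        # Squared distance between current parameters and their history.
        return float(np.sum((theta - self.mean) ** 2))

history = HistoricalAverage(dim=2)
history.update(np.array([0.0, 0.0]))
history.update(np.array([2.0, 2.0]))
print(history.mean)                            # [1. 1.]
print(history.penalty(np.array([2.0, 2.0])))   # 2.0
```

The penalty would be added to each player's cost, so that parameters which oscillate far from their own history are pulled back towards it.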

4. One-sided label smoothing:

This technique uses traditional label smoothing, shown below, with one change: only the positive labels are smoothed to α, while the negative labels are left at 0. With two-sided smoothing, p_model appears in the numerator of the optimal discriminator, so in regions where p_data is approximately 0 and p_model is large, erroneous samples from p_model have no incentive to move nearer to the data.

Traditional label smoothing by Szegedy et al., where positive targets are replaced with α (e.g. 0.9) and negative targets with β (e.g. 0.1)
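
In code, one-sided label smoothing only changes the targets used in the discriminator's cross-entropy loss. The sketch below assumes α = 0.9, a commonly used value rather than one given in this article, and uses hypothetical helper names.

```python
import numpy as np

def discriminator_targets(n_real, n_fake, alpha=0.9):
    """Real samples get the smoothed target alpha; fake samples stay at 0."""
    return np.concatenate([np.full(n_real, alpha), np.zeros(n_fake)])

def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and targets."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

targets = discriminator_targets(n_real=2, n_fake=3)
print(targets)  # [0.9 0.9 0.  0.  0. ]

# With smoothed targets, an over-confident prediction on real data costs more
# than a prediction that matches the target.
real_targets = targets[:2]
calibrated = binary_cross_entropy(np.array([0.9, 0.9]), real_targets)
overconfident = binary_cross_entropy(np.array([0.99, 0.99]), real_targets)
print(calibrated < overconfident)  # True
```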

5. Virtual batch normalization:

In this technique each example x is normalized based on the statistics collected on a reference batch of examples that are chosen once and fixed at the start of training, and on x itself. The reference batch is normalized using only its own statistics. This helps solve the problem with Batch Normalization where the output of a neural network for an input example x is highly dependent on several other inputs x’ in the same minibatch.

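Note: Virtual Batch Normalization is computationally expensive because it requires running forward propagation on two minibatches of data, so the authors use it only in the generator network.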
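
The idea can be sketched for a single feature vector as follows. This is a simplified illustration, not the authors' implementation: the statistics are taken as the plain mean and variance over the reference batch plus x, and the learned scale and shift of a real normalization layer are omitted. The function name is hypothetical.

```python
import numpy as np

def virtual_batch_norm(x, reference_batch, eps=1e-5):
    """Normalize one example with reference-batch statistics plus x itself.

    x: (dim,) activations of the example being normalized.
    reference_batch: (n, dim) activations of the fixed reference batch.
    """
    combined = np.vstack([reference_batch, x[None, :]])  # (n + 1, dim)
    mean = combined.mean(axis=0)
    var = combined.var(axis=0)
    return (x - mean) / np.sqrt(var + eps)

reference = np.array([[0.0, 0.0], [2.0, 2.0]])  # fixed once, before training
x = np.array([1.0, 1.0])
normalized = virtual_batch_norm(x, reference)
print(normalized)
```

Because the statistics depend only on the fixed reference batch and on x, the result for x no longer changes with whichever other examples happen to share its minibatch.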

Experiments and Results:

The authors applied different combinations of the above techniques to two major applications of GANs: (A) semi-supervised learning and (B) the generation of images that humans find visually realistic, using MNIST, CIFAR-10 and SVHN.

(i) MNIST

(Left) Samples generated without minibatch discrimination (clearly distinguishable from MNIST dataset images). (Right) Samples generated with minibatch discrimination (indistinguishable from MNIST dataset images)¹.
Number of incorrectly classified test examples for the semi-supervised setting on permutation invariant MNIST¹

(ii) CIFAR-10

Samples generated on CIFAR-10 with feature matching (left) and minibatch discrimination (right)¹
Test error on semi-supervised CIFAR-10¹

(iii) SVHN

Samples from the generator for SVHN¹
Error rate on SVHN¹

Various combinations of the five techniques were also used to generate images, and the Inception scores of the resulting samples were compared.

Inception scores for samples generated by various models for 50,000 images. (VBN: Virtual Batch Normalization, HA: Historical Averaging, LS: Label Smoothing, MBF: Minibatch Features)

The authors have provided the code for the above techniques here: https://github.com/openai/improved_gan

Despite the above-mentioned techniques, GANs still suffered from a tradeoff between the size and the quality of the generated images. To address this, ‘Large Scale GAN Training for High Fidelity Natural Image Synthesis’ was published in 2018.

II. Large Scale GAN Training for High Fidelity Natural Image Synthesis

Successful generation of high-resolution, diverse samples from complex datasets remained a distant goal until the introduction of BigGANs. BigGANs, or large-scale GANs, are GANs trained at a much larger scale than before in order to generate images that are both high-resolution and of high fidelity, while covering a large variety. Scaling up GAN training leverages the performance benefits of huge models and huge batches.

HOW?

With larger batches, each batch covers more modes, providing better gradients for both networks. This helps in the generation of large and high-fidelity images.

Scaling up GANs:

The authors of ‘Large Scale GAN Training for High Fidelity Natural Image Synthesis’ have proposed the following ways to scale up GANs:

-As a baseline, the Self-Attention GAN⁶ (SA-GAN) architecture, which uses the hinge loss GAN objective, is employed.

-Class information is provided to the Generator (G) with class-conditional Batch Normalization and to the Discriminator (D) with projection.

-Spectral Normalization⁷ is applied in G, the learning rates are halved, and two D steps are taken per G step.

-For evaluation, moving averages of G’s weights are used, with a decay of 0.9999.

-Orthogonal Initialization³ is used as the initialization technique.

-The batch size is increased by a factor of 8 (improved IS by 46%) and the width of each layer, that is its number of channels, by 50% (improved IS by a further 21%).

-A shared class embedding is used, which is linearly projected to each layer’s gains and biases. This reduces computation and memory costs and improves training speed (by 37%).

-Direct skip connections⁴ are added from the noise vector to multiple layers of G rather than just the initial layer (improved performance by around 4% and training speed by a further 18%).
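
Two of the ingredients in the list above are simple enough to sketch in a few lines: the hinge loss objective and the moving average of G's weights. The snippet assumes the standard hinge formulation for GANs and treats weights as plain NumPy arrays. The function names are hypothetical and this is not the BigGAN code.

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss on raw (unbounded) discriminator outputs."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))

def g_hinge_loss(d_fake):
    """Generator loss: push discriminator outputs on fake samples upwards."""
    return float(-np.mean(d_fake))

def ema_update(avg_weights, weights, decay=0.9999):
    """One step of the exponential moving average of G's weights."""
    return decay * avg_weights + (1.0 - decay) * weights

# D is already confident and correct by a margin of 1, so its loss is zero.
print(d_hinge_loss(np.array([2.0]), np.array([-2.0])))  # 0.0
# D is undecided on both real and fake samples.
print(d_hinge_loss(np.array([0.0]), np.array([0.0])))   # 2.0
print(g_hinge_loss(np.array([1.0, 3.0])))               # -2.0

# The averaged copy of the weights is what gets evaluated, not the raw G.
avg = ema_update(np.array([1.0]), np.array([0.0]), decay=0.5)
print(avg)  # [0.5]
```

With a decay of 0.9999 the averaged weights change very slowly, which smooths out the step-to-step noise of adversarial training.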

GANs benefit greatly from the above scaling. The models were trained with two to four times as many parameters and eight times the batch size compared to the prior art. Although this gives a dramatic improvement in performance, scaling up has some adverse side effects on the model and the generated images.

Trading off Variety and Fidelity with Truncation Trick

Unlike models that need to backpropagate through their latents, GANs can employ an arbitrary prior p(z), so the authors explored various distributions other than the usual Normal and Uniform distributions for drawing the noise vector ‘z’. They found that sample quality improves when a model trained with z ∼ N(0, I) is sampled with a truncated z-vector, which they call the ‘Truncation Trick’.

Truncation Trick: Truncating a z vector by resampling the values with magnitude above a chosen threshold leads to an improvement in individual sample quality at the cost of a reduction in overall sample variety. This technique allows fine-grained, post-hoc selection of the trade-off between sample quality and variety for a given G.
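
A minimal sketch of the resampling step, assuming a standard normal prior and a strictly positive threshold. The function name and the example sizes are illustrative.

```python
import numpy as np

def truncated_noise(shape, threshold, rng):
    """Sample z ~ N(0, I), resampling entries whose magnitude exceeds threshold."""
    z = rng.standard_normal(shape)
    too_large = np.abs(z) > threshold
    while too_large.any():
        # Redraw only the offending entries and check them again.
        z[too_large] = rng.standard_normal(int(too_large.sum()))
        too_large = np.abs(z) > threshold
    return z

rng = np.random.default_rng(0)
z = truncated_noise((16, 128), threshold=0.5, rng=rng)
print(z.shape, float(np.abs(z).max()) <= 0.5)
```

A smaller threshold keeps z closer to the mode of the prior, which favors sample quality. A larger threshold restores variety. Since this happens only at sampling time, the same trained G can be used at any point on the trade-off.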

But some models (especially the larger ones) do not respond well to the truncation trick. For these, the authors use Orthogonal Regularization.

Orthogonal Regularization: Some larger models are not amenable to truncation and produce saturation artifacts in the output when fed truncated noise. To counteract this, G is conditioned to be smooth using Orthogonal Regularization, so that the full space of z maps to good output samples. It is represented by the formula below. Here, W is a weight matrix and β is a hyperparameter.
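
A minimal sketch of the penalty, assuming the relaxed variant described in the BigGAN paper, R_β(W) = β‖WᵀW ⊙ (1 − I)‖²_F, where 1 is a matrix of ones, so that only the off-diagonal entries of WᵀW are penalized. The function name and the default value of β are illustrative choices.

```python
import numpy as np

def orthogonal_penalty(W, beta=1e-4):
    """Penalize the off-diagonal entries of the Gram matrix W^T W."""
    gram = W.T @ W
    off_diagonal = gram * (1.0 - np.eye(gram.shape[0]))
    return float(beta * np.sum(off_diagonal ** 2))

# Orthogonal columns incur no penalty.
print(orthogonal_penalty(np.eye(3), beta=1.0))  # 0.0

# Correlated columns are penalized: W^T W = [[1, 1], [1, 2]].
W = np.array([[1.0, 1.0],
              [0.0, 1.0]])
print(orthogonal_penalty(W, beta=1.0))  # 2.0
```

Leaving the diagonal unpenalized means the columns of W are pushed to be mutually orthogonal without constraining their norms.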

Orthogonal Regularization
As the threshold is reduced, and elements of z are truncated towards zero (the mode of the latent distribution), individual samples approach the mode of G’s output distribution.

Despite these improvements, the models still undergo training collapse, which necessitates early stopping in practice. To analyze this, spectral analysis of G and D was performed.

Spectral Analysis:

Scaling up reveals instabilities that are specific to large-scale GANs: they appear in settings that are stable at small scale. Thus, to understand these instabilities, the analysis had to be performed directly at large scale.

Before we jump to the analysis of instabilities in the generator model, we have to understand the term ‘Spectral Normalization’.

Spectral Normalization: Spectral Normalization stabilizes training by dividing the weight matrix W of each layer by its spectral norm σ(W), which is the largest singular value of W. It was originally proposed to regularize each layer of the discriminator; the BigGAN baseline applies it to the weights in G as well.
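
The spectral norm is usually estimated with power iteration rather than a full singular value decomposition. Below is a minimal NumPy sketch under that assumption. The function name, the number of iterations and the example matrix are illustrative, and practical implementations typically run a single iteration per training step while reusing the vector u.

```python
import numpy as np

def spectral_norm(W, n_iters=50, rng=None):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(0) if rng is None else rng
    u = rng.standard_normal(W.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return float(u @ W @ v)

W = np.diag([3.0, 1.0])          # singular values are 3 and 1
sigma = spectral_norm(W)
W_normalized = W / sigma          # spectrally normalized weights

print(sigma)
print(np.linalg.norm(W_normalized, 2))  # largest singular value after scaling
```

After normalization the largest singular value of the layer is approximately 1, which bounds how much the layer can amplify its input.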

The details regarding the calculation of spectral norms can be found in the Spectral Normalization paper⁷. Now, we are good to proceed to understanding the instabilities in the BigGAN models.

Plot of the first singular value σ0 in the layers of G and D before Spectral Normalization. Most layers in G have well-behaved spectra, but without constraints a small subset grow throughout training and explode at collapse. D’s spectra are noisier but otherwise better behaved. Colors from red to violet indicate increasing depth².

Instabilities observed in the Generator Model:

As observed in figure (a) above, most G layers have well-behaved spectral norms, but some layers, like the first layer in G, which is over-complete and not convolutional, are ill-behaved. Their spectral norms grow throughout training and explode at collapse. On further analysis of this collapse, it is observed that while conditioning G might improve stability, it is insufficient to ensure it, as the explosion still occurs with or without Spectral Normalization.

Instabilities observed in the Discriminator Model:

Unlike G, the spectra of D are noisy, and the singular values grow throughout training but only jump at collapse instead of exploding. On further analysis it is observed that D’s loss approaches zero during training but undergoes a sharp upward jump at collapse, which suggests overfitting. D appears to be memorizing the training set, which is in line with its role: D’s job is not explicitly to generalize, but to distill the training data and provide a useful learning signal for G.

On performing spectral analysis on G and D, it was observed that stability does not come from G or D alone, but from their interaction through the adversarial training process. While the symptoms of their poor conditioning can be used to track and identify instability, ensuring reasonable conditioning proves necessary for training but insufficient to prevent eventual training collapse. It is possible to enforce stability by strongly constraining D, but doing so incurs a dramatic cost in performance. In practice, then, collapse is delayed rather than prevented: better final performance can be achieved by relaxing this conditioning and allowing collapse to occur at the later stages of training, by which time the model is sufficiently trained to achieve good results.

Using insights from this analysis, the instabilities can be reduced through a combination of novel and existing techniques. However, complete training stability can only be achieved at a high cost to performance.

Experiments and Results

Evaluation of models at different resolutions:

First, the FID⁹/IS⁸ values at the truncation setting that attains the best FID are reported.

Second, the FID at the truncation setting for which the model’s IS equals that of the real validation data is reported, on the reasoning that this is a passable measure of the maximum sample variety achievable while still reaching a good level of ‘objectness’.

Third, the FID at the maximum IS achieved by each model is reported, to demonstrate how much variety must be traded off to maximize quality.
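
For readers unfamiliar with the Inception Score (IS) used in these comparisons, it is the exponential of the average KL divergence between the per-image class distribution p(y|x) and the marginal class distribution p(y). The sketch below assumes the class probabilities have already been produced by a pretrained Inception classifier. The function name is hypothetical, and the usual practice of averaging the score over several splits is omitted.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (n_samples, n_classes) array whose rows are p(y | x)."""
    marginal = probs.mean(axis=0, keepdims=True)          # p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Every image gets the same uniform prediction: the lowest possible score.
uniform = np.full((4, 4), 0.25)
print(inception_score(uniform))  # approximately 1.0

# Each image is classified confidently and all classes are covered equally:
# the score approaches the number of classes.
confident = np.eye(4)
print(inception_score(confident))  # approximately 4.0
```

A high IS therefore requires both confident per-image predictions (quality, or ‘objectness’) and a spread over many classes (variety), which is why it moves in the opposite direction to variety when the truncation threshold is lowered.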

Previous best IS and FID values: IS 52.52, FID 18.65

IS and FID values by this paper (ImageNet at 128x128 resolution): IS 166.5, FID 7.4

The BigGAN model was also evaluated on the JFT-300M dataset at 256x256 resolution. The results are shown in the table below:

The FID and IS columns report these scores given by the JFT-300M-trained Inception v2 classifier with noise distributed as z ∼ N (0, I) (non-truncated). The (min FID) / IS and FID / (max IS) columns report scores at the best FID and IS from a sweep across truncated noise distributions ranging from σ = 0 to σ = 2. Images from the JFT-300M validation set have an IS of 50.88 and FID of 1.94.

Interesting developments in GANs:

In recent times, GANs have improved greatly from their first version, thanks to the techniques that have contributed to improving their performance. We now have many variants of GANs that are used in different applications. Some of the most interesting variants are as follows:

StackGANs¹¹: StackGANs are used to generate photo-realistic images from text descriptions.

Text to image using StackGANs¹²

StyleGANs¹¹: StyleGAN, developed by researchers at Nvidia, generates fake yet life-like human faces. Its official implementation relies on Nvidia’s CUDA software, GPUs and TensorFlow.

Fake human faces generated using StyleGANs

SinGANs¹¹: SinGANs are a variant of GAN that learns to generate images from a single training image.

This allows generating new samples of arbitrary size and aspect ratio, that have significant variability, yet maintain both the global structure and the fine textures of the training image¹¹

Results from training a SinGAN model on a single image¹¹

Conclusion

GANs are a promising class of generative models that had been held back from reaching their full potential by unstable training and the lack of a proper evaluation metric. The first paper proposed some heuristically motivated techniques to improve the performance and stability of GANs. These improvements gave rise to a variety of GANs, one of which is the BigGAN. GANs have benefitted greatly from scaling up, both in terms of fidelity and variety of the generated samples. With the measures suggested by the second paper, BigGANs have opened doors to a plethora of possibilities and applications using GANs.

Written by Kavita Anant

Graduate student at Columbia University|Pursuing MS in Electrical Engineering with a focus on Data Driven Analysis
