Deep Learning for Image and Video Super-Resolution


Mingle Blog Post
September 28, 2022
By François Bertholom

Introduction

This work was done as part of an internship at Mingle Sport between June and August 2022, and this blog post aims to present the results in a simple way, to make them accessible to a broad audience.
From a computer’s perspective, images are nothing more than collections of numbers, represented as a myriad of small squares called pixels. For each pixel, three numbers indicate the intensities of red, green, and blue at the corresponding spot in the picture. The more pixels are used to represent an image, the higher its resolution: more pixels means we can describe finer details.
There are different ways of degrading a high-resolution image into a low-resolution one, and the process can generally be described as a series of operations including optical or motion-related distortions, blurring, noise, and undersampling.
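To give a rough idea, here is a toy degradation pipeline in Python using Pillow and NumPy. The blur radius, noise level, and scale factor are arbitrary illustrative choices, not the exact degradation model used in this work:

```python
import numpy as np
from PIL import Image, ImageFilter

def degrade(hr: Image.Image, scale: int = 4, blur_radius: float = 1.0,
            noise_std: float = 5.0) -> Image.Image:
    """Toy degradation: blur, downsample, then add Gaussian noise."""
    blurred = hr.filter(ImageFilter.GaussianBlur(blur_radius))
    lr = blurred.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    arr = np.asarray(lr).astype(np.float32)
    arr += np.random.normal(0.0, noise_std, arr.shape)  # simulated sensor noise
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```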
The problem of super-resolution is to retrieve a plausible high-resolution version of a low-resolution input, i.e. to reverse the generic degradation process we just described. In this post, we use deep neural networks to perform super-resolution. They are a specific kind of statistical learning method, which means that they are trained to carry out a specific task based on example data. We briefly describe their inner workings and base principles in the first part.

Video 1: Left: original; right: super-resolution.

No matter how powerful neural networks are, we cannot but notice that the super-resolution problem is ill-posed, since it does not have a unique solution. Indeed, multiple high-resolution images can lead to the same low-resolution version post-degradation. Thus, we are not trying to retrieve an exact image, but rather an element of the manifold of realistic high-resolution images.
Defining “realistic” is not an easy thing to do. We will see that pixel-based metrics lead to models producing images that are faithful to the original low-resolution version, but that lack details. We will use more sophisticated criteria to obtain more satisfactory results.
Once we have built a solid single-image super-resolution algorithm, the next step is to adapt it to videos. This is easier said than done, though. Not only do we need to add artificial details to a low-resolution image like we would for single images, but we must also ensure that the video is temporally coherent. To put it another way, videos have the advantage of offering more information to process compared to single images, but guaranteeing time-coherence is a difficult challenge.

1 How do neural networks learn?

1.1 Neural networks are (very) complicated functions

Before introducing neural networks, we need to understand the idea behind the notion of function. Mathematically, a function is simply a mapping between two sets. For instance, there exists a function that maps every human being to their first name. If we are trying to classify images, we want to approximate the function that perfectly maps an image to its class. This is quite a broad definition, and anything can be modeled using a function. In the case of super-resolution, we are trying to retrieve a function that maps a low-resolution image to a plausible and visually pleasing high-resolution version. We make the assumption that such a function exists, and try to reconstruct it in the most accurate way possible.

How do we do this in practice? Let’s look at a numerical example of a function, say the square function that maps any real number to its square. The square of 2 is 4, 10 squared is 100, etc. We can add parameters to make it a bit more complex; say, we multiply it by a weight w and add a bias b. If we apply this new parametric function to 2, we get w × 2² + b. We can set w and b to any number, for instance w = 5 and b = −3. Then if we apply our function to 2 again, we get 5 × 2² − 3 = 17.
A neural network is nothing more than a parametric function, but one that has a lot of parameters. In the previous example, we only had two parameters, w and b, whereas a neural network can have thousands or even millions. The way neural networks learn is by adjusting these parameters, in order to simulate an extremely sophisticated function that would perform the given task perfectly.
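To make this concrete, the parametric function from the example takes two lines of Python, with w and b playing the role of the learnable parameters:

```python
def parametric_square(x, w, b):
    """Parametric square function: f(x) = w * x**2 + b."""
    return w * x ** 2 + b

print(parametric_square(2, w=5, b=-3))  # 5 * 2**2 - 3 = 17
```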

1.2 The learning process

When we train a neural network, we show it examples from a training set, which includes real data and a target output. In the case of super-resolution, the data is a low-resolution image, while the target is a ground truth reference, a high-resolution version of that image. For every pair of low- and high-resolution images in the dataset, the neural network is shown the low-resolution input and asked to produce a high-resolution version of it. The super-resolution image that comes out of the neural network is compared to the ground truth, and the weights and biases (the parameters) are adjusted to make the output closer to the reference. In Figure 1, a simplified neural network is represented, with 3 neurons managing the red, green, and blue color channels, and various other operations symbolized by the yellow dot. The output looks too blue and not red enough, so the weights in the neurons for blue and red are adjusted accordingly. In a real neural network, we do not predefine operations like this; instead, we let the network find its own way to perform the given task.

Figure 1: How weights and biases are adjusted when training a neural network.
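As an illustration, one step of the training loop described above can be sketched in PyTorch. Everything here (the tiny stand-in model, the random data, the learning rate) is a placeholder for exposition, not the actual setup used in this project:

```python
import torch
import torch.nn as nn

# A deliberately tiny stand-in for a real super-resolution network:
# one convolution followed by 4x pixel-shuffle upsampling.
model = nn.Sequential(
    nn.Conv2d(3, 48, kernel_size=3, padding=1),
    nn.PixelShuffle(4),  # (48, H, W) -> (3, 4H, 4W)
)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

lr_batch = torch.rand(8, 3, 32, 32)    # low-resolution inputs
hr_batch = torch.rand(8, 3, 128, 128)  # ground-truth references

sr_batch = model(lr_batch)             # the network's high-resolution attempt
loss = criterion(sr_batch, hr_batch)   # how far from the reference?
optimizer.zero_grad()
loss.backward()                        # backpropagate the error
optimizer.step()                       # adjust weights and biases
```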

One challenge is to learn the training set properly while being able to generalize what has been learnt to new, never-seen-before examples. Hence, when training a neural network, we check every few training iterations how well the algorithm performs on a validation set, which it does not see during the training phase. If the algorithm has similar levels of performance on both sets, it should be capable of generalizing well, supposing that the training and validation sets are unbiased, which is a strong hypothesis in practice.
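In code, that periodic check is just an evaluation pass with gradient tracking turned off. Continuing the sketch above (`val_loader` is an assumed iterable of low/high-resolution pairs):

```python
import torch

@torch.no_grad()
def validate(model, val_loader, criterion):
    """Average loss on data the network never sees during training."""
    model.eval()
    total, batches = 0.0, 0
    for lr_batch, hr_batch in val_loader:
        total += criterion(model(lr_batch), hr_batch).item()
        batches += 1
    model.train()
    return total / batches
```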

1.3 Training a neural network for super-resolution

At the beginning of the training process, the super-resolution output of our neural network will look even worse than what we would get using basic upscaling methods. The colors will be off, the image could look very blurry, we might not even be able to recognize the original picture. But that is the point of any training: you start with bad performance, and progressively get closer to perfect execution by refining the process.
To train our super-resolution network, we need a way of translating intuitive ideas like “this image looks better” or “this image is closer to the reference” into something the computer can understand, i.e. a loss function. The simplest way to tell how close two images are is to compare them pixel by pixel, for each color channel. We compute the square of the difference between the values of each pixel in the two images, and average everything to get the Mean Squared Error (MSE) of the super-resolution image compared to the ground truth reference. This gives us a training objective: we want to minimize the MSE, and we have a perfect result when it is zero.
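For two images stored as arrays of pixel values, this boils down to a few lines of NumPy (assuming both arrays have the same shape):

```python
import numpy as np

def mse(sr: np.ndarray, hr: np.ndarray) -> float:
    """Mean Squared Error between a super-resolution image and its reference."""
    diff = sr.astype(np.float32) - hr.astype(np.float32)
    return float(np.mean(diff ** 2))  # 0.0 means a pixel-perfect match
```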

2 Single-image super-resolution

After trying different architectures, we found that two networks could produce nice-looking images with relatively short processing times. The Super-Resolution Residual Network (SRResNet) by Ledig et al. is quite fast, running at over 300 FPS for 480 × 270 → 1920 × 1080 px² upscaling, and gives very contrasted images. Wang et al. introduced the Residual-in-Residual Dense Blocks Net (RRDBNet), which is slower, averaging 12 FPS for the same 480 × 270 → 1920 × 1080 px² super-resolution, but able to produce images that look more natural than those obtained with SRResNet.
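To give a flavor of what these networks are built from, here is a simplified residual block in the spirit of SRResNet. The layer sizes are illustrative, and details such as normalization and activation may differ from the actual implementations used here:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified SRResNet-style block: the input is added back to the
    output, so the layers only have to learn a residual correction."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # skip connection
```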

2.1 Pixel-based training

Pixel-based training is the most straightforward way to train a neural network for super-resolution. It can already give acceptable results in terms of image quality; however, networks trained with the MSE loss will never be able to retrieve the finer details in the pictures.

Figure 2: Pixel-based training sample results.

2.2 Generative Adversarial Networks

Since the MSE loss is too simple to give satisfactory results, we need to find another way of telling the computer how good-looking the images produced by our network are. Generative Adversarial Networks (GANs) were introduced by Goodfellow et al. in 2014, and have since been used to successfully perform many complicated tasks, including super-resolution. The idea is to set up a game between two neural networks: a generator, which produces super-resolution images, and a discriminator, which has to tell artificial images apart from real ones. We can think of the generator as a counterfeiter that makes fake bank notes, while the discriminator is the police, trying to catch the criminal. In the end, we want the generator to produce fake notes (or super-resolution images) that are indistinguishable from real money (real high-resolution images).
Training the discriminator consists of showing it both real and fake images, and assigning it a score based on the proportion of correct predictions. The generator’s loss function becomes a sum of different pixel-based criteria and an adversarial term, which corresponds to the number of fake images that the discriminator has correctly identified.
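Schematically, the two objectives can be written as follows in PyTorch. The binary cross-entropy formulation and the adversarial weight are common choices (e.g. in SRGAN), not necessarily the exact losses used in this project:

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
pixel_loss = nn.MSELoss()

def discriminator_loss(real_logits, fake_logits):
    """Reward the discriminator for labeling real images 1 and fakes 0."""
    return (bce(real_logits, torch.ones_like(real_logits))
            + bce(fake_logits, torch.zeros_like(fake_logits)))

def generator_loss(sr, hr, fake_logits, adv_weight=1e-3):
    """Pixel-based term plus an adversarial term: the generator scores well
    when the discriminator mistakes its fakes for real images."""
    adversarial = bce(fake_logits, torch.ones_like(fake_logits))
    return pixel_loss(sr, hr) + adv_weight * adversarial
```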

Figure 3: Sample results of the GAN-trained network. Left: low-resolution, right: super-resolution.
Figure 4: Sample results of the GAN-trained network. Top: LR; bottom: SR.
Figure 5: Details produced by the GAN-trained network.

The images look quite natural, and we were able to retrieve a lot more details, and finer ones, than with the sole MSE loss. However, networks trained as part of a GAN tend to produce specific details that are usually invisible unless we zoom into the picture (cf. figure 5), but that can be disconcerting in some cases (cf. figure 6). Additionally, faces and complicated logos are sometimes badly rendered. Specific training could solve these issues.

Figure 6: An example of failure for the GAN-trained network.

3 From image to video super-resolution

Once we have satisfactory results for the single-image super-resolution task, the next step is video super-resolution. This new challenge is harder, because we need to take temporal coherence into account; hence we cannot just pass every frame through a single-image super-resolution network. Since we produce artificial details when performing super-resolution, we need to make sure that these details remain the same over the course of the video. Fortunately, with more information to use (we have multiple frames at our disposal), we should be able to tackle this issue. For the Mingle use case, we want to keep the processing time as short as possible, so we use Sajjadi et al.’s frame-recurrent network. The inputs are the current low-resolution frame, the previous low-resolution frame, and the previously generated super-resolution image. We compute the optical flow between the low-resolution frames, i.e. we look at the way each pixel moved between the two frames, combine this new information with the previous super-resolution frame, and perform super-resolution for the current frame (figure 7). The model is GAN-trainable, and it is possible to train the two networks separately before fine-tuning the whole model.

Figure 7: The frame-recurrent super-resolution network.
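The key operation is warping the previously generated super-resolution frame according to the (upscaled) optical flow. A bare-bones version using PyTorch’s grid_sample is sketched below; the flow estimation network itself is omitted, and the flow is assumed to be given in pixels:

```python
import torch
import torch.nn.functional as F

def warp(prev_sr: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a frame (B, C, H, W) with a per-pixel flow field (B, 2, H, W)."""
    b, _, h, w = prev_sr.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    # Target coordinates for each pixel: its own position plus the flow.
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # grid_sample expects (x, y) coordinates normalized to [-1, 1].
    grid_x = 2.0 * grid[:, 0] / (w - 1) - 1.0
    grid_y = 2.0 * grid[:, 1] / (h - 1) - 1.0
    norm_grid = torch.stack((grid_x, grid_y), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(prev_sr, norm_grid, align_corners=True)
```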

There is a trade-off between image quality and temporal coherence. The frame-recurrent network acted as an anti-aliasing filter, but was not able to recover fine details in the crowd. Frame-recurrent super-resolution runs at about 55 FPS with an SRResNet-like super-resolution network, and slightly above 3 FPS with an RRDBNet-inspired architecture.

Figure 8: Top: frame-recurrent; middle: frame-by-frame; bottom: low-resolution.

Conclusion

Super-resolution is a difficult task, but modern neural networks can be used to upscale and enhance low-resolution images. For training, purely pixel-based criteria produce overly smooth images, whereas adversarial training helps to retrieve fine details in the pictures. On complicated cases like severely degraded faces and logos, specific training is required to obtain usable results. Video super-resolution can be performed by processing the entire video frame by frame using a single-image super-resolution model, but we may then encounter temporal coherence problems. Frame-recurrent models can solve this issue, though the image quality is not as good as that of the single-image models. Once again, meticulous training is the key to achieving the best results. Newer techniques like U-Nets and attention are being used more and more in computer vision, and with the recent boom of diffusion models, super-resolution performance will likely skyrocket in the years to come. Diffusion models can already produce insanely detailed images, sometimes even better than the ground truth. Inference time remains their main weakness, though.
