This blog post was originally published at Visidon’s website. It is reprinted here with the permission of Visidon.
Noise is frustrating whether it appears in images or in videos. Common causes include low resolution, hardware constraints, inaccurate equipment, the statistical nature of photons, and severe weather or environmental conditions that ruin a perfectly taken shot. In other words, noise compromises quality. It is especially prominent in the smaller sensors used in devices such as smartphones, action cameras, and other edge devices. Despite significant advances in camera sensors, real-time image and video denoising remains one of the most challenging tasks in mobile photo and video processing. This article explores what noise is and how Visidon's denoising algorithm removes it in real time.
What is Noise
Noise is the grain-like, often colorful artifact that changes from image to image and from frame to frame in a video. It is a stochastic or semi-stochastic process, meaning it is random or semi-random. Wherever there is a signal in a physical system, there is also noise. The amount of noise is usually described by the signal-to-noise ratio, or SNR, which relates the amount of signal (light related to what we are trying to capture) to the amount of noise (the "extra stuff" we usually do not want to see). The higher the SNR, the less noise there is relative to the actual image we want to see. A relative measure is used because, in most cases, the amount of noise depends on the number of captured photons.
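As an illustration (this is a generic textbook computation, not Visidon's code), SNR in decibels can be computed from a clean reference and a noisy observation. The example below uses a synthetic flat gray patch with additive Gaussian noise:

```python
import numpy as np

def snr_db(clean, noisy):
    """Signal-to-noise ratio in decibels: signal power over noise power."""
    noise = noisy - clean
    return 10 * np.log10(np.sum(clean.astype(np.float64) ** 2) /
                         np.sum(noise.astype(np.float64) ** 2))

# Synthetic example: a flat mid-gray patch plus Gaussian noise (sigma = 10).
rng = np.random.default_rng(0)
clean = np.full((64, 64), 128.0)
noisy = clean + rng.normal(0, 10, clean.shape)
print(f"SNR: {snr_db(clean, noisy):.1f} dB")  # roughly 22 dB for these values
```

Note that this requires a clean reference, which is exactly what is missing in real captures; that is what makes denoising hard to evaluate in the wild.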
|Figure 1: Noisy image
|Figure 2: Noiseless image
Despite coming from a very similar source, noise usually looks different in videos than in still images. Images lack the temporal dimension that most video pipelines use for compression and other processing, and this time-domain processing changes how the noise appears. Compression and other processing apply to still images as well, but they are strictly bound to that one image. Videos and images share a similar capture pipeline and usually go through much of the same processing when being captured.
|Figure 3: Uncompressed
|Figure 4: Compressed
Noise Reduction / Denoising
Noise reduction, also known as "denoising", is a popular research field and is not limited to image/video processing. Where there is a signal, there is noise, and it needs to be reduced or removed. Video denoising is the removal of noise from a video signal to enhance the visual quality of the video.
There are different methods of noise reduction for image/video signals.
|Figure 5: Typical imaging processes
Most of these methods depend on the statistical properties of noise in comparison to the signal: noise, regardless of the signal content, usually looks visually very similar. However, this is not always the case. In Figures 1 and 2, it is difficult to tell what is noise and what is signal, and this is one of the limitations of statistical methods: they tend to blur the image, which is not visually appealing. More sophisticated methods attempt to solve this problem with local and non-local statistics, using different areas of the image and wider fields of view. While they significantly improve on simpler methods, they still suffer from over-smoothing or under-smoothing, and these traditional noise reduction methods often create a blocky effect. A noisy image and some examples of traditional noise reduction algorithms are shown below.
|Figure 6: original noisy image
|Figure 7: total variation NR
|Figure 8: bilateral NR
|Figure 9: non-local means
We can see several problems in the images above: over-smoothing (total variation NR, Figure 7), under-smoothing (bilateral NR, Figure 8), and edge and color artifacts (non-local means, Figure 9).
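To make the smoothing trade-off concrete, here is a minimal bilateral filter written from scratch in NumPy (an illustrative sketch, not any production implementation). Each output pixel is a weighted average whose weights fall off with both spatial distance and intensity difference, so flat regions are smoothed while strong edges survive. Set the range sigma too small and the result is under-smoothed; too large and it blurs edges like a plain Gaussian:

```python
import numpy as np

def bilateral(img, radius=3, sigma_s=2.0, sigma_r=0.1):
    """Minimal bilateral filter for a 2-D float image.
    Weights combine spatial distance (sigma_s) and intensity
    difference (sigma_r), preserving edges while smoothing flat areas."""
    h, w = img.shape
    pad = np.pad(img, radius, mode="reflect")
    out = np.zeros_like(img, dtype=np.float64)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    spatial = np.exp(-(ys**2 + xs**2) / (2 * sigma_s**2))  # fixed spatial kernel
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            # Range kernel: pixels with similar intensity get higher weight.
            rangew = np.exp(-(patch - img[i, j])**2 / (2 * sigma_r**2))
            wgt = spatial * rangew
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out
```

The double loop is deliberately naive for readability; real implementations vectorize or approximate this, which is part of why such filters are costly at video rates.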
Video adds the problem of the temporal dimension. For a single image, it is sufficient for the algorithm to produce a visually appealing result, but for video, the algorithm must produce output that is both visually attractive and temporally consistent. That is not an easy task! Adjacent frames can be used to improve the noise reduction outcome, but doing so is not simple and not always effective: it frequently produces unwanted artifacts such as "ghosting".
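Ghosting is easy to reproduce with the simplest temporal scheme, a sliding-window average over neighboring frames (an illustrative sketch, not Visidon's method): static regions get cleaner, but anything that moves between frames is blended across time and leaves a semi-transparent trail:

```python
import numpy as np

def temporal_average(frames, radius=1):
    """Average each frame with up to `radius` neighbors on each side.
    Noise variance drops roughly by the window size in static areas,
    but moving content is blended across time, causing "ghosting"."""
    frames = [np.asarray(f, dtype=np.float64) for f in frames]
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - radius): t + radius + 1]
        out.append(np.mean(window, axis=0))
    return out

# A bright square moves right by 4 pixels per frame over a noisy background.
rng = np.random.default_rng(0)
frames = []
for t in range(5):
    f = rng.normal(0, 0.05, (16, 32))
    f[6:10, 4 + 4 * t: 8 + 4 * t] = 1.0
    frames.append(f)
denoised = temporal_average(frames)
# denoised[2] shows faint copies of the square at its t=1 and t=3 positions.
```

Motion-compensated variants warp neighboring frames before averaging, which reduces ghosting but adds cost and fails when motion estimation fails.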
Noise reduction with neural networks
Image/video denoising plays an important role in modern image/video processing systems: to obtain high-quality videos, the noise removal process must recover meaningful information from noisy clips. Visidon has created a neural network-based noise reduction technique to address the main issues in traditional noise reduction algorithms. It is fast enough for practical use on edge devices including smartphones, televisions, laptops, set-top boxes, some surveillance systems, and conference call systems. And, as with all neural networks, the main challenge is balancing data, training, speed, and quality.
Speed is the most obvious limiting factor. We design our architecture to be fast enough on the target device, which always has a fixed amount of computational power. However, because speed is such a significant constraint in real-time processing, the networks can only perform specialized tasks; they cannot generalize.
As previously stated, the noise source is primarily physical and is simple to measure and model when dealing with unprocessed data. After capture, the image goes through a variety of processing steps, all of which affect the noise, making it much more complex and difficult to remove. This could be avoided by performing noise reduction before any other processing, but that is not always possible due to technical constraints.
The device has been designed to work in a certain way, and adding a new stage (our noise reduction neural network) to the beginning of the pipeline is simply not possible. Therefore, most noise reduction tasks deal with noise that has been altered in very complicated ways, which leads to a huge number of different noise models. A small network cannot handle all of them, so it must be specific. This means the data used to train the network must closely resemble the actual noise of the target devices.
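For intuition, here is the standard raw-sensor noise model often used to generate such training data (a generic textbook sketch, not the model of any specific Visidon target): photon shot noise is Poisson-distributed in the collected electrons, and the readout adds Gaussian noise, so the noise variance grows roughly linearly with the signal level. This is the well-characterized, signal-dependent noise that later ISP stages then distort:

```python
import numpy as np

def simulate_raw(clean_electrons, read_noise_std=2.0, rng=None):
    """Poisson shot noise + Gaussian read noise on a raw (pre-ISP) signal.
    Resulting variance is approximately signal_level + read_noise_std**2."""
    rng = rng or np.random.default_rng(0)
    shot = rng.poisson(clean_electrons).astype(np.float64)
    return shot + rng.normal(0.0, read_noise_std, np.shape(clean_electrons))

level = 100.0                                    # mean electrons per pixel
noisy = simulate_raw(np.full((512, 512), level))
# Empirical variance should be close to level + read_noise_std**2 (= 104).
```

Once the image passes through demosaicing, tone mapping, and compression, this clean two-parameter model no longer holds, which is why device-specific training data matters.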
It is also difficult to achieve fast neural networks. MobileNet, for example, is a popular architecture for mobile devices. It is indeed very fast for many tasks, but it is a classification network that outputs a feature vector (1,000 features) much smaller than the input image, so it cannot be used directly for real-time high-resolution image/video processing, which is image-to-image translation. One frame of 1080p YUV420p video has roughly 3e6 samples. For an input of this size, MobileNet performs roughly 10e9 operations (MACs) per frame. For a typical 30 fps video, the network must therefore perform about 300e9 operations per second of video. Modern edge devices are capable of running MobileNet, but in image-to-image translation the output must have the same spatial resolution as the input, which means the feature vector has to be brought back to image space. That requires many more operations with the same number of parameters in the network. So, to stay within the same operations-per-second budget while still producing good results, our networks need to be around 100 times smaller than typical "fast" networks.
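The arithmetic above can be checked in a few lines (the per-frame MAC count is the rough figure quoted in the text, not a measured value):

```python
# Back-of-the-envelope compute budget for real-time 1080p video denoising.
width, height, fps = 1920, 1080, 30
luma = width * height                 # ~2.1e6 luma samples per frame
chroma = 2 * (luma // 4)              # YUV420p: two quarter-size chroma planes
samples_per_frame = luma + chroma     # ~3.1e6 samples per frame
macs_per_frame = 10e9                 # rough classifier-scale cost from the text
macs_per_second = macs_per_frame * fps
print(f"{samples_per_frame:.2e} samples/frame, {macs_per_second:.0e} MACs/s")
```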
When we have a fast enough architecture and data that matches the target device, we need to train the model. Previously, we mentioned temporal consistency. That is very important when dealing with videos.
To produce consistent adjacent frames, training must take the temporal dimension into account, which makes it slow and data-heavy. Once all of these factors have been fine-tuned, we begin optimizing for quality. Sometimes we can get away with a much smaller network than we initially started with, so we decrease the network's size: we do not want to do any more computation than we need to.
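One common way to encode temporal consistency during training (a simplified, generic illustration; Visidon's actual loss is not public) is to add a penalty on differences between consecutive denoised outputs, typically after motion compensation:

```python
import numpy as np

def video_denoise_loss(out_t, out_prev_warped, target_t, temporal_weight=0.5):
    """Per-frame fidelity (MSE to the clean target) plus a temporal term
    penalizing flicker between consecutive outputs. `out_prev_warped` is
    the previous output warped by estimated motion; with a static scene,
    no warping is needed."""
    fidelity = np.mean((out_t - target_t) ** 2)
    consistency = np.mean((out_t - out_prev_warped) ** 2)
    return fidelity + temporal_weight * consistency
```

The temporal term is what makes training data-heavy: each training sample must be a sequence of frames rather than a single image.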
When all of this is done, what remains is a real-time, high-quality, temporally consistent video noise reduction solution tailored to the target device. Designing such a network takes far longer than 15 minutes or 5 lines of code.
To conclude, noise can be annoying in both images and videos. The noise present in images and videos is mainly produced by physical phenomena: some of it relates to imperfections in electrical components, and some to processing, which is the most difficult to deal with. Traditional noise reduction methods often over-smooth the images/videos or introduce ghosting artifacts. This is where Visidon's solution comes in.
Our solution is based on a neural network that addresses the main problems of traditional noise reduction algorithms and networks. We achieve this with detailed training data, an optimized inference pipeline, camera-specific calibration, and extremely fast architectures. Our solution clearly improves the noise level in both objective and subjective terms. As the objective measurements in Table 1 show, it reduces noise very well, improving SNR while preserving image details and texture, as quantified by Dead Leaves statistics. Dead Leaves is a random pattern of circles of varying diameter and color, used to measure texture loss in an image.
In our test, the denoiser was applied to a low-cost web camera whose noise performance was originally worse than that of the Apple MacBook Pro camera; after denoising, it can reach and even outperform the MacBook Pro camera's noise performance.
Table 1: Metrics Related To Quality and Speed
Visidon’s solution runs in real time on any edge device with a dedicated digital signal processor (DSP) or neural processing unit (NPU).