
HDR Sensors for Embedded Vision


By Michael Tusch
Founder and CEO
Apical Limited

At the late-March 2012 Embedded Vision Alliance Summit, Eric Gregori and Shehrzad Qureshi from BDTI presented a helpful overview of CCD and CMOS image sensor technology. I thought it might be interesting to extend this topic to cover so-called HDR (High Dynamic Range) / WDR (Wide Dynamic Range) sensors. HDR and WDR mean the same thing; it’s just a matter of how you use each axis of your dynamic range graph. I’ll employ the common terminology "HDR" throughout this article.

I think that this is an interesting topic because many embedded vision applications need to perform consistently across all real-scene environments. We know that conventional cameras, even high-end DSLRs, can’t capture as much information in very high-contrast scenes as our eyes can discern. This fact explains why we have rules of photography such as “make sure the sun is behind you”. Indeed, conventional image sensors do have problems in such conditions, but the industry has devoted significant work over many years to HDR sensors, which extend raw capture capability far beyond what is available in conventional consumer and industrial cameras. The reliability of the image capture component is of course one key element of overall system performance.

The dynamic range (DR) of a sensor is the ratio of the brightest pixel intensity to the darkest pixel intensity that the camera can capture within a single frame. This number is often expressed in decibels (dB), i.e.

DR in dB = 20 * log10(DR)

The human eye does very well and, depending on exactly how the quantity is measured, is typically quoted as being able to resolve around 120-130 dB in daytime conditions.

Image sensors are analog devices that convert pixel intensities to digital values via an analog-to-digital converter (ADC). The bit depth of the output pixels sets an upper limit on the sensor dynamic range, as shown in Table 1.

| Type of sensor | Bits/pixel | Maximum intensity levels recorded | Maximum sensor dynamic range (dB) |
|---|---|---|---|
| Very low-cost standard | 8 | 256 | 48 |
| Average standard | 10 | 1024 | 60 |
| Higher-quality standard | 12 | 4096 | 72 |
| HDR | 16 | 65536 | 96 |
| HDR | 20 | 1048576 | 120 |

Table 1. Dynamic range potential of various image sensor types

In reality, the maximum dynamic range is never quite achieved, since in practice noise consumes up to ~2 bits of the useful pixel range.
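To make the arithmetic concrete, here is a minimal Python sketch (illustrative only) that reproduces the Table 1 figures and then applies the ~2-bit noise penalty:

```python
import math

for bits in (8, 10, 12, 16, 20):
    levels = 2 ** bits
    ideal_db = 20 * math.log10(levels)            # DR in dB = 20 * log10(DR)
    usable_db = 20 * math.log10(2 ** (bits - 2))  # ~2 bits lost to noise
    print(f"{bits:2d} bits: {levels:>7} levels, "
          f"{ideal_db:5.1f} dB max, ~{usable_db:5.1f} dB in practice")
```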

Standard CMOS and CCD sensors achieve up to ~72 dB dynamic range. This result is sufficient for the great majority of scene conditions. However, some commonly encountered scenes exist which overwhelm such sensors. Well-known examples are backlit conditions (i.e. a subject standing in front of a window), outdoor scenes with deep shadows and sunsets, and nighttime scenes with bright artificial lights (Figure 1).

Figure 1. This backlit scene has a dynamic range of around 80 dB.

Such scenes typically exhibit a dynamic range of around 100 dB and, in rare cases, up to 120 dB (Figure 2). If captured with a conventional sensor, the image either loses detail in shadows or has blown-out (i.e. clipped) highlights.

Figure 2. This high-contrast scene has a dynamic range of around 100 dB.

Numerous attempts have been made to extend standard CMOS and CCD technology, overcoming the limitations of pixel sensitivity and ADC precision, in order to capture such scenes. Pixim developed the first really successful HDR sensor, based on its per-pixel digital readout technology, and it was the industry standard for many years. However, the technology, which effectively processes each pixel independently, is relatively expensive. More recently, other vendors have concentrated on sensors constructed from more conventional CMOS technology. Numerous different solutions are available; the remainder of this article will survey the main vendors and the techniques they use.

Multi-frame HDR is an HDR method that does not rely on custom CMOS or CCD technology. Operating as a video camera, the sensor is programmed to alternate between a long and a short exposure on a frame-by-frame basis, and the image signal processor (ISP) blends successive images together in memory to produce a single HDR image (Figure 3). If the blending algorithm is robust, an exposure ratio of around 16 is comfortably achievable, adding an extra 4 bits to the single-exposure dynamic range. For example, a 12-bit sensor can produce images characteristic of a 16-bit sensor.

Figure 3. Blending together short- and long-exposure versions of a scene creates a multi-frame HDR result.
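To make the blending step concrete, here is a minimal Python/NumPy sketch, assuming two already-captured 12-bit raw frames with a 16:1 exposure ratio. The function name, the linear fade-over weighting, and the saturation threshold are illustrative assumptions, not any particular vendor’s algorithm (a production ISP would also add the motion-aware logic discussed below).

```python
import numpy as np

def blend_hdr(long_exp, short_exp, ratio=16.0, sat_level=4095):
    """Blend long- and short-exposure raw frames into one linear HDR frame."""
    long_f = long_exp.astype(np.float64)
    short_f = short_exp.astype(np.float64) * ratio  # bring to a common scale
    # Trust the long exposure (better SNR) until it nears saturation, then
    # fade over to the unclipped short exposure to avoid blown highlights.
    w = np.clip((sat_level - long_f) / (0.2 * sat_level), 0.0, 1.0)
    return w * long_f + (1.0 - w) * short_f
```

With a ratio of 16, the blended output spans roughly 16 linear bits, matching the 12-bit-to-16-bit example above.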

As with all HDR technologies, there is a catch. In this particular case, it is the potential generation of motion artifacts, most noticeable as "ghosting" along the edges of objects that have moved between the two frames. Such artifacts are very expensive to eliminate even partially, although specific processing in the ISP can significantly suppress their appearance. Further, the effective frame rate is reduced: if the input frame rate is 60 fps, the output can remain at 60 fps, but highlights and shadows will exhibit an effective frame rate closer to 30 fps, and mid-tones will fall somewhere between 30 and 60 fps, depending on how clever the blending algorithm is.

The Altasens A3372 12-bit CMOS sensor uses a “checkerboard” pixel structure, wherein alternating Bayer RGGB pixel quad clusters are set to long- and short-exposure configurations (Figure 4). In HDR scenes, the long-exposure pixels capture dark information, while short-exposure pixels handle bright details.

Figure 4. The Altasens A3372 checkerboard array devotes alternating quad-pixel clusters to capturing dark and light scene details.

The long exposure delivers an improved signal-to-noise ratio but saturates the pixels corresponding to bright details; the short-exposure pixels conversely capture those bright details properly. Dynamic range reaches ~100 dB. The cost of HDR in this case is the heavy processing required to convert the checkerboard pattern into a normal linear Bayer pattern. This reconstruction requires complex interpolation because, for example, in highlight regions of an HDR image, half of the pixels are missing (clipped); an algorithm must estimate these missing values. While such interpolation can be done with remarkable effectiveness, some impact on effective resolution inevitably remains. However, this tradeoff is rather well controlled, since the sensor only needs to employ the dual-exposure mode when the scene demands it; the A3372 reverts to non-HDR operation when the scene can be captured via the standard 12-bit single-exposure mode.
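As a rough illustration of the reconstruction problem (not Altasens' actual algorithm, which is far more sophisticated), the sketch below scales the short-exposure pixels to a common range and then patches clipped long-exposure pixels with a simple neighbour average; the `long_mask` input marking which pixels are long-exposure is an assumption of this sketch.

```python
import numpy as np

def checkerboard_to_linear(raw, long_mask, ratio=16.0, sat_level=4095):
    """raw: checkerboard frame; long_mask: True where a pixel is long-exposure."""
    out = raw.astype(np.float64)
    out[~long_mask] *= ratio                  # scale short pixels to match
    clipped = long_mask & (raw >= sat_level)  # the 'missing' highlight pixels
    # Estimate each clipped pixel from the mean of its 4-neighbours,
    # which include the already-rescaled short-exposure pixels.
    padded = np.pad(out, 1, mode="edge")
    neighbours = (padded[:-2, 1:-1] + padded[2:, 1:-1] +
                  padded[1:-1, :-2] + padded[1:-1, 2:]) / 4.0
    out[clipped] = neighbours[clipped]
    return out
```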

A very different HDR method is the so-called “companding” technique employed by sensors such as Aptina's MT9M034 and AR0330, along with alternatives from other vendors. Such sensors use line buffers to accumulate multiple exposures (up to four, in some cases) line by line. The output pixels retain a 12-bit depth, set by the ADC precision, but those 12 bits pack in 20 or more effective bits of linear intensity data. Companding is conceptually similar to the way gamma correction is used to encode 2 bits of additional data in a color space such as sRGB. Inverting this non-linear encoding recovers a linear HDR Bayer image.
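The inversion itself amounts to a piecewise-linear lookup. Below is a minimal sketch assuming a hypothetical three-knee response; real companding sensors publish their own knee points, and the values here are illustrative only.

```python
import numpy as np

# (companded 12-bit code, linear value) knee points -- hypothetical numbers
# chosen so that code 4095 maps to ~20 linear bits.
KNEES_X = np.array([0.0, 2048.0, 3072.0, 4095.0])
KNEES_Y = np.array([0.0, 2048.0, 32768.0, 1048575.0])

def decompand(raw12):
    """Map 12-bit companded codes back to linear HDR intensities."""
    return np.interp(raw12.astype(np.float64), KNEES_X, KNEES_Y)
```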

This method produces the highest dynamic ranges; one vendor claims 160 dB. But it again comes with associated costs. First, the data inversion relies on very accurate and stable knowledge of where the various exposures begin and end. In practice, imperfections lead to noise at specific intensity levels that can be hard to eliminate. Second, the sequential exposures in time create the motion artifacts discussed earlier; these can be suppressed but are difficult to remove. Standard techniques for flicker avoidance (such as "beating" with the 50 Hz or 60 Hz flicker of indoor lighting) also don’t work when more than one exposure time is in use.

Yet another HDR sensor implementation is the dual-pixel structure employed by Omnivision in sensors such as the OV10630. It consists of a non-Bayer array of pixels made up of two physically different types: a “dark” pixel and a “bright” pixel, which can be of different sizes. The dark pixels are more sensitive to light and therefore handle dark areas well, with good signal-to-noise. Conversely, the bright pixels are less light sensitive and therefore don't saturate as readily in bright regions. In principle, the dual-pixel approach is a very "clean" HDR technology. It avoids motion artifacts and requires no complex non-linear processing. Penalties include the fact that two pixels are blended into one, so the effective resolution is half of the actual resolution. The dual-pixel structure is also more costly on a per-pixel basis, and the output raw pixel pattern cannot be processed by standard ISPs.
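A sketch of the pixel-pair merge is below, assuming the dark and bright pixel planes have already been de-interleaved and that their sensitivity ratio is known; both the function and the ratio are illustrative assumptions, since Omnivision's actual pattern and merge are proprietary.

```python
import numpy as np

def merge_dual_pixels(dark, bright, sensitivity_ratio=8.0, sat_level=1023):
    """Merge each sensitive ('dark') / less-sensitive ('bright') pixel pair
    into one output pixel -- hence the halved effective resolution."""
    d = dark.astype(np.float64)
    b = bright.astype(np.float64) * sensitivity_ratio  # common linear scale
    # Use the sensitive pixel where it is unclipped; otherwise its partner.
    return np.where(dark < sat_level, d, b)
```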

More generally, each of the sensor types discussed in this article requires a different image-processing pipeline to convert its captured images into a standard output type. This fact means that it is not typically possible to directly connect an HDR sensor to a standard camera DSP and obtain an HDR result. Figure 5 below shows the pipelines for Bayer-domain processing of the multi-frame, Altasens and Aptina-type HDR sensor raw inputs. Standard processing is possible subsequent to color interpolation.

Figure 5. The image processing flow varies depending on what type of HDR sensor is being employed.

Obtaining genuine HDR imagery is also not just a matter of pairing an HDR sensor with an HDR ISP. For scenes with dynamic range beyond 100 dB, optics also plays a central role. Unless the lens is of sufficient quality and the optical system has the necessary internal anti-reflection coatings to prevent back-reflection from the sensor, it is impossible to avoid flare and glare in many HDR scenes, creating artifacts that effectively negate much of the sensor's capture capability. To put it simply, building an HDR camera suitable for the full range of scene conditions is not inexpensive.

In conclusion, a variety of sensor and ISP technologies exist for capturing and processing HDR imagery. They all involve some kind of image quality trade-off in exchange for the extended dynamic range, either in resolution or in time. It is worth remembering that although the technology may be elaborate, the purpose is simply to extend effective pixel bit depth and reduce noise. To see this, compare the images shown in Figure 6.

Figure 6. A comparison of two images reveals HDR shadow strengths.

The upper image was captured using a 12-bit CMOS sensor in normal mode. The image below it uses the exact same sensor but employs the multi-exposure mode discussed earlier. The effect of the HDR mode closely resembles that of noise reduction. In the first image, strong local tone mapping increases the digital gain so that shadows are visible, while exposure is kept low enough to avoid highlight clipping. This technique in effect captures the window area at ISO 100 and the shadow area at ISO 3200, and it requires no non-standard capture technology. The HDR image, conversely, achieves the same effective exposures for shadows and highlights, but by varying the exposure times, leading to greater sensitivity and lower noise in the shadow region.

High-performance temporal and spatial noise reduction (NR) technology can extend dynamic range by up to ~12 dB, and high-performance dynamic range compression technology can map the input dynamic range to a standard output without loss of information. So a standard 12-bit CMOS sensor with good NR can achieve around 84 dB, which is “pretty good HDR”, while a 14-bit CMOS sensor with good NR can achieve nearly 100 dB, which is “mainstream HDR”. However, dedicated HDR sensors are required for truly high dynamic range scenes.
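Stated as arithmetic (assuming NR simply adds its ~12 dB on top of the ideal single-exposure figure):

```python
import math

# 12-bit: ~72 dB ideal + ~12 dB NR = ~84 dB ("pretty good HDR")
# 14-bit: ~84 dB ideal + ~12 dB NR = ~96 dB ("mainstream HDR")
for bits in (12, 14):
    base_db = 20 * math.log10(2 ** bits)
    print(f"{bits}-bit + NR: {base_db:.0f} + 12 = {base_db + 12:.0f} dB")
```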
