Tensilica Vision Q8 and P1 DSPs, More AND Less

This blog post was originally published at Cadence’s website. It is reprinted here with the permission of Cadence.

President George H. W. Bush famously said that he didn’t do “the vision thing”. Well, here at Cadence we definitely do the vision thing. In fact, the Tensilica Vision DSP product line is the market leader in the vision thing.

Until this week, the primary products in the portfolio were the Vision P6 DSP at the low end, and the Vision Q7 DSP at the high end (there are also older products). This week, at the Linley Processor Conference, Cadence is announcing two new Tensilica Vision Processors, the Tensilica Vision Q8 DSP at the high end and the Tensilica Vision P1 DSP at the low end. So the overall portfolio looks like this:

For details on the two existing processors, see my posts:

Vision has an enormous spread in requirements, ranging from IoT devices like smart doorbells, to the cameras involved in autonomous driving, to 3D capture, where multiple cameras are involved and even more heavyweight computational capability is required. The breadth of this spread of requirements means that new processors are required at both the low end and the high end. That’s where the Vision Q8 DSP and the Vision P1 DSP come in.

The Vision Q8 DSP is targeted at the high end, with very high performance. The Vision P1 DSP is targeted at always-on applications, with very low area and power. It is worth emphasizing that these days “vision” is about a lot more than processing images, although that remains an important application. Increasingly, Vision DSPs are used to analyze the images and identify objects (such as pedestrians in an autonomous vehicle). As such, in addition to the image processing datapath, there are also arrays of MACs for neural network processing. Even a smart doorbell has to be smart. The camera pipeline is typically something like this, although it will vary depending on the exact application:

All Tensilica Vision DSPs have a similar architecture. Obviously, things like the bus widths and the number of execution units and MACs vary, but the structure does not. All the Vision DSPs also support TIE, the Tensilica Instruction Extension language, which allows custom instructions to be added. For details on TIE, see my post Custom Instructions in Tensilica: Wearing a TIE Makes You Smarter.

I should say that the Vision P1 DSP does not obsolete the Vision P6 DSP, nor does the Vision Q8 DSP obsolete the Vision Q7 DSP. They are spread out over a wide range of performance/power/area points so there are still applications for which the Vision P6 DSP or the Vision Q7 DSP is the sweet spot. On the other hand, if an application needs two Vision Q7 DSP processors, then it makes more sense to move up to the Vision Q8 DSP instead. And at the lower end, there are certainly applications for which the limited power of the Vision P1 DSP will be inadequate. So the portfolio really does look like the image above, with four processors. Let’s take a look at the two new processors and how they stack up against the existing processors in the portfolio.

Tensilica Vision Q8 DSP

The image above shows the characteristics of the Vision Q8 DSP. It is a 1024-bit SIMD (single-instruction, multiple-data) architecture, which is twice the width of the Vision Q7 DSP or Vision P6 DSP. It is no good having a high-performance processor if it is starved of data, so the memory interface has been increased to 2048 bits. There are new data types added: FP64, and complex numbers (based on any of FP16, FP32, or FP64). There are built-in power measurement features that can be used for power optimization, adjusting the operating point on the fly through dynamic voltage and frequency scaling (DVFS).
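To put those widths in concrete terms, here is a minimal Python sketch of how many elements fit in one SIMD vector at each data width. The function name `simd_lanes` and the lookup table are purely illustrative and not part of any Cadence tool. The 512-bit figure for the Vision Q7 and P6 is inferred from the "twice the width" statement, and the sketch deliberately ignores how many operations actually issue per cycle, since that depends on execution-unit details not covered here.

```python
# Illustrative only: lanes per SIMD vector = vector width / element width.
# The widths come from the post; the names and structure are assumptions.

VECTOR_BITS = {
    "Vision Q8": 1024,  # stated in the post
    "Vision Q7": 512,   # inferred: Q8 is "twice the width"
    "Vision P6": 512,   # inferred: Q8 is "twice the width"
}


def simd_lanes(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one vector."""
    if element_bits <= 0 or vector_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the vector width")
    return vector_bits // element_bits


if __name__ == "__main__":
    for core, bits in VECTOR_BITS.items():
        lanes = {w: simd_lanes(bits, w) for w in (8, 16, 32, 64)}
        print(core, lanes)
```

On these assumptions, a 1024-bit vector holds 128 8-bit elements or 16 FP64 values, and a complex FP64 number (two 64-bit parts) would occupy two of those lanes.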

There are also improvements in the AI performance, with a MAC array that can be configured either as 1024 8-bit MACs or as 256 16-bit MACs. There are also enhancements for non-convolutional neural network layers, such as leaky or parametric ReLU. There are other enhancements, too, resulting in the Vision Q8 DSP having twice the AI performance of the Vision Q7 DSP on widely-used AI benchmarks.
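For readers less familiar with the terms, the following is a plain Python reference sketch of what a multiply-accumulate (MAC) over 8-bit values and the leaky and parametric ReLU activations compute. It describes the arithmetic only; it is not Tensilica code and says nothing about how the hardware implements these operations. The function names and the default slope are assumptions chosen for illustration.

```python
# Reference semantics only; names and defaults are illustrative assumptions.

def mac_int8(activations, weights):
    """Dot product of 8-bit integers, accumulated in a wide integer."""
    if len(activations) != len(weights):
        raise ValueError("inputs must have the same length")
    acc = 0
    for a, w in zip(activations, weights):
        if not (-128 <= a <= 127 and -128 <= w <= 127):
            raise ValueError("operands must fit in signed 8 bits")
        acc += a * w  # one multiply-accumulate step
    return acc


def leaky_relu(values, slope=0.01):
    """Pass positives through; scale everything else by a fixed slope."""
    return [v if v > 0 else slope * v for v in values]


def prelu(channels, slopes):
    """Parametric ReLU: like leaky ReLU, but with one slope per channel
    (in a real network these slopes are learned during training)."""
    if len(channels) != len(slopes):
        raise ValueError("need exactly one slope per channel")
    return [leaky_relu(ch, s) for ch, s in zip(channels, slopes)]


if __name__ == "__main__":
    print(mac_int8([1, -2, 3], [4, 5, -6]))
    print(leaky_relu([-2.0, 0.0, 3.0], slope=0.25))
    print(prelu([[-4.0, 2.0], [-1.0, -8.0]], [0.5, 0.25]))
```

The per-element branch in the two activations is the part that does not map onto a convolution-style MAC array, which is presumably why such layers are called out as a separate enhancement.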

In the most demanding applications, such as autonomous driving, one Vision Q8 DSP may not be enough. But multiple processors can be linked up with the Cadence Multicore Connect, as in the above diagram.

Tensilica Vision P1 DSP

The diagram above shows the capabilities of the Vision P1 DSP. It has deliberately limited capabilities since it is targeted primarily at always-on applications. However, it still offers up to 400 Giga Operations Per Second (GOPS). It has a 128-bit SIMD architecture (¼ the width of the Vision P6 DSP) and just a 256-bit memory interface. Its AI unit has 128 8-bit MACs (so while the SIMD width is one-fourth that of the Vision P6 DSP, the number of MACs is only halved). It is one-third of the area and power of the Vision P6 DSP but runs at 20% higher frequency. It is completely instruction-set compatible with the Vision P6 DSP, and uses the same compilers and libraries as other Vision DSPs. It also supports TensorFlow Lite Micro, the implementation of TensorFlow targeted at microcontrollers.

Software

I won’t say much about software since the toolchain and libraries are basically the same as every other Tensilica processor. You can program all Tensilica processors directly in C++, Halide, OpenCL, OpenVX Graph, and more, as shown in the above diagram. For AI applications, the Cadence AI Software Ecosystem can be used, as shown in the diagram below:

Summary

Here is a summary of the two new processors. The processors are based on the successful Vision P6 and Vision Q7 cores, with the same instruction set, tool flows, software, and so on. The Vision Q8 DSP is optimized for the most demanding vision applications in terms of performance. The Vision P1 DSP is optimized for the most demanding vision applications in terms of power (in particular, always-on applications which require very low standby power). The overall performance of the two cores is summarized in the two tables below: