Transformer Models and NPU IP Co-optimized for the Edge

Transformers are taking the AI world by storm, as evidenced by highly capable chatbots and search tools, as well as image and art generators. Like convolutional models, they are built on neural network technology, but they work in a quite different way from the more commonly understood convolution methods. Now transformers are starting to make their way into edge applications. A very clear motivation is the universality of these methods across diverse applications: vision (vision transformers, or ViT), audio, and natural language processing (NLP). Conventional CNN/RNN-based models, by contrast, serve a more restricted range of applications.

OEMs see obvious cost, training, and maintenance advantages in adopting a single compute platform to serve multiple needs, from, say, pedestrian detection for ADAS to voice-based control for infotainment applications. The effectiveness of the vision transformer network is a key test in any replacement strategy, since CNN-based vision is already well established, although limited to pre-defined patterns.

Additional motivations for system builders include the deluge of transformer research over the last couple of years, which points to very rapid advances in capabilities. Add to that indications that these systems may be amenable to self-supervised learning, much as we already see in large language models (LLMs), and it becomes clear why system OEMs are conveying urgency about jumping on this bandwagon.


Transformers could be a key to implementing text-, audio-, and vision-based generative AI applications at the edge, by enabling lower-power, compressed models compared with conventional CNNs.

Market potential

Nobody is forecasting an end to convolutional models (CNNs). These are already well established in many applications, from home automation to cars and industrial systems, among others. But they are not as broadly versatile in emerging applications as transformers. There is an obvious advantage in building 10-year product plans around a clear technology front-runner, including transformer options, while still reserving CNNs for applications in which they are already well proven.

The global edge computing market was estimated at $44.7B for 2022 and is expected to grow at a 17.8% CAGR to over $140B by 2030. This presents a significant opportunity for edge AI system builders. It also presents a challenge, given the diversity of edge applications, unless they can unify much of their development under a common compute platform. There are some vision and language transformer edge applications today, and more transformers are moving to the edge. One example is Qualcomm’s recent announcement of on-device support for the open-source Llama 2 language model, positioned as a competitor to OpenAI’s GPT-4. All of this suggests the opportunity is ripe for strategic OEM leaders.

A key challenge in adapting transformers to the edge

The cloud-based transformer models we usually hear about are massive and inappropriate for edge deployment. Practical models for the edge are much smaller but, just like CNN models, must be compressed to deliver effective performance within an acceptable power envelope. However, transformer structures are very different from convolution structures, so a different approach to compression is required.

At CEVA we were fortunate to work with CERN in prototyping neural networks for particle jet detection in the CMS detector used in the Large Hadron Collider (LHC). Together we evaluated both CNNs and transformer-based models. This application demands ultra-low latencies, and therefore highly efficient models, to avoid missing events. Our joint research describes a mathematically grounded method for model pruning and quantization to meet that goal.

Quantization (replacing floating-point values and operations with 16-bit, 8-bit, or even 4-bit fixed-point equivalents) is a familiar method from CNN optimization. Pruning exploits the fact that many network parameters are redundant or contribute little to network performance, and selectively removes those unnecessary connections or parameters.
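To make the two ideas concrete, here is a minimal sketch in Python with NumPy. It assumes the simplest textbook variants, symmetric per-tensor quantization and global magnitude pruning; it is not the method from the joint research, and the function names are illustrative.

```python
import numpy as np

def quantize_symmetric(weights, num_bits=8):
    """Map float weights to signed integers that share one scale factor."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the integer representation."""
    return q.astype(np.float32) * scale

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    mask = np.abs(weights) > threshold          # keep only the larger weights
    return weights * mask

# Random weights stand in for a trained layer (an assumption for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

q8, s8 = quantize_symmetric(w, num_bits=8)
q4, s4 = quantize_symmetric(w, num_bits=4)
err8 = float(np.mean((w - dequantize(q8, s8)) ** 2))
err4 = float(np.mean((w - dequantize(q4, s4)) ** 2))

pruned = prune_by_magnitude(w, sparsity=0.5)
kept = int(np.count_nonzero(pruned))

print(f"8-bit MSE: {err8:.2e}, 4-bit MSE: {err4:.2e}, weights kept: {kept}/{w.size}")
```

Fewer bits and higher sparsity both shrink the model, at the cost of a larger approximation error. That trade-off is what a compression method has to manage.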

Typically, a fixed hardware platform constrains options for such tuning. Our research allows for software/hardware co-optimization through both AI processor tuning and transformer model tuning at each layer, to achieve the best performance possible. Based on evaluations across various computer vision and natural language processing benchmarks, we concluded that this optimization method outperforms existing state-of-the-art methods, achieving a superior compression-performance trade-off.
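As a rough illustration of what tuning "at each layer" can mean, the sketch below picks, for every layer, the narrowest bit width whose quantization error stays under a tolerance. This is a simple greedy heuristic and only an assumption for illustration: it does not reproduce the method from the joint research, and the layer names, candidate bit widths, and tolerance are hypothetical.

```python
import numpy as np

def relative_quant_error(weights, num_bits):
    """Relative L2 error after symmetric fixed-point quantization."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    restored = np.clip(np.round(weights / scale), -qmax, qmax) * scale
    return float(np.linalg.norm(weights - restored) / (np.linalg.norm(weights) + 1e-12))

def choose_bit_widths(layers, candidates=(4, 8, 16), tolerance=0.02):
    """Per layer, return the narrowest candidate width that meets the tolerance.

    Falls back to the widest candidate when no width is accurate enough.
    """
    chosen = {}
    for name, weights in layers.items():
        chosen[name] = max(candidates)
        for bits in sorted(candidates):
            if relative_quant_error(weights, bits) <= tolerance:
                chosen[name] = bits
                break
    return chosen

# Hypothetical layers: one with Gaussian weights, one whose weights take only
# a few distinct values and therefore tolerate a much coarser grid.
rng = np.random.default_rng(0)
layers = {
    "attention_qkv": rng.normal(size=(64, 192)),
    "coarse_valued": rng.choice([-1.0, 0.0, 1.0], size=(64, 64)),
}

chosen = choose_bit_widths(layers)
for name, bits in chosen.items():
    print(f"{name}: {bits}-bit")
```

In a co-optimization flow, hardware that supports several precisions lets each layer use the width chosen for it, rather than forcing one width on the whole network.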


A transformer serving many everyday applications

The CERN paper also points to using this principled co-optimization technique to deliver state-of-the-art performance in everyday edge devices while meeting their operational requirements, such as low latency and low power. It is already apparent that a transformer model built along these lines can deliver performance comparable to a CNN-based system, or better for larger datasets. Research indicates such systems may also be more robust to distortion and attacks, thanks to their use of global attention. Also promising are applications using self-supervised learning (SSL), for example to predict what may be in the blocked part of an image or to replace a photobomb with a natural background.
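"Global attention" means that every image patch (or token) weighs information from every other patch, rather than only from a local neighborhood as in a convolution. A minimal single-head sketch of standard scaled dot-product self-attention follows. The random projection matrices stand in for trained parameters, and the sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    tokens: (num_tokens, dim). Returns the outputs and the attention weights.
    """
    q = tokens @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])     # every token scored against every token
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights

# 16 image patches with 8 features each (illustrative sizes only).
rng = np.random.default_rng(0)
num_patches, dim = 16, 8
patches = rng.normal(size=(num_patches, dim))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

output, attn = self_attention(patches, w_q, w_k, w_v)
print(output.shape, attn.shape)
```

The attention matrix has one row per patch and one column per patch, so a distorted patch can be outweighed by evidence from anywhere else in the image.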

For audio, research is already active in more general acoustic scene analysis, in recognizing significant sounds and speech (for speech-to-text, for example), and in speech synthesis. Transformer-based natural language processing, already widely recognized as a major step forward, naturally follows speech recognition. Imagine being able to provide some or all of these capabilities in an edge device without the need to go to the cloud!

In their continuous search for competitive advantage in intelligent edge applications over the next decade, product OEMs need both flexibility in performance tuning options and stability in the base compute platform, so that they do not depend on switching NPU core architectures or training to keep up with evolving demand. Co-optimization of transformer models together with scalable and configurable NPU hardware provides both that flexibility and that stability. Quite a bargain.

Tal Kopetz
Machine Learning Software Senior Team Leader, CEVA
