Vision Transformers vs CNNs at the Edge

This blog post was originally published at Embedl’s website. It is reprinted here with the permission of Embedl.

“The Transformer has taken over AI,” says Andrej Karpathy, former Director of AI at Tesla, in a recent episode of the popular Lex Fridman podcast. The seminal 2017 paper “Attention Is All You Need” by Vaswani and seven co-authors introduced the Transformer, and it has since taken the AI world by storm. Indeed, it is behind the dramatic advances that have made headlines recently, including the amazing ChatGPT and its successor based on GPT-4.

Transformers vs CNNs

CNNs have an inductive spatial bias baked into them through their convolutional kernels, whereas vision transformers are based on a much more general architecture. In fact, the first vision transformers used an architecture from NLP tasks without change and simply chopped up the input image into a sequence of patches in the most naïve way possible. Nevertheless, given enough data, they beat CNNs despite lacking that built-in spatial bias. This may be another example of Rich Sutton’s famous “bitter lesson” of AI: “building in how we think we think does not work in the long run … breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.”
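To make the "chopping up" concrete, here is a minimal sketch in plain Python of turning an image into a sequence of flattened patches. The function name `image_to_patches` and the toy 4x4 image are illustrative assumptions, not code from any ViT implementation, and the sketch leaves out the learned projection and position information that a real model would add to each patch.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (a list of rows) into a row-major list of flattened patches."""
    height = len(image)
    width = len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size tile into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A toy 4x4 single-channel "image" with pixel values 0..15 (hypothetical data).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
toy_patches = image_to_patches(toy_image, patch_size=2)
```

With 2x2 patches, the toy image becomes a sequence of four vectors of four values each. The transformer then treats that sequence much as an NLP model treats a sentence of tokens.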

It turns out that vision transformers “see” very differently from CNNs. A team from Google Brain studied the representations produced by the two architectures very carefully. While it is folklore that CNNs start with very low-level local information and gradually build up more global structures in the deeper layers, ViTs already have global information at the earliest layer thanks to global self-attention. As pithily summarized in a Quanta article, “If a CNN’s approach is like starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.” Another interesting observation to emerge from that study is that skip connections are very important for ViTs.
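A rough back-of-the-envelope sketch shows why a CNN needs depth to see globally. It assumes a deliberately simplified CNN made only of stacked stride-1 convolutions, where each layer widens the receptive field by `kernel_size - 1` pixels. Real CNNs use striding and pooling to grow it faster, so the numbers are illustrative only, and the function names are hypothetical.

```python
def conv_receptive_field(num_layers, kernel_size=3):
    """Receptive field width (in pixels) after stacking stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def layers_to_cover(image_size, kernel_size=3):
    """Smallest number of stacked stride-1 conv layers whose receptive field spans the image."""
    layers = 0
    while conv_receptive_field(layers, kernel_size) < image_size:
        layers += 1
    return layers


# Under these simplifying assumptions, a 224-pixel-wide image needs 112 layers of 3x3 convolutions.
layers_for_224 = layers_to_cover(224)
```

A global self-attention layer, by contrast, lets every patch interact with every other patch in the very first layer.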

The Ubiquitous Transformer

Transformers are causing great excitement across many different application areas and tasks, and they have started to surpass the performance of CNNs. Because of their general-purpose architecture, they offer the potential for a single uniform solution to all vision tasks in one go, rather than a different hand-crafted solution for each task. Previous approaches had to handle different types of relationships (pixel to pixel, pixel to object, or object to object) differently, whereas transformers handle all of them in the same way. This uniformity also makes transformers well suited to multi-modal inputs, which is becoming increasingly important: image and text inputs can be handled in the same model.

We are about to enter the era of “Foundation Models” for computer vision and multi-modal applications. These behemoths will have hundreds of billions of parameters, dwarfing earlier ResNet models with their tens of millions of parameters.

Hardware for Transformers

The great interest in Transformers across different applications has translated into the development of specialized hardware accelerators for these architectures. A recent example is SwiftTron, a specialized open-source hardware accelerator for vision transformers. The SwiftTron architecture implements several hardware units to fully and efficiently deploy quantized transformers on edge AI/TinyML devices using only integer operations. To minimize accuracy loss, a quantization strategy for transformers with scaling factors is designed and implemented. The scheme reliably implements linear operations in 8-bit integer (INT8) arithmetic and non-linear operations in 32-bit integer (INT32) arithmetic. There are surely going to be many more innovations in the hardware space for Transformers in the near future.
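As a rough illustration of what quantization with scaling factors means, the sketch below applies generic symmetric per-tensor INT8 quantization to a dot product. This is a textbook-style example, not SwiftTron's actual scheme. The function names and toy values are assumptions, and the final floating-point rescale is used only for readability.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: real value ~= scale * integer code."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the symmetric INT8 range.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def int_dot(codes_a, scale_a, codes_b, scale_b):
    """Dot product with integer-only accumulation; the scales are applied once at the end."""
    # A Python int stands in here for a wider (e.g. INT32) accumulator.
    accumulator = sum(a * b for a, b in zip(codes_a, codes_b))
    return accumulator * (scale_a * scale_b)


# Hypothetical toy weights and activations.
weights = [0.5, -1.0, 0.25]
activations = [2.0, 1.0, -4.0]
w_codes, w_scale = quantize_int8(weights)
a_codes, a_scale = quantize_int8(activations)
approx = int_dot(w_codes, w_scale, a_codes, a_scale)
exact = sum(w * a for w, a in zip(weights, activations))
```

The multiply-accumulate work, which dominates the cost, is done entirely on small integers. The scaling factors bring the result back close to the floating-point answer.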

Optimizing Transformers

When it comes to optimizing Transformers, their size and resource requirements can pose significant challenges. Model compression is essential when the aim is to bring the benefits of these large models to small edge devices. This is where Embedl comes into the picture. Encouragingly, our experiments and the findings of several recent papers indicate that compression methods like pruning and quantization are notably more effective for Vision Transformers (ViTs) than for Convolutional Neural Networks (CNNs). In fact, our recent pilot projects with one of the world’s leading tier one suppliers have showcased outstanding results in compressing the most widely used ViT models. As Transformers grow in significance across applications, the need for efficient optimization techniques grows with them.
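For readers unfamiliar with pruning, here is a minimal sketch of one common variant, unstructured magnitude pruning, which zeroes out the weights with the smallest absolute values. It is a generic illustration and not Embedl's method. The function name, the toy weights, and the 50% sparsity level are all assumptions.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")
    num_to_prune = int(len(weights) * sparsity)
    # Indices ordered from smallest to largest absolute value.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned_indices = set(order[:num_to_prune])
    return [0.0 if i in pruned_indices else w for i, w in enumerate(weights)]


# Hypothetical toy weight vector; half of the entries are removed.
toy_weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = magnitude_prune(toy_weights, sparsity=0.5)
```

In practice, pruning is usually followed by fine-tuning to recover accuracy, and structured variants that remove whole heads or channels tend to map better onto edge hardware.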


Q: What are Vision Transformers and CNNs?
A: Vision Transformers and CNNs (Convolutional Neural Networks) are two different types of neural network architectures used to solve computer vision tasks. Vision Transformers are based on the Transformer architecture, originally designed for natural language processing, but adapted for image analysis. CNNs, on the other hand, are a type of deep learning network specifically designed for image recognition and classification.

Q: What is the main difference between Vision Transformers and CNNs?
A: The main difference lies in their architectural design and the way they process visual information. While CNNs rely on the use of convolutional layers to extract features hierarchically, Vision Transformers utilize self-attention mechanisms to capture global dependencies and relations between image patches directly. This allows Vision Transformers to model long-range interactions within images more effectively than CNNs.
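The self-attention idea can be sketched in a few lines of plain Python. To stay short, this single-head example uses the patch vectors themselves as queries, keys, and values, whereas a real transformer applies learned projections first. The function names and toy tokens are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    dim = len(tokens[0])
    outputs, attention = [], []
    for query in tokens:
        # Scaled dot-product score between this patch and every patch, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        attention.append(weights)
        # Each output is a weighted mix of all patch vectors.
        outputs.append([sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)])
    return outputs, attention


# Three hypothetical 2-dimensional patch embeddings.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed_tokens, attention_weights = self_attention(patch_tokens)
```

Every row of `attention_weights` has a non-zero entry for every patch, so each output already mixes information from the whole image after a single layer.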

Q: When should I choose Vision Transformers over CNNs for edge computing?
A: Vision Transformers are particularly advantageous for tasks that require modeling global relationships in images or capturing long-range dependencies. If your application demands high-level understanding of visual content or if you’re working with large-scale datasets, Vision Transformers can be a better choice. However, CNNs are still widely used and can be more efficient for smaller datasets or real-time applications due to their simpler architecture.

Q: Are Vision Transformers more computationally expensive than CNNs?
A: Vision Transformers are generally more computationally demanding than CNNs because self-attention computes interactions between every pair of image patches, so its cost grows quadratically with the number of patches. In return, the global context provided by self-attention makes Vision Transformers particularly adept at tasks where understanding the image as a cohesive whole is crucial, which is often the case with higher-resolution images. The cost of a CNN’s convolutional operations, by contrast, grows roughly linearly with the number of pixels, which makes CNNs more resource-efficient as resolution increases, although they may capture global dependencies in large images less effectively than Vision Transformers. The choice for high-resolution tasks therefore depends on how the need for global context weighs against computational efficiency.
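A small arithmetic sketch makes the scaling difference concrete. It assumes 16x16 patches and counts only pairwise attention interactions versus convolution output positions, ignoring channel widths, layer counts, and constant factors, so the absolute numbers are not comparable between the two architectures. Only the growth rates are. The function names are hypothetical.

```python
def num_patches(image_size, patch_size=16):
    """Number of non-overlapping patches in a square image."""
    return (image_size // patch_size) ** 2


def attention_pairs(image_size, patch_size=16):
    """Patch-to-patch interactions computed by one global self-attention layer."""
    n = num_patches(image_size, patch_size)
    return n * n


def conv_positions(image_size):
    """Output positions of a stride-1, same-padded convolution (cost is proportional to this)."""
    return image_size * image_size


# Doubling the resolution from 224 to 448 pixels per side:
attention_growth = attention_pairs(448) / attention_pairs(224)
conv_growth = conv_positions(448) / conv_positions(224)
```

Under these assumptions, doubling the side length multiplies the attention interactions by 16 but the convolution positions by only 4.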
