This blog post was originally published at Embedl’s website. It is reprinted here with the permission of Embedl.
CNNs have been the workhorses of computer vision ever since AlexNet’s dramatic breakthrough in 2012. Recently, however, the vision transformer (ViT) has been changing the picture.
CNNs have an inductive spatial bias baked into them through their convolutional kernels, whereas vision transformers are based on a much more general architecture. In fact, the first vision transformers used an architecture from NLP tasks almost without change and simply chopped the input image into a sequence of patches in the most naïve way possible. Nevertheless, given enough data, they overcome the lack of a built-in spatial bias and beat CNNs. This may be another example of Rich Sutton’s famous “bitter lesson” of AI: “building in how we think we think does not work in the long run … breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.”
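The patch-chopping step described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming a grayscale image stored as a list of rows; the function name `image_to_patches` and the toy 4×4 image are made up for this example and do not come from any particular ViT implementation. A real model would also project each flattened patch through a learned linear layer and add position embeddings.

```python
def image_to_patches(image, patch_size):
    """Split an H x W image (a list of rows) into flattened, non-overlapping patches.

    Patches are emitted in row-major order, giving the token sequence
    that a transformer would consume.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            patches.append(patch)
    return patches


# A toy 4x4 "image" whose pixel values are 0..15 (hypothetical data).
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
patch_sequence = image_to_patches(toy_image, patch_size=2)
print(len(patch_sequence))  # 4
print(patch_sequence[0])    # [0, 1, 4, 5]
```

Nothing in this step knows about edges, textures, or neighbourhoods beyond the patch boundary, which is what makes the approach so generic.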
It turns out that vision transformers “see” very differently from CNNs. A team from Google Brain carefully studied the representations produced by the two architectures. While it is folklore that CNNs start with very low-level local information and gradually build up more global structures in the deeper layers, ViTs already have access to global information in the earliest layers thanks to global self-attention. As pithily summarized in a Quanta article, “If a CNN’s approach is like starting at a single pixel and zooming out, a transformer slowly brings the whole fuzzy image into focus.” Another interesting observation to emerge from that study is that skip connections are very important for ViTs.
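To make “global from the first layer” concrete, here is a minimal sketch of a single self-attention step over a handful of patch tokens. It assumes identity query, key, and value projections to stay short (real ViTs learn these projections and use multiple heads), and the token values are invented for illustration. The point is only that every token receives a non-zero weight from every other token, however far apart the patches are in the image.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections, so queries, keys, and values
    are the tokens themselves.
    """
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    weights, outputs = [], []
    for query in tokens:
        # Score this token against every token in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all tokens.
        outputs.append([sum(w * value[d] for w, value in zip(row, tokens))
                        for d in range(dim)])
    return outputs, weights


# Four hypothetical 2-dimensional patch embeddings.
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.5]]
attended, attention_weights = self_attention(patch_tokens)
for row in attention_weights:
    print([round(w, 3) for w in row])
```

A convolution with a 3×3 kernel, by contrast, would mix only immediately adjacent positions in its first layer.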
ViTs are causing great excitement for several reasons besides outperforming CNNs. Because of their general-purpose architecture, they offer the potential for a single uniform solution to all vision tasks, rather than different solutions crafted for different tasks. Previous approaches had to treat different types of relationships (pixel to pixel, pixel to object, or object to object) differently, whereas transformers handle all of them in the same way. This uniformity also makes transformers well suited to multi-modal inputs, which are becoming increasingly important: image and text inputs can be handled in the same model.
So we will soon enter the era of “foundation models” for vision and multi-modal inputs, just like the “foundational” GPT-style models for NLP. These behemoths will have hundreds of billions of parameters, dwarfing the previous ResNet models with their tens of millions of parameters. This means that model compression will become ever more important for bringing the benefits of these large models to small edge devices.
Enter Embedl! The good news is that our experiments and several recent papers have shown that compression methods such as pruning and especially quantization seem to be much more effective for ViTs than they were for CNNs. In recent pilot projects with the largest tier-one suppliers worldwide, we have demonstrated very impressive results in compressing the most widely used ViT models.
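For readers unfamiliar with quantization, the basic idea can be sketched as follows. This is a generic illustration of symmetric 8-bit weight quantization, not Embedl’s method; the function names and the example weights are assumptions made up for this sketch. Each floating-point weight is mapped to an integer in [-127, 127] using a single scale factor, so it can be stored in one byte instead of four.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]


# Hypothetical layer weights.
layer_weights = [0.5, -1.27, 0.0, 1.0, 0.013]
q_weights, q_scale = quantize_int8(layer_weights)
restored = dequantize(q_weights, q_scale)
max_error = max(abs(w - r) for w, r in zip(layer_weights, restored))
print(q_weights)
print(max_error <= q_scale / 2 + 1e-9)  # rounding error is at most half a step
```

The reconstruction error is bounded by half a quantization step, which is why models whose accuracy tolerates small weight perturbations compress so well under this scheme.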


