Model Compression: Needs and Importance

Smart City connected by IoT (Image by Tumisu from Pixabay)

This blog post was originally published at Xailient’s website. It is reprinted here with the permission of Xailient.

Whether you’re new to computer vision or an expert, you’ve probably heard about AlexNet winning the ImageNet challenge in 2012. That was the turning point in computer vision history because it showed that deep learning models can perform tasks which were considered very difficult for computers, with an unprecedented level of accuracy.

But did you know that AlexNet had 62 million trainable parameters?

Interesting, right?

Another popular model, VGGNet, which came out in 2014, had even more: 138 million trainable parameters.

That’s more than twice as many as AlexNet.

You might be thinking: I know that the deeper the model is, the better it will perform, so why highlight the number of parameters? Obviously, the deeper the network, the more parameters it will have.

Complexity and accuracy of well-known convolutional neural networks (CNNs). #Parameters and #MACCs are given in millions in the above table [8]

Sure, these deep models have been benchmarks in the computer vision industry. But when you want to create a real-world application, would you choose these models?

I guess the real question we should ask here is: CAN YOU USE THESE MODELS IN YOUR APPLICATION?

Hold that thought for just a minute!

Let me divert here for a bit, before I get to the answer. (But feel free to skip to the end.)

The number of IoT devices is expected to reach 125–500 billion by 2030. Assuming that 20% of them will have cameras, IoT devices with cameras are a 13–100 billion unit market. [9,10,11]

IoT camera devices include home security cameras (such as Amazon Ring and Google Nest) that open the door when you reach home or notify you if they see an unknown person, cameras on smart vehicles that assist your driving, and cameras at a parking lot that open the gate when you enter or exit, just to name a few! Some of these IoT devices are already using AI to some extent and others are slowly catching up.

Smart Home system connected with IoT devices (Image by Gerd Altmann from Pixabay)

Many real-world applications demand real-time, on-device processing capabilities. A self-driving car is a perfect example of this. In order for cars to drive down any road safely, they must observe the road in real-time and stop if a person walks in front of the car. In such a case, processing visual information and making a decision needs to be done in real-time, on-device.

So, returning to the earlier question: CAN YOU USE THESE MODELS IN YOUR APPLICATION?

If you’re using Computer Vision, there’s a high chance your application requires an IoT device, and looking at the forecast for the IoT devices, you’re in good company.

The main challenge is that IoT devices are resource-constrained; they have limited memory and low compute power. The more trainable parameters a model has, the bigger it is. Inference time also grows with the number of trainable parameters. Moreover, models with many parameters require more energy and storage than smaller networks with fewer parameters. The end result is that a big model is difficult to deploy on resource-constrained devices. While these models have achieved great results in the lab, they aren’t usable in many real-world applications.

In the lab, you have expensive, high-speed GPUs to get this level of performance [1], but when you deploy in the real world, cost, power, heat and other issues preclude the “just throw more iron at it” strategy.
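To make the link between parameter count and model size concrete, here is a back-of-the-envelope sketch. It assumes every parameter is stored as a 32-bit float (4 bytes), counts 1 MB as 10^6 bytes, and ignores any file-format overhead; `weights_size_mb` is just an illustrative helper name.

```python
# Rough storage cost of a model's weights, ignoring file-format overhead.
BYTES_PER_MB = 1_000_000  # assumption: decimal megabytes

def weights_size_mb(num_params: int, bits_per_param: int = 32) -> float:
    """Size in MB of num_params weights stored at bits_per_param bits each."""
    return num_params * bits_per_param / 8 / BYTES_PER_MB

# Parameter counts quoted earlier in this post.
models = {"AlexNet": 62_000_000, "VGGNet": 138_000_000}

for name, params in models.items():
    print(f"{name}: {weights_size_mb(params):.0f} MB at 32-bit, "
          f"{weights_size_mb(params, 8):.0f} MB at 8-bit")
```

Even before counting activations or runtime memory, weights of this size are a lot to ask of a small camera or microcontroller-class device.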

Deploying deep learning models on the cloud is an option as it can provide high computational and storage availability. However, it will have poor response times due to network latency, which is unacceptable in many real-time applications (and don’t get me started on the network connectivity’s impact on overall reliability, or privacy!).

In short, AI processing needs to happen close to the data source, preferably on the IoT device itself!

That leaves us with one option: Reducing the size of the model.

Making a smaller model that can run under the constraints of edge devices, without compromising on accuracy, is a key challenge. It is not enough to have a small model that can run on resource-constrained devices. It should also perform well, in terms of both accuracy and inference speed.

So how do you fit these models on limited devices? How do you make them usable in real-world applications?

Here are a few techniques that can be used to reduce the size of a model so that you can deploy it on your IoT device.

Pruning

Pruning reduces the number of parameters by removing redundant, unimportant connections that have little effect on performance. This not only helps reduce the overall model size but also saves on computation time and energy.

Pruning (Source: Intel)
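As a minimal sketch of the idea, the snippet below applies magnitude pruning to a single weight matrix with NumPy. Using the absolute value of a weight as a stand-in for its importance is an assumption made for illustration (it is one common heuristic, not the only criterion), and `magnitude_prune` is a hypothetical helper, not a framework API.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    # Indices of the k smallest-magnitude entries in the flattened array
    flat = np.abs(weights).ravel()
    drop = np.argpartition(flat, k - 1)[:k]
    mask = np.ones(weights.size, dtype=bool)
    mask[drop] = False
    return weights * mask.reshape(weights.shape)

rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 128))  # stand-in for a dense layer's weights
pruned = magnitude_prune(layer, sparsity=0.8)
print("non-zero before:", np.count_nonzero(layer))
print("non-zero after: ", np.count_nonzero(pruned))
```

Note that the pruned matrix still has the same shape. The zeros only save space if the weights are stored in a sparse or compressed format, and they only save time if the runtime can skip them.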

Pros:

Can be applied during or after training
Can improve the inference time/model size vs. accuracy tradeoff for a given architecture [12]
Can be applied to both convolutional and fully connected layers

Cons:

Generally does not help as much as switching to a better architecture [12]
Implementations that improve latency are rare; TensorFlow’s implementation, for example, only reduces model size

Speed and size tradeoff for original and pruned models [13]

Quantization

In a deep neural network (DNN), weights are typically stored as 32-bit floating-point numbers. Quantization is the idea of representing these weights with fewer bits. The weights can be quantized to 16, 8, 4 or even 1 bit. Reducing the number of bits per weight can significantly reduce the size of the network.

Quantization (Source: Intel)
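Here is a minimal sketch of post-training quantization from 32-bit floats to 8-bit integers. It assumes a simple per-tensor linear mapping between the smallest and largest weight; real frameworks use more refined schemes, and `quantize_uint8` and `dequantize` are illustrative names only.

```python
import numpy as np

def quantize_uint8(w: np.ndarray):
    """Map float weights linearly onto the 256 levels of a uint8."""
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0  # fall back to 1.0 for constant tensors
    q = np.clip(np.round((w - w_min) / scale), 0, 255).astype(np.uint8)
    return q, scale, w_min

def dequantize(q: np.ndarray, scale: float, offset: float) -> np.ndarray:
    """Recover approximate float32 weights from the 8-bit codes."""
    return q.astype(np.float32) * scale + offset

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 256)).astype(np.float32)
q, scale, offset = quantize_uint8(w)
w_hat = dequantize(q, scale, offset)

print("bytes as float32:", w.nbytes, "bytes as uint8:", q.nbytes)
print("max abs error:", float(np.abs(w - w_hat).max()), "step size:", scale)
```

Storage drops by a factor of four, and the reconstruction error of each weight is bounded by about half a quantization step.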

Pros:

Can be applied both during and after training
Can be applied to both convolutional and fully connected layers

Cons:

Quantized weights make it harder for neural networks to converge. A smaller learning rate is needed to ensure that the network performs well [13]
Quantized weights make back-propagation infeasible, since gradients cannot back-propagate through discrete neurons. Approximation methods are needed to estimate the gradients of the loss function with respect to the input of the discrete neurons [13]
As far as I can tell, TensorFlow’s quantization-aware training does not quantize the weights during training itself: statistics are gathered during training and used to quantize the model afterwards. If so, the two points above may not apply to it
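One widely used approximation for the gradient problem mentioned above is the straight-through estimator (STE). The source only says that approximation methods are needed, so picking the STE here is my assumption. The toy sketch below trains a linear model whose weights are binarized to +1 or -1 in the forward pass, while the gradient is passed straight through to a latent floating-point copy of the weights; `binarize` and `ste_step` are hypothetical names.

```python
import numpy as np

def binarize(w: np.ndarray) -> np.ndarray:
    """Forward pass: replace each latent weight by +1 or -1."""
    return np.where(w >= 0, 1.0, -1.0)

def ste_step(w, x, y, lr=0.01):
    """One SGD step for y ~ x @ binarize(w) using a straight-through estimator."""
    w_q = binarize(w)
    err = x @ w_q - y                    # forward pass uses the quantized weights
    grad_wq = 2.0 * x.T @ err / len(x)   # exact gradient w.r.t. the quantized weights
    # Straight-through: treat d(w_q)/d(w) as 1 where |w| <= 1 and 0 elsewhere
    grad_w = grad_wq * (np.abs(w) <= 1.0)
    return w - lr * grad_w               # update the latent float weights

rng = np.random.default_rng(0)
true_w = np.array([1.0, -1.0, 1.0, -1.0])
x = rng.normal(size=(200, 4))
y = x @ true_w
w = rng.normal(scale=0.1, size=4)        # latent weights start near zero
for _ in range(200):
    w = ste_step(w, x, y, lr=0.05)
print("learned signs:", binarize(w))
```

The quantizer itself has zero gradient almost everywhere, which is why the backward pass has to pretend it is the identity.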

Knowledge distillation

In knowledge distillation, a large, complex model is first trained on a large dataset. Once this large model generalizes and performs well on unseen data, its knowledge is transferred to a smaller network. The larger model is known as the teacher model and the smaller network as the student network.

Knowledge Distillation (Source: Towards Data Science)
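How the knowledge is transferred varies between methods. A commonly used formulation, which I assume here for illustration, trains the student to match the teacher's temperature-softened output probabilities in addition to the true labels. The sketch below computes such a loss with NumPy; the `temperature` and `alpha` values are arbitrary example settings.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target loss (teacher) and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the softened teacher and student distributions
    soft = -(p_teacher * np.log(p_student_soft + 1e-12)).sum(axis=-1).mean()
    # Ordinary cross-entropy against the ground-truth labels
    p_student = softmax(student_logits)
    hard = -np.log(p_student[np.arange(len(labels)), labels] + 1e-12).mean()
    # The temperature**2 factor keeps the two terms on a comparable scale
    return alpha * temperature ** 2 * soft + (1 - alpha) * hard

teacher = np.array([[5.0, 1.0, 0.0], [0.5, 4.0, 0.2]])
student = np.array([[2.0, 1.0, 0.5], [1.0, 1.5, 0.3]])
labels = np.array([0, 1])
print("distillation loss:", distillation_loss(student, teacher, labels))
```

Minimizing this loss with respect to the student's parameters pulls its predictions toward both the teacher's outputs and the correct labels.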

Selective Attention

Selective attention is the idea of focusing on objects or elements of interest, while discarding the others (often background or other task-irrelevant objects). It is inspired by the biology of the human eye. When we look at something, we only focus on one or a few objects at a time, and other regions are blurred out.

Selective Attention

This requires adding a selective attention network upstream of your existing AI system or using it by itself if it serves your purpose. It depends on the problem you are trying to solve.
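To illustrate the "upstream" arrangement, the sketch below crops a frame to a couple of regions of interest before anything is handed to a downstream model. The boxes are hard-coded, hypothetical stand-ins for the output of a selective attention network, and `crop_regions` is an illustrative helper.

```python
import numpy as np

def crop_regions(frame: np.ndarray, boxes):
    """Return the sub-image for each (top, left, height, width) box."""
    return [frame[t:t + h, l:l + w] for (t, l, h, w) in boxes]

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)  # one full-HD frame
# Hypothetical regions of interest, as an attention network might report them
boxes = [(200, 300, 128, 128), (600, 1000, 128, 128)]

crops = crop_regions(frame, boxes)
full_pixels = frame.shape[0] * frame.shape[1]
crop_pixels = sum(c.shape[0] * c.shape[1] for c in crops)
print(f"downstream model sees {crop_pixels / full_pixels:.1%} of the pixels")
```

The downstream model then processes only a small fraction of the original pixels, which is where the speed-up comes from.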

Pros:

Faster inference
Smaller model (e.g. a face detector and cropper that’s only 44 KB!)
Accuracy gain (by focusing downstream AI on only the regions/objects of interest)

Cons:

Supports only training from scratch

Low-rank Factorization

Low-rank factorization uses matrix or tensor decomposition to estimate the informative parameters. A weight matrix A of dimension m x n and rank r is replaced by smaller matrices whose product approximates it. In other words, this technique helps by factorizing one large matrix into several smaller ones.

Low-rank Factorization (Source: ResearchGate)
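As a sketch, a truncated singular value decomposition (SVD) replaces an m x n matrix with an m x r and an r x n matrix, so storage drops from m·n to r·(m + n) values. Choosing SVD is an assumption on my part (other decompositions exist), and the matrix sizes below are arbitrary.

```python
import numpy as np

def low_rank_factorize(A: np.ndarray, r: int):
    """Return U_r (m x r) and V_r (r x n) with U_r @ V_r ~ A, via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Fold the top r singular values into the left factor
    return U[:, :r] * s[:r], Vt[:r, :]

rng = np.random.default_rng(0)
m, n, r = 512, 256, 16
# Synthetic weight matrix: rank r plus a little noise
A = rng.normal(size=(m, r)) @ rng.normal(size=(r, n))
A += 0.01 * rng.normal(size=(m, n))

U_r, V_r = low_rank_factorize(A, r)
rel_err = np.linalg.norm(A - U_r @ V_r) / np.linalg.norm(A)
print("parameters before:", m * n, "after:", U_r.size + V_r.size)
print("relative reconstruction error:", rel_err)
```

At inference time the layer computes `x @ U_r @ V_r` instead of `x @ A`, which also reduces the number of multiplications when r is small.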

Pros:

Can be applied during or after training
Can be applied to both convolutional and fully connected layers
When applied during training, can reduce training time

The best part is that all of the above techniques are complementary. Each can be applied on its own or combined with one or more of the others. For example, a three-stage pipeline of pruning, quantization and Huffman coding reduced a pre-trained VGG16 model, trained on the ImageNet dataset, from 550 MB to 11.3 MB, which is roughly 49 times smaller. Most of the techniques discussed above can be applied to pre-trained models as a post-processing step to reduce model size and increase inference speed, but they can be applied during training as well. Quantization is gaining popularity and has now been baked into machine learning frameworks. We can expect pruning to be baked into popular frameworks very soon.
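The Huffman coding stage of that pipeline can be sketched as follows. After pruning and quantization, a few weight values (zero in particular) occur far more often than others, so giving frequent values shorter codes saves bits. The weight distribution below is made up for illustration, and `huffman_code_lengths` is a hypothetical helper that returns only the code lengths, not the codes themselves.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    freq = Counter(symbols)
    if len(freq) == 1:  # a single distinct symbol still needs one bit
        return {next(iter(freq)): 1}
    tie = count()       # tie-breaker so the heap never compares dicts
    heap = [(f, next(tie), {s: 0}) for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol in them gets one bit deeper
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

# Made-up 2-bit quantized weights; pruning makes index 0 dominate
weights = [0] * 70 + [1] * 15 + [2] * 10 + [3] * 5
lengths = huffman_code_lengths(weights)
total_bits = sum(lengths[s] for s in weights)
print("fixed 2-bit encoding:", 2 * len(weights), "bits")
print("Huffman encoding:    ", total_bits, "bits")
```

The more skewed the distribution of quantized weights, the larger the saving from this lossless final step.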

In this article, we looked at the motivation for deploying deep-learning based models to resource-constrained devices such as IoT devices and the need to reduce model size so they fit without compromising accuracy. We also discussed the pros and cons of some modern techniques to compress deep-learning models. Finally, we touched on the idea that each of the techniques can either be applied individually or can be combined.

Be sure to explore all the techniques for your model, post-training as well as during training, and figure out what works best for you.

References:

[1] https://towardsdatascience.com/machine-learning-models-compression-and-quantization-simplified-a302ddf326f2
[2] Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006, August). Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 535–541).
[3] Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A survey of model compression and acceleration for deep neural networks. arXiv preprint arXiv:1710.09282.
[4] http://mitchgordon.me/machine/learning/2020/01/13/do-we-really-need-model-compression.html
[5] https://software.intel.com/content/www/us/en/develop/articles/compression-and-acceleration-of-high-dimensional-neural-networks.html
[6] https://towardsdatascience.com/the-w3h-of-alexnet-vggnet-resnet-and-inception-7baaaecccc96
[7] https://www.learnopencv.com/number-of-parameters-and-tensor-sizes-in-convolutional-neural-network/
[8] Véstias, M. P. (2019). A survey of convolutional neural networks on edge with reconfigurable computing. Algorithms, 12(8), 154.
[9] https://technology.informa.com/596542/number-of-connected-iot-devices-will-surge-to-125-billion-by-2030-ihs-markit-says
[10] https://www.cisco.com/c/dam/en/us/products/collateral/se/internet-of-things/at-a-glance-c45-731471.pdf
[11] Mohan, A., Gauen, K., Lu, Y. H., Li, W. W., & Chen, X. (2017, May). Internet of video things in 2030: A world with many cameras. In 2017 IEEE International Symposium on Circuits and Systems (ISCAS) (pp. 1–4). IEEE.
[12] Blalock, D., Ortiz, J. J. G., Frankle, J., & Guttag, J. (2020). What is the state of neural network pruning? arXiv preprint arXiv:2003.03033.
[13] Guo, Y. (2018). A survey on methods and theories of quantized neural networks. arXiv preprint arXiv:1808.04752.

Sabina Pokhrel
Customer Success AI Engineer, Xailient
