Open Sourcing the AI Model Efficiency Toolkit: Contributing State-of-the-art Compression and Quantization Techniques from Qualcomm AI Research

This blog post was originally published at Qualcomm’s website. It is reprinted here with the permission of Qualcomm.

At Qualcomm Technologies, we’ve been actively researching [1, 2, 3] and developing AI solutions with the goal of making artificial intelligence ubiquitous across devices, machines, vehicles, and things. Our focus on power efficiency over the past decade has led to dramatic improvements in AI performance per watt that have enabled a variety of enhanced experiences, from on-device virtual assistants and translation to smart security cameras and safety-focused driving.

A driving force behind these improvements in performance per watt has been our leading research in AI model efficiency. By model efficiency, we mean techniques that shrink models, reduce computations, reduce memory traffic, lower latency, and efficiently use hardware. We have traditionally contributed our breakthrough AI research to the rest of the community through papers and workshops at academic conferences like NeurIPS or through commercialization of products, like the Qualcomm Neural Processing SDK.

Now we’re going a step further. Qualcomm Innovation Center (QuIC) is excited to open source the AI Model Efficiency Toolkit (AIMET) on GitHub to collaborate with other leading AI researchers and to provide a simple library plugin that AI developers can use for state-of-the-art model efficiency performance. The goal of this open source project is to help migrate the ecosystem toward integer inference, because we believe this is an effective way to increase performance per watt.

AIMET for power-efficient AI at scale

AIMET is a library that supports advanced quantization and compression techniques for trained neural network models. Quantization techniques systematically reduce the number of bits used for weight parameters and activation calculations, for example moving from a 32-bit floating-point value to an 8-bit fixed-point value, without sacrificing model accuracy. Compression techniques systematically remove activation nodes and connections between nodes, again without sacrificing model accuracy. AIMET supports a variety of advanced quantization techniques, such as data-free quantization, and compression techniques, such as spatial singular value decomposition (SVD) and channel pruning.


Compression or quantization reduces the model size of a deep neural network.
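
To make the bit-width reduction concrete, here is a minimal NumPy sketch, illustrative rather than AIMET code, of the affine quantize/dequantize step that maps a float32 tensor onto the 8-bit integer grid:

```python
import numpy as np

def quantize_dequantize_int8(x):
    """Affine 8-bit quantization of a float32 array (illustrative sketch)."""
    x_min = min(float(x.min()), 0.0)   # include 0 so it lands exactly on the grid
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / 255.0    # 255 steps between the 256 int8 levels
    zero_point = -128 + int(round(-x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    x_hat = scale * (q.astype(np.float32) - zero_point)  # dequantize to measure error
    return q, x_hat

w = np.random.randn(64, 128).astype(np.float32)   # a toy weight tensor
q, w_hat = quantize_dequantize_int8(w)
print("max abs error:", np.abs(w - w_hat).max())  # bounded by roughly scale / 2
print("bytes: %d -> %d" % (w.nbytes, q.nbytes))   # the 4x memory saving of int8 vs. float32
```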

Manually optimizing a neural network does not scale, since it is time-consuming and costly in terms of engineering resources. In designing AIMET, the focus was on developing techniques that provide significant improvements to model efficiency through simple API calls. AIMET automatically improves the run-time performance, latency, power efficiency, and memory requirements of deep learning models while avoiding time-consuming and difficult-to-repeat hand-tuning. The library plugs directly into TensorFlow and PyTorch training frameworks for ease of use, allowing developers to call APIs directly from their existing pipelines.


AIMET includes quantization and compression techniques that allow for simple deployment of AI models at scale.
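
To give a feel for the programming model, the sketch below follows the PyTorch quantization-simulation flow from AIMET’s documentation; module paths and signatures may differ between releases, so treat it as indicative rather than exact:

```python
import torch
from torchvision import models
from aimet_torch.quantsim import QuantizationSimModel

model = models.resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the trained FP32 model with simulated fixed-point quantization ops.
sim = QuantizationSimModel(model, dummy_input=dummy_input)

def calibrate(sim_model, _):
    # Run a few representative batches so AIMET can pick quantization ranges.
    with torch.no_grad():
        sim_model(dummy_input)

sim.compute_encodings(forward_pass_callback=calibrate,
                      forward_pass_callback_args=None)

# sim.model now behaves like the quantized model and can be evaluated or
# fine-tuned inside the existing training pipeline before export.
```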

It was also important to make sure that AIMET can take advantage of common hardware acceleration techniques. AIMET is designed to enable neural networks to run more efficiently on fixed-point AI hardware accelerators, such as those available on Qualcomm Snapdragon platforms.

So why should you be interested in AIMET? It’s the results. The toolkit is based on work published in several Qualcomm AI Research papers, including data-free quantization (DFQ) [1]. Through a series of simple API calls, AIMET can quantize an existing 32-bit floating-point model to an 8-bit fixed-point model without sacrificing much accuracy and without model fine-tuning. For example, the DFQ method applied to several popular networks, such as MobileNet-v2 and ResNet-50, results in less than 0.9% loss in accuracy all the way down to 8-bit quantization, in an automated way and without any training data. In addition, quantized models that we’ve run on the Qualcomm Hexagon DSP rather than on the Qualcomm Kryo CPU have shown a 5x to 15x speedup. The 8-bit model also has a 4x smaller memory footprint relative to the 32-bit model.


Data-free quantization enables INT8 inference with very minimal loss in accuracy relative to the FP32 model.
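
The core of DFQ is cross-layer equalization: because ReLU is positively homogeneous, per-channel scale factors can be moved between consecutive layers to balance their weight ranges before quantization, without changing the network’s output. Here is a minimal NumPy sketch of the idea for a pair of fully connected layers (illustrative only; AIMET’s implementation also handles convolutions, batch-norm folding, and bias correction):

```python
import numpy as np

def equalize_pair(w1, b1, w2):
    """Cross-layer equalization for y = W2 @ relu(W1 @ x + b1), a sketch.

    Scaling output channel i of layer 1 by 1/s_i and input channel i of
    layer 2 by s_i leaves the function unchanged (ReLU is positively
    homogeneous) but balances per-channel weight ranges for quantization.
    """
    r1 = np.abs(w1).max(axis=1)   # range of each output channel of W1
    r2 = np.abs(w2).max(axis=0)   # range of each input channel of W2
    s = np.sqrt(r1 / r2)          # both ranges become sqrt(r1 * r2)
    return w1 / s[:, None], b1 / s, w2 * s[None, :]

rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(64, 32)), rng.normal(size=64)
w2 = rng.normal(size=(16, 64)) * rng.uniform(0.01, 10, size=64)  # badly scaled channels
w1e, b1e, w2e = equalize_pair(w1, b1, w2)

x = rng.normal(size=32)
y = w2 @ np.maximum(w1 @ x + b1, 0)
y_eq = w2e @ np.maximum(w1e @ x + b1e, 0)
print(np.allclose(y, y_eq))      # True: the network function is preserved
```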

Similarly, AIMET can also significantly compress models. For popular models such as ResNet-50 and ResNet-18, compression with spatial SVD plus channel pruning achieves a 50% reduction in MACs (multiply-accumulate operations) while retaining accuracy within 1% of the original uncompressed model.


AIMET compression techniques (spatial SVD and channel pruning) reduce MACs by 50% while retaining accuracy within approximately 1% of the original model.
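
For intuition on where the MAC savings come from: spatial SVD factors each k x k convolution into a k x 1 convolution followed by a 1 x k convolution via a truncated SVD of the reshaped kernel. The same low-rank arithmetic is easiest to see on a fully connected layer, as in this illustrative sketch (weight SVD rather than spatial SVD, with a random matrix standing in for trained weights):

```python
import numpy as np

# Replace an (m x n) weight matrix with two rank-r factors,
# cutting MACs per input from m*n down to r*(m + n).
w = np.random.randn(1024, 1024).astype(np.float32)

u, s, vt = np.linalg.svd(w, full_matrices=False)
r = 256                                     # chosen rank (a tuning knob)
a = u[:, :r] * s[:r]                        # (m x r) factor
b = vt[:r, :]                               # (r x n) factor

macs_before = w.shape[0] * w.shape[1]       # 1,048,576 MACs per input
macs_after = r * (w.shape[0] + w.shape[1])  # 524,288 MACs: a 50% reduction
print(macs_before, macs_after, macs_after / macs_before)

x = np.random.randn(1024).astype(np.float32)
err = np.linalg.norm(w @ x - a @ (b @ x)) / np.linalg.norm(w @ x)
print("relative error:", err)  # large for random weights; trained weights are far
                               # more compressible, and fine-tuning recovers accuracy
```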

A simple integration into common AI development workflows

Qualcomm Technologies has been creating tools for developers to more efficiently utilize hardware for many years — from graphics acceleration to computational camera applications. We know how important it is for tools to fit into typical development workflows, abstract complexity, offer compelling benefits, and be easy to use. For example, the Qualcomm Neural Processing SDK is engineered to help developers save time and effort in optimizing performance of trained neural networks on devices with Snapdragon. In fact, our quantization techniques have been shipping with the Qualcomm Neural Processing SDK since Summer 2019.

For the QuIC AIMET project, developers can grab the latest library, which integrates seamlessly with their existing training workflows. AIMET takes a trained TensorFlow or PyTorch model as input, which can then be compressed, quantized, and fine-tuned. Quantized models run efficiently on hardware with fixed-point acceleration. As an example, the optimized model is output in ONNX or TensorFlow format, which can then be run on Snapdragon via the Qualcomm Neural Processing SDK.
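
As a sketch of that last handoff step, a fine-tuned PyTorch model can be serialized to ONNX with stock PyTorch tooling (AIMET’s own export additionally writes out the computed quantization encodings alongside the model; this generic example omits that):

```python
import torch
from torchvision import models

# After compression and quantization-aware fine-tuning, serialize to ONNX.
# The resulting .onnx file is what a runtime such as the Qualcomm Neural
# Processing SDK would then consume.
model = models.mobilenet_v2(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(model, dummy_input, "mobilenet_v2.onnx",
                  input_names=["input"], output_names=["logits"],
                  opset_version=11)
```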

We’re also excited to report that these techniques have been tested in the wild by real developers on real commercial applications, with improvements that match our theoretical benchmark results. For example, they have been used to optimize commercial models for biometrics, speech recognition, and automotive applications.

Advancing AI model efficiency research through collaboration

AI model efficiency is a critical research area of shared importance across the AI community, enabling the AI ecosystem and accelerating on-device AI development at scale. QuIC created this project to collaborate with other AI researchers, enhance our state-of-the-art model efficiency research, and contribute to the open-source community. QuIC is committed to contributing cutting-edge research to this project on a regular basis. Please join us in working together to advance AI model efficiency.

At Qualcomm AI Research, we believe that research is not meant to stay in the lab. We quickly commercialize and scale our research breakthroughs across devices and industries — reducing the time between research in the lab and offering advances that enrich lives. The open sourcing of AIMET is further speeding up this innovation cycle.

References:

  1. Markus Nagel, Mart van Baalen, Tijmen Blankevoort, Max Welling. “Data-Free Quantization Through Weight Equalization and Bias Correction.” IEEE International Conference on Computer Vision (ICCV), Seoul, October 2019 (oral presentation).
  2. Markus Nagel, Rana Ali Amjad, Mart van Baalen, Christos Louizos, Tijmen Blankevoort. “Up or Down? Adaptive Rounding for Post-Training Quantization.”
  3. Andrey Kuzmin, Markus Nagel, Saurabh Pitre, Sandeep Pendyam, Tijmen Blankevoort, Max Welling. “Taxonomy and Evaluation of Structured Compression of Convolutional Neural Networks.”
