This blog post was originally published at Intel's website. It is reprinted here with the permission of Intel.
Intel has recently introduced Intel® Deep Learning Boost (Intel® DL Boost), a new set of embedded processor technologies designed to accelerate deep learning applications. Intel DL Boost includes new Vector Neural Network Instructions (VNNI) that perform computation in 8-bit precision. Compared with 32-bit floating point precision, 8-bit precision reduces memory usage by roughly 4x and increases the rate of arithmetic operations executed per second. Given a pre-trained model with floating point precision, we obtained a quantized version of the model to exploit Intel DL Boost instructions and accelerate inference performance. Here, we summarize our inference work with 8-bit precision in TensorFlow* using the Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN).
Quantization in TensorFlow
To enable the Intel DL Boost capabilities on 2nd generation Intel® Xeon® Scalable processors, we have enhanced the Intel® Optimization for TensorFlow to support the seamless use of 8-bit inference on models already using 32-bit floating point, with no additional libraries required.
We have also developed the Intel Optimization for TensorFlow Quantization tool, an offline tool that converts a pre-trained 32-bit float model to a quantized model using 8-bit inference. A detailed description and guidelines for model quantization can be found at Intel AI Quantization Tools for TensorFlow. Using these tools, we were able to quantize a number of popular deep learning models, including convolutional and feedforward neural networks, while preserving a high level of accuracy, as shown in Table 1. For ready-to-use purposes, we host several quantized models in Intel-model-zoo.
We have enabled post-training model quantization, which means that users can take a pre-trained floating point model and quantize it. This involves converting floating point activations and weights into 8-bit integers and replacing floating point operators in the computation graph with their quantized versions. The key steps to obtain an optimized quantized model using our tools are as follows:
- Export the fp32 inference model as a serialized TensorFlow GraphDef: This includes saving an inference graph in protobuf format and applying graph transformations that remove redundant nodes (e.g., Identity, CheckNumerics), fold constants, and fold batch-normalization.
- Convert the fp32 graph into a quantized graph: This step replaces fp32 ops with fused quantized ops where possible and adds the necessary conversion ops (e.g., ‘QuantizeV2’, ‘Requantize’) for activations. Weight quantization also takes place during this step.
- Calibrate and optimize the quantized graph: This step runs the quantized graph obtained from the previous step on a small subset of training data (or calibration data) and freezes the ranges of activations. The resulting graph is further optimized by fusing ‘Requantize’ ops.
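The arithmetic behind the conversion and calibration steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by the quantization tool: the helper names (`calibrate_range`, `quantize_unsigned`, `quantize_signed`, `dequantize`), the min/max calibration rule, and the sample data are all hypothetical. It uses unsigned 8-bit codes for non-negative activations and symmetric signed 8-bit codes for weights.

```python
def calibrate_range(batches):
    """Return the smallest and largest activation seen across all calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def quantize_unsigned(values, max_value):
    """Map non-negative floats in [0, max_value] to unsigned 8-bit codes (0..255)."""
    scale = 255.0 / max_value if max_value > 0 else 1.0
    codes = [min(255, max(0, round(v * scale))) for v in values]
    return codes, scale


def quantize_signed(values):
    """Map floats to signed 8-bit codes (-127..127) using a symmetric range."""
    max_abs = max(abs(v) for v in values)
    scale = 127.0 / max_abs if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v * scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from 8-bit codes."""
    return [c / scale for c in codes]


# Hypothetical calibration data: two small batches of non-negative activations.
calibration_batches = [[0.0, 0.5, 1.2], [0.3, 2.0, 0.7]]
_, act_max = calibrate_range(calibration_batches)

# Freeze the calibrated range, then quantize new activations against it.
# Values above the frozen maximum (3.0 here) are clipped to the top code.
act_codes, act_scale = quantize_unsigned([0.0, 1.0, 2.0, 3.0], act_max)

# Weights are quantized offline with their own symmetric scale.
weights = [-0.5, 0.25, 1.0]
w_codes, w_scale = quantize_signed(weights)
w_restored = dequantize(w_codes, w_scale)
```

Once the activation range is frozen, the scale becomes a constant in the graph, so no range computation is needed at inference time. The round trip through `dequantize` loses at most half a quantization step per weight.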
To illustrate these steps, we show the resulting subgraph from each step in Figure 1. Note that widely used CNNs (e.g., ResNet, Inception) exhibit a repeating conv2d → batch-norm → relu op sequence. After batch-norm folding, this pattern turns into a subgraph similar to Figure 1(a), which is replaced by a fused quantized operator as shown in Figure 1(b). Further optimization, as shown in Figure 1(c), is done after calibration. Since most convolutions receive non-negative rectified input because relu is their preceding op, our quantized convolutions take unsigned 8-bit integers as input and signed 8-bit integers as filters. This unsigned and signed combination is also important for performance, since the required arithmetic operations can be done efficiently with the currently available Intel DL Boost instruction VPDPBUSD (for details, see Intel Software Manual).
Figure 1: Resulting subgraphs after each of the three steps: (a) fp32, (b) 8-bit quantized, and (c) calibrated 8-bit quantized.
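To make the unsigned-by-signed combination concrete, the sketch below models in plain Python the arithmetic that one 32-bit lane of VPDPBUSD performs: four unsigned 8-bit values are multiplied by four signed 8-bit values, and the products are summed into a signed 32-bit accumulator without saturation. The function names `vpdpbusd_lane` and `dot_u8s8` are hypothetical, and this is only a model of the arithmetic, not of how the hardware or Intel MKL-DNN implements it.

```python
def vpdpbusd_lane(acc, a_u8, b_s8):
    """Model one 32-bit lane: acc + sum of four (unsigned 8-bit * signed 8-bit) products."""
    assert len(a_u8) == 4 and len(b_s8) == 4
    assert all(0 <= a <= 255 for a in a_u8)      # activations: unsigned 8-bit
    assert all(-128 <= b <= 127 for b in b_s8)   # filter values: signed 8-bit
    total = acc + sum(a * b for a, b in zip(a_u8, b_s8))
    # Wrap to a signed 32-bit result (this models the non-saturating behavior).
    total &= 0xFFFFFFFF
    return total - (1 << 32) if total >= (1 << 31) else total


def dot_u8s8(a_u8, b_s8):
    """Dot product of two equal-length vectors, consuming four element pairs per step."""
    assert len(a_u8) == len(b_s8) and len(a_u8) % 4 == 0
    acc = 0
    for i in range(0, len(a_u8), 4):
        acc = vpdpbusd_lane(acc, a_u8[i:i + 4], b_s8[i:i + 4])
    return acc


activations = [1, 2, 3, 4, 5, 6, 7, 8]       # unsigned 8-bit inputs
filters = [1, -1, 1, -1, 2, -2, 2, -2]       # signed 8-bit filter values
result = dot_u8s8(activations, filters)      # (1-2+3-4) + (10-12+14-16) = -6
```

Each step consumes four multiply-accumulate pairs per 32-bit lane, which is where the higher arithmetic rate relative to fp32 comes from in this model.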
Besides the Conv2D and MatMul ops that can exploit Intel DL Boost instructions, we also have quantized pooling and concat ops that significantly reduce the memory bandwidth bottleneck and avoid unnecessary quantize and dequantize ops. In fact, for the best performance we recommend keeping the flow of 8-bit precision ops uninterrupted as much as possible.
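One way to see why pooling can stay in 8-bit precision: with a fixed positive scale, dequantization preserves ordering, so taking the maximum over the 8-bit codes selects the same element as taking the maximum over the dequantized floats. The sketch below checks this for a 1-D max pool; the `max_pool_1d` helper, the scale, and the codes are hypothetical illustration values, not the quantized pooling op itself.

```python
def max_pool_1d(values, window):
    """Non-overlapping 1-D max pooling over consecutive windows."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


scale = 255.0 / 2.0                  # hypothetical frozen activation scale
codes = [0, 64, 200, 13, 255, 90]    # unsigned 8-bit activation codes

# Path 1: pool directly on the 8-bit codes, dequantize only at the end.
pooled_codes = max_pool_1d(codes, 2)           # [64, 200, 255]
stays_int8 = [c / scale for c in pooled_codes]

# Path 2: dequantize first, then pool in floating point.
via_fp32 = max_pool_1d([c / scale for c in codes], 2)
```

Both paths give the same values, but the first moves one byte per element through the pooling op instead of four and needs no conversion ops around it.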
Some pre-trained models (e.g., MobileNet) show different data distributions across the channels of their weight tensors. In such cases, using a single scale parameter to quantize a whole weight tensor can cause a large accuracy loss. We mitigated this shortcoming by introducing new operators, such as `RequantizePerChannel` and `RequantizationRangePerChannel`. With this per-channel extension of our model quantization tool, we were able to recover the accuracy loss of MobileNet-related models.
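The effect of a single scale on channels with very different ranges can be shown with a small numeric sketch. The weights below are made-up values for two output channels, and `fake_quantize` and `max_abs_error` are hypothetical helpers; the sketch only illustrates the per-channel idea and does not reproduce the `RequantizePerChannel` or `RequantizationRangePerChannel` operators.

```python
def fake_quantize(values, max_abs):
    """Quantize to signed 8-bit codes for the given range, then dequantize again."""
    scale = 127.0 / max_abs
    return [max(-127, min(127, round(v * scale))) / scale for v in values]


def max_abs_error(original, approx):
    """Largest absolute difference between the original and restored values."""
    return max(abs(o - a) for o, a in zip(original, approx))


# Hypothetical filter weights: channel 0 is tiny, channel 1 is large.
channels = [[0.01, -0.02, 0.015], [5.0, -3.0, 4.0]]

# Per-tensor: one range shared by every channel.
tensor_max = max(abs(w) for ch in channels for w in ch)
per_tensor_err = [max_abs_error(ch, fake_quantize(ch, tensor_max))
                  for ch in channels]

# Per-channel: each channel gets its own range.
per_channel_err = [max_abs_error(ch, fake_quantize(ch, max(abs(w) for w in ch)))
                   for ch in channels]
```

With the shared range, the small channel collapses to almost a single code, so its error is on the order of the weights themselves. With its own range, the same channel uses the full 8-bit span, and the large channel is unaffected.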
Accuracy and Performance
We have enabled 8-bit inference for several popular deep learning models for image classification, object detection, and recommender systems. Table 1 reports the accuracy and performance speedup of some CNN models on 2nd gen Intel Xeon Scalable processors. As can be seen, Intel DL Boost speeds up inference significantly while keeping the accuracy very close to that of the fp32 models.
|Model|Top 1 Accuracy (%), FP32 (Intel Xeon Scalable)|Top 1 Accuracy (%), INT8 (2nd Gen Intel Xeon Scalable)|Throughput Speedup (2nd Gen Intel Xeon Scalable)|
Table 1: Accuracy and performance of floating point and quantized models.
To conclude, Intel DL Boost on 2nd gen Intel Xeon Scalable processors delivers promising results for accelerating deep learning models used for computer vision, natural language, and speech processing. With our toolset, you can quantize fp32 models for improved inference performance in TensorFlow without any other library dependency. Check out our quantization tools and examples at intel-quantization-tool.
- Intel.AI blog post: Lowering Numerical Precision to Increase Deep Learning Performance
- How to Quantize Neural Networks with TensorFlow
- Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Performance was measured with synthetic data and a minibatch size of 128.
Ashraf Bhuiyan, Mahmoud Abuzaina, Niranjan Hasabnis, Niroop Ammbashankar, Karen Wu, Ramesh AG, Clayne Robison, Bhavani Subramanian, Srinivasan Narayanamoorthy, Cui Xiaoming, Mandy Li, Guozhong Zhuang, Lakshay Tokas, Wei Wang, Jiang Zhoulong, Wenxi Zhu, Guizi Li, Yiqiang Li, Rajesh Poornachandran, Rajendrakumar Chinnaiyan.
Huma Abidi, Jayaram Bobba, Banky Elesha, Dina Jones, Moonjung Kyung, Karthik Vadla, Wafaa Taie, Jitendra Patil, Melanie Buehler, Lukasz Durka, Michal Lukasziewski, Abolfazl Shahbazi, Steven Robertson, Preethi Venkatesh, Nathan Greeneltch, Emily Hutson, Anthony Sarah, Evarist M Fomenko, Vadim Pirogov, Roma Dubstov.
Tatiana Shpeisman, Thiru Palanisamy, and Penporn Koanantakool from Google.
Notices and Disclaimers
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information, visit www.intel.com/benchmarks.
Performance results are based on testing as of 3/1/2019 and may not reflect all publicly available security updates. No product or component can be absolutely secure.
2nd Gen Intel Xeon Scalable Processor Platform:
2 socket Intel® Xeon® Platinum 8280 Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32GB/ 2933 MHz), BIOS: SE5C620.86B.0D.01.0271.120720180605 (ucode:0x4000013),CentOS 7.6, 4.19.5-1.el7.elrepo.x86_64, Deep Learning Framework: https://hub.docker.com/r/intelaipg/intel-optimized-tensorflow:PR25765-devel-mkl (https://github.com/tensorflow/tensorflow.git commit: 6f2eaa3b99c241a9c09c345e1029513bc4cd470a + Pull Request PR 25765, PR submitted for upstreaming), Compiler: gcc 6.3.0,MKL DNN version: v0.17, Datatype: INT8
Intel Xeon Scalable Processor Platform:
2 socket Intel® Xeon® Platinum 8180 Processor, 28 cores HT On Turbo ON Total Memory 384 GB (12 slots/ 32GB/ 2633 MHz), BIOS: SE5C620.86B.0D.01.0286.121520181757, CentOS 7.6, 4.19.5-1.el7.elrepo.x86_64, Deep Learning Framework: https://hub.docker.com/r/intelaipg/intel-optimized-tensorflow:PR25765-devel-mkl (https://github.com/tensorflow/tensorflow.git commit: 6f2eaa3b99c241a9c09c345e1029513bc4cd470a + Pull Request PR 25765, PR submitted for upstreaming) Compiler: gcc 6.3.0,MKL DNN version: v0.17, Datatype: FP32
Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Software Engineer, AIPG, Intel
Senior Machine Learning Software Engineer, Intel