This blog post was originally published at Intel’s website. It is reprinted here with the permission of Intel.
The popularity of convolutional neural network (CNN) models and the ubiquity of CPUs mean that better inference performance can deliver significant gains to a larger number of users than ever before. As multi-core processors become the norm, efficient threading is required to leverage parallelism. This blog post covers recent advances in the Intel® Distribution of OpenVINO™ toolkit threading support and associated performance improvements.
The Intel Distribution of OpenVINO toolkit (a developer tool suite for high-performance deep learning on Intel® architecture) offers a multi-threading model that is portable and free from low-level details. Users do not need to explicitly start and stop threads (e.g. inference threads), or even know how many processors or cores are being used. The result is optimized CPU performance for CNNs on any supported target, easily deployed out of the box.
Real-life applications require more than a single CNN, and combining multiple networks into a single dynamic pipeline requires different components to efficiently coordinate, share, and synchronize. Performing certain pre-processing tasks (e.g. color conversion, resizing or cropping) and inference post-processing tasks (e.g. parsing output results) with third-party components may present complications when it comes to reuse of processing threads. As detailed below, the Intel Distribution of OpenVINO toolkit addresses these composability problems by using the Intel® Threading Building Blocks library (Intel® TBB) as a common threading layer.
Memory management and thread organization are even more important in systems with multiple cores and large memory capacity. On NUMA systems, the location of a thread relative to the memory it accesses significantly affects performance, so hardware awareness is crucial. Below, we also discuss how the Intel Distribution of OpenVINO toolkit handles scheduling strategy and memory allocation on NUMA systems.
Composability and Performance Pillars
Prior to the 2019 R1 release, the Intel Distribution of OpenVINO toolkit relied on the OpenMP runtime, which keeps processing threads active to facilitate the rapid start of parallel regions. The toolkit was initially designed to have a separate thread pool for every network executed on the CPU, so that networks could run inference simultaneously and be configured to use different numbers of threads. With OpenMP, these pools could result in heavy oversubscription, negatively impacting performance even for less complex pipelines with just two or three components.
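To make the oversubscription concern concrete, here is a back-of-the-envelope sketch in Python. The function name and the example numbers (three networks on a CPU with eight hardware threads) are illustrative assumptions, not measurements and not toolkit code.

```python
def oversubscription_factor(threads_per_pool, num_pools, hardware_threads):
    """Ratio of software threads kept active to hardware threads available."""
    if hardware_threads <= 0:
        raise ValueError("hardware_threads must be positive")
    return (threads_per_pool * num_pools) / hardware_threads


# Hypothetical pipeline: three networks on a CPU with 8 hardware threads.
# One private pool per network, each sized to the whole machine.
separate_pools = oversubscription_factor(threads_per_pool=8, num_pools=3, hardware_threads=8)

# A single pool shared by every network.
shared_pool = oversubscription_factor(threads_per_pool=8, num_pools=1, hardware_threads=8)

print(separate_pools)  # 3.0
print(shared_pool)     # 1.0
```

Under these assumptions, every additional private pool adds another full set of active threads competing for the same cores, which is why even a two- or three-component pipeline can be affected.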
With scheduling based on the Intel TBB library available beginning with the Intel Distribution of OpenVINO toolkit 2019 R1 release, composability was enabled. The two features of the Intel TBB library primarily responsible for composability are the global thread pool and individual (e.g. per-network) task arenas, while Intel TBB decides at runtime how to map (inference) tasks to hardware (threads).
In fact, the Intel TBB library can orchestrate other parallel tasks in the application, beyond inference. For example, one specific way to further optimize application performance is to let the Intel Distribution of OpenVINO toolkit perform image preprocessing. The supported preprocessing includes resizing, color format and data type conversions, mean subtraction, and so on. Although not part of the inference itself (since it is executed before feeding an image tensor to the network), the preprocessing is also threaded. And since the thread pool implementation is now fully based on Intel TBB, the worker threads are shared with the inference run-time, avoiding oversubscription.
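The idea of one shared pool serving several components can be sketched with Python's standard `ThreadPoolExecutor`. This is only an analogy for the shared worker pool described above, not Intel TBB and not toolkit code; the component names, `MAX_WORKERS`, and the doubling "work" are hypothetical placeholders.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # stand-in for the number of hardware threads
COMPONENTS = ("preprocess", "detect", "classify")  # hypothetical pipeline stages


def run_pipeline():
    """Run every component's tasks on one shared pool and count the workers used."""
    seen_threads = set()
    lock = threading.Lock()

    def work(component, item):
        # Record which worker thread executed this task.
        with lock:
            seen_threads.add(threading.get_ident())
        return (component, item * 2)

    # One pool for all components, instead of one pool per component.
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [pool.submit(work, c, i) for c in COMPONENTS for i in range(8)]
        results = [f.result() for f in futures]
    return results, len(seen_threads)


results, worker_count = run_pipeline()
print(len(results), worker_count <= MAX_WORKERS)  # 24 True
```

However many components submit work, the number of worker threads stays bounded by the size of the single pool, which is the property that keeps the pipeline from oversubscribing the CPU.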
Moreover, libraries are often not easy to compose in parallel. Here, the Intel Distribution of OpenVINO toolkit follows a simple rule of thumb: use the same threading model for all included modules or libraries. For example, OpenCV, a popular and supported computer vision library that uses pthreads by default, is another potential source of additional thread pools and oversubscription. To avoid this, the Intel Distribution of OpenVINO toolkit now ships OpenCV compiled with Intel® TBB to support full composability between components.
Putting it all together, the diagram below shows an example of the Security Barrier Demo pipeline using OpenCV, preprocessing and multi-model inference:
Figure 1. Example pipeline of the Security Barrier Demo. Notice that the Classification parts are both 1) conditional (e.g. a Vehicle can be detected, but the License Plate not), and 2) iterative (e.g. they loop over detections from the first network). All blue and dark blue components use Intel TBB and share the pool of threads.
Across these demos, which cover use cases from security to smart classrooms, the Intel TBB library provides significant performance gains. The more CNNs a pipeline uses (and potentially executes in parallel), the larger the gain. Performance benchmarks are regularly run on different Intel architecture-based platforms to demonstrate these claims. Stay tuned for benchmarks demonstrating performance gains using the Intel TBB library here.
One specific example where the OpenMP code path is potentially more performant than Intel TBB is a completely static single-network scenario that runs the same network a zillion times and requires no dynamic adjustments. For this reason, traditional High-Performance Computing (HPC) style benchmarks may show a slight reduction in performance due to the dynamic nature of Intel® TBB. The latest release, Intel Distribution of OpenVINO toolkit 2020.1, brings significant performance improvements for the Intel TBB code path. Now, on all topologies, Intel TBB is either on par with OpenMP or only marginally slower.
Therefore, we are planning to remove the OpenMP code path from future releases of the Intel Distribution of OpenVINO toolkit. We argue that any advantage previously seen from completely static scheduling is becoming rare for real-world use cases, like inference pipelines used in these demos.
NUMA Awareness: More Performance on the Table
An important challenge faced by inference application developers on CPU-based platforms is that memory controllers are per-socket, which leads to Non-Uniform Memory Access (NUMA). Getting full performance on NUMA systems requires:
- discovering what your platform topology is,
- controlling at which NUMA node your data is placed, and
- controlling where your work executes (node affinity).
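As a rough illustration of the first requirement (topology discovery), the sketch below reads the per-node CPU lists that Linux exposes under `/sys/devices/system/node`. It is a minimal, Linux-only example and not how the toolkit itself discovers topology; the helper names are hypothetical, and on systems without that sysfs directory the discovery function simply returns an empty dictionary.

```python
import glob
import os


def parse_cpulist(text):
    """Parse a Linux cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def discover_numa_nodes(sysfs_root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]}; empty if the sysfs directory is unavailable."""
    nodes = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. "node0"
        node_id = int(node_dir[len("node"):])
        with open(path) as f:
            nodes[node_id] = parse_cpulist(f.read())
    return nodes


print(parse_cpulist("0-3,8-11"))  # [0, 1, 2, 3, 8, 9, 10, 11]
print(discover_numa_nodes())      # depends on the machine
```

The resulting node-to-CPU map is the input the other two requirements need: it tells you which CPUs a piece of work should run on so that it stays close to its data.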
In response, the Intel Distribution of OpenVINO toolkit introduced run-time scheduling that is NUMA-aware. Temporary data, such as intermediate buffers for individual CNN layers, is allocated in a NUMA-aware manner, while read-only data, such as weights, is carefully cloned per NUMA node. As a result, all execution takes place on NUMA-local memory (with the exception of network inputs, which can come from any NUMA node).
Initially, the NUMA support was implemented as a combination of
- tbb::task_arena, which limits concurrency by providing a limited number of slots for worker threads, and
- tbb::task_scheduler_observer, which assigns NUMA node affinity to worker threads joining the arenas.
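The same two ingredients, a bounded set of worker slots and a hook that pins each worker as it joins, can be mimicked in Python for illustration. This is an analogy only and not the toolkit's C++ implementation: `ThreadPoolExecutor` stands in for the arena, its `initializer` callback stands in for the observer, and the "node 0" CPU set is a hypothetical split of whatever CPUs the process is allowed to use. Pinning relies on `os.sched_setaffinity`, which is available on Linux; elsewhere the sketch runs without pinning.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def current_cpus():
    """CPUs the calling thread may run on (falls back to all CPUs off Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def pin_current_thread(cpus):
    """Restrict the calling thread to `cpus`; a no-op where unsupported."""
    if hasattr(os, "sched_setaffinity"):
        try:
            # On Linux, pid 0 applies the mask to the calling thread.
            os.sched_setaffinity(0, cpus)
        except OSError:
            pass  # e.g. the environment does not permit changing affinity


def make_node_pool(cpus):
    """A bounded pool whose workers pin themselves to one node's CPUs on start-up."""
    return ThreadPoolExecutor(
        max_workers=len(cpus),           # limited slots, like an arena's concurrency
        initializer=pin_current_thread,  # runs once in each worker as it starts
        initargs=(cpus,),
    )


# Hypothetical split: treat the first half of the allowed CPUs as "node 0".
allowed = current_cpus()
node0_cpus = allowed[: max(1, len(allowed) // 2)]

with make_node_pool(node0_cpus) as pool:
    # Each task reports the CPU set its worker thread is allowed to run on.
    observed = list(pool.map(lambda _: current_cpus(), range(4)))

print(all(set(cpus) <= set(allowed) for cpus in observed))  # True
```

Because memory is typically placed on the node of the thread that first touches it, running the allocation and initialization work inside such a pinned pool is also what keeps the data local.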
The main (or master) threads were also forced to load the CNN graphs through the corresponding arenas, so that graph memory was also allocated (and first-touched) on the correct NUMA node. This approach was introduced with the Intel Distribution of OpenVINO toolkit 2019 R2 release. This initial implementation was available only for Linux, yet showed promising performance. Therefore, we implemented official support for this method in the initial Intel TBB 2020 release. Specifically, the following new classes were introduced:
- tbb::task_arena::constraints, which limits the arena to a specific NUMA node, and
- tbb::info::numa_nodes, which reports the available NUMA nodes (i.e. with respect to the process’s mask), using hwloc internally.
These classes greatly simplified the code, as ad-hoc system configuration parsing, process mask handling, a custom task_scheduler_observer, and so on are no longer required.
Finally, based on these new Intel TBB library features, the new Intel Distribution of OpenVINO toolkit 2020.1 includes NUMA support for all OS targets (i.e. beyond Linux). Most importantly, it enables out-of-the-box NUMA support for Windows Server (2016 or 2019), an OS that is widely used in many industry deployments, such as those in healthcare.
With more and more use cases, such as medical imaging, being enabled by deep learning, faster inference with the toolkit can eventually enable new and more advanced algorithms.
The latest release of the toolkit also introduces a new NUMA thread binding mode. It is more adaptive and lightweight than the (default) binding of threads to cores, leaving more room for the OS to schedule threads. We recommend using this mode in any heavily contended scenario involving CPU inference on NUMA systems.
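As a hedged sketch of how this mode might be selected from Python: to our understanding, the CPU plugin exposes the binding mode through the `CPU_BIND_THREAD` configuration key, with `NUMA` as the value for the new mode, and the Inference Engine `IECore` object accepts such settings through `set_config`. Treat the key, the value, and the call as assumptions to verify against the documentation of your toolkit version; the two helper functions are hypothetical, and the block itself does not import the toolkit.

```python
def numa_binding_config():
    """CPU plugin settings that request NUMA-level thread binding (assumed key/value)."""
    return {"CPU_BIND_THREAD": "NUMA"}


def configure_cpu(core, config):
    """Apply `config` to the CPU device of an Inference Engine core object.

    `core` is assumed to behave like `openvino.inference_engine.IECore`,
    i.e. to provide `set_config(config, device_name)`.
    """
    core.set_config(config, "CPU")
    return core


# With the toolkit installed, usage would look roughly like this (not run here):
#   from openvino.inference_engine import IECore
#   core = configure_cpu(IECore(), numa_binding_config())

print(numa_binding_config())  # {'CPU_BIND_THREAD': 'NUMA'}
```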
In this post, we discussed how Intel TBB makes the Intel Distribution of OpenVINO toolkit such a reliable solution for complex applications with many dynamic inference pipelines. First, real-life applications are designed with multiple components that execute concurrently. To be effective in these deployments, the Intel Distribution of OpenVINO toolkit has to get composability right, and it does!
Second, for large machines, the main challenge is controlling where the memory for data is allocated with respect to the code that uses it. The Intel Distribution of OpenVINO toolkit is a NUMA-aware deep learning software tool suite that automatically determines the topology of the system, purposefully allocates the memory, and manages the threads to ensure that data is manipulated by threads running on the local CPU.
As a result, the toolkit offers new levels of CPU inference performance, now coupled with dynamic task scheduling and efficient mapping to current and future multi-core platforms, and fully adaptive to dynamic changes in the load and system resources at runtime.
As a further benefit, it’s easier than ever to access performance gains on Windows Server-based systems when using the latest release of the toolkit. Now, CPU-based inference is producing better images much, much faster. We are committed to ongoing improvements; we encourage you to give it a try and send us your feedback!
To learn more about making your code future-ready and sharpen your technical skills, sign up for free technical webinars on the Intel Distribution of OpenVINO toolkit and the Intel TBB library.
Senior Software Engineer, Intel Architecture, Graphics and Software Group, Intel
Principal Engineer, Internet of Things Group, Intel