This blog post was originally published at Vision Systems Design's website. It is reprinted here with the permission of PennWell.
Last time, I showcased new videos and articles from the Embedded Vision Alliance that provide tips for those using convolutional neural networks (CNNs) and other deep learning techniques. This time, I'll highlight additional in-depth resources that focus on using various vision processor types. And for those who want a hands-on technical introduction to deep learning for computer vision, see the details about next week's live tutorial at the end of this column.
If you're interested in learning how to accelerate deep learning vision processing on DSPs, vision processors and CNN processors, I recommend the article "Deep Learning for Object Recognition: DSP and Specialized Processor Optimizations," co-authored by experts at BDTI, Cadence, Movidius, NXP Semiconductors, and Synopsys. The authors provide an overview of CNNs and then dive into optimization techniques for object recognition and other computer vision applications. The DSPs, vision processors and CNN processors discussed in the article are well matched for both initial neural network training and (especially) inferencing tasks, since the structure of CNN algorithms offers lots of opportunity for parallelization, and the types of computation operations used, such as repetitions of MACs (multiply-accumulates), are very uniform.
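To make the point about uniform MAC operations concrete, here is a minimal sketch of one convolution computed as nested loops. It is not taken from the article. The function name `conv2d_direct` and the sample data are illustrative assumptions, and it uses the cross-correlation form common in CNN frameworks.

```python
def conv2d_direct(image, kernel):
    """Valid-mode 2D convolution (cross-correlation form), counting MACs.

    Illustrative only: single channel, stride 1, no padding.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1
    out = [[0.0] * ow for _ in range(oh)]
    macs = 0
    for y in range(oh):              # every output pixel is independent,
        for x in range(ow):          # so these two loops can run in parallel
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    # One multiply-accumulate (MAC): the same operation
                    # repeated for every kernel tap and every output pixel.
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
                    macs += 1
            out[y][x] = acc
    return out, macs


# Hypothetical 4x4 input and 2x2 kernel, chosen only for illustration.
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 0, 1, 3]]
kernel = [[1, 0],
          [0, -1]]
result, mac_count = conv2d_direct(image, kernel)
```

The MAC count is simply output height times output width times kernel height times kernel width, which is why hardware with many parallel MAC units maps well onto this workload.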
For another perspective on the topic, check out "A Design Approach for Real Time Classifiers," a technical article authored by two lead engineers at design consultancy firm PathPartner Technology. Object detection and classification, they note, are often done via a supervised learning process. The offline classifier training process fetches sets of selected images containing objects of interest, extracts features from this input, and maps them to corresponding labeled classes in order to generate a classification model. Real-time inputs are then categorized based on the trained classification model in an online process. The authors provide design considerations for porting the real-time online algorithms to SoCs such as Texas Instruments' ADAS processors, which contain a variety of heterogeneous processing elements: ARM CPU cores, DSP cores, GPU cores, and specialized vision processor cores.
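The offline-training and online-classification split described above can be sketched in a few lines. This is not the PathPartner authors' method. It assumes a toy nearest-centroid classifier, and the feature extractor, function names, and sample images are all hypothetical.

```python
import math
from collections import defaultdict


def extract_features(image):
    """Hypothetical features: mean intensity and mean horizontal gradient."""
    pixels = [p for row in image for p in row]
    grads = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return (sum(pixels) / len(pixels), sum(grads) / len(grads))


def train_offline(labeled_images):
    """Offline step: map features of labeled images to one centroid per class."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for image, label in labeled_images:
        features = extract_features(image)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    # The "classification model" here is just the per-class mean feature vector.
    return {label: tuple(s / counts[label] for s in sums[label]) for label in sums}


def classify_online(model, image):
    """Online step: assign the class whose centroid is nearest in feature space."""
    features = extract_features(image)
    return min(model, key=lambda label: math.dist(features, model[label]))


# Made-up 2x2 training images for two illustrative classes.
training_set = [
    ([[10, 10], [10, 10]], "flat"),
    ([[12, 12], [12, 12]], "flat"),
    ([[0, 20], [0, 20]], "edge"),
    ([[0, 24], [0, 24]], "edge"),
]
model = train_offline(training_set)
prediction = classify_online(model, [[1, 19], [1, 19]])
```

Only `classify_online` has to meet real-time constraints, so in a design like the one the authors discuss it is the part that would be ported to the SoC's processing elements.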
If GPUs are your deep learning coprocessor of choice, take a look at "Using SGEMM and FFTs to Accelerate Deep Learning," a recent Embedded Vision Summit talk presented by Gian Marco Iodice, a software engineer at ARM. With the emergence of deep learning, Iodice observes, matrix multiplication and the fast Fourier transform are becoming increasingly important, particularly as use cases extend into mobile and embedded devices. After a brief introduction to the nature of CNN computations, Iodice explores the use of SGEMM (single-precision floating point general matrix multiplication) and mixed-radix FFTs to accelerate 3D convolution. He shows example OpenCL implementations of these functions, and highlights their advantages, limitations and trade-offs. Central to the techniques explored is an emphasis on cache-efficient memory accesses and the crucial role of reduced-precision data types.
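As a rough illustration of how convolution can be recast as matrix multiplication (and not a reproduction of the OpenCL code from the talk), the sketch below uses the widely known im2col approach in plain Python. The names `im2col`, `gemm`, and `conv2d_gemm` are illustrative, and the example assumes a single input channel, stride 1, and no padding.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of the image into one row of a matrix."""
    ih, iw = len(image), len(image[0])
    return [[image[y + ky][x + kx] for ky in range(kh) for kx in range(kw)]
            for y in range(ih - kh + 1)
            for x in range(iw - kw + 1)]


def gemm(a, b):
    """Naive general matrix multiply: (n x k) times (k x m) gives (n x m)."""
    k, m = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(m)]
            for row in a]


def conv2d_gemm(image, kernels):
    """Apply several same-sized kernels to one image with a single GEMM."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches = im2col(image, kh, kw)                      # (oh*ow) x (kh*kw)
    weights = [[kern[ky][kx] for kern in kernels]        # (kh*kw) x num_kernels
               for ky in range(kh) for kx in range(kw)]
    return gemm(patches, weights)                        # (oh*ow) x num_kernels


# Made-up 3x3 input and two 2x2 kernels, for illustration only.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernels = [[[1, 0], [0, -1]],
           [[1, 1], [1, 1]]]
responses = conv2d_gemm(image, kernels)
```

The trade-off is extra memory for the unrolled patch matrix in exchange for one large, regular matrix multiply, which is the kind of operation optimized GEMM routines handle with cache-friendly access patterns.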
For a deeper understanding of deep learning techniques for vision, attend next week's hands-on tutorial "Deep Learning for Vision Using CNNs and Caffe," on September 22, 2016 in Cambridge, Massachusetts. This full-day tutorial is focused on convolutional neural networks for vision and the Caffe framework for creating, training, and deploying them. Presented by the primary Caffe developers from the U.C. Berkeley Vision and Learning Center, it takes participants from an introduction to the theory behind convolutional neural networks to their actual implementation, and includes hands-on labs using Caffe. Visit the event page for more tutorial details and to register.
Editor-in-Chief, Embedded Vision Alliance