This blog post was originally published at Vision Systems Design's website. It is reprinted here with the permission of PennWell.
Convolutional neural networks (CNNs) and other deep learning techniques are among the hottest topics in computer vision today, as the number of columns I've devoted to the subject attests. Most recently, I discussed three talks from May's Embedded Vision Summit, all of which covered processors for deep learning, each delving into a different co-processor type: GPUs, FPGAs, and DSPs. Today, I'd like to go into more detail, showcasing videos and an article that provide in-depth implementation tips once you've made your processor architecture selection. And for those who want a hands-on technical introduction to deep learning for computer vision, see the information about an upcoming live tutorial at the end of the article.
The first presentation, "Semantic Segmentation for Scene Understanding: Algorithms and Implementations," was delivered at the Embedded Vision Summit by Nagesh Gupta, CEO of middleware provider Auviz Systems (recently acquired by Xilinx). Many people are familiar with the use of CNNs to classify an image by identifying the dominant object in it. A more challenging problem is understanding a complex scene, including identifying which regions of the image are associated with each object in the scene, a task known as "semantic segmentation." Recent research in deep learning, according to Gupta, provides powerful tools to address the problem of automated scene understanding. Modifying deep learning methods such as CNNs to classify each pixel in a scene, aided by its neighboring pixels, has produced very good semantic segmentation results.
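To make the core idea concrete, here is a toy sketch (my own illustration, not Gupta's algorithm): semantic segmentation amounts to assigning a class label to every pixel, and consulting each pixel's neighborhood makes those labels more coherent. The "classifier" below is a made-up intensity threshold standing in for a CNN, followed by a 3x3 neighborhood majority vote.

```python
# Toy semantic-segmentation sketch: label every pixel, then smooth the
# labels with a 3x3 neighborhood majority vote. The per-pixel classifier
# here is a stand-in threshold, not a real CNN.

def classify_pixel(intensity, threshold=128):
    """Stand-in per-pixel classifier: 1 = 'object', 0 = 'background'."""
    return 1 if intensity >= threshold else 0

def segment(image):
    """Label each pixel using a majority vote over its 3x3 neighborhood."""
    h, w = len(image), len(image[0])
    # First pass: independent per-pixel labels.
    raw = [[classify_pixel(image[y][x]) for x in range(w)] for y in range(h)]
    out = [[0] * w for _ in range(h)]
    # Second pass: each pixel takes the majority label of its neighborhood.
    for y in range(h):
        for x in range(w):
            votes = [raw[ny][nx]
                     for ny in range(max(0, y - 1), min(h, y + 2))
                     for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = 1 if sum(votes) * 2 >= len(votes) else 0
    return out

# A 4x4 image with one isolated bright pixel; the vote relabels it as
# background, since none of its neighbors agree.
img = [[200, 200,  10,  10],
       [200, 200,  10,  10],
       [ 10,  10,  10, 200],
       [ 10,  10,  10,  10]]
print(segment(img))
```

A real segmentation network replaces both the threshold and the vote with learned convolutions, but the structure, per-pixel decisions informed by spatial context, is the same.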
These techniques provide a good starting point towards understanding a scene. But deploying the algorithms on embedded hardware at the performance required for real-world applications can be challenging. Gupta's talk provides insights into deep learning solutions for semantic segmentation, focusing on current state-of-the-art algorithms and implementation choices. Gupta discusses the effect of porting these algorithms to fixed-point representations, along with the pros and cons of implementing them on FPGAs. Here's a preview:
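Porting to fixed-point, as Gupta discusses, trades numeric precision for cheaper arithmetic on FPGAs. A minimal sketch of the idea, using a hypothetical signed 8-bit format with 6 fractional bits (the format and weight values are my own illustration, not from the talk):

```python
# Minimal fixed-point quantization sketch: map float weights to signed
# 8-bit integers with 6 fractional bits, then dequantize and measure
# the error introduced.

FRAC_BITS = 6                     # fractional bits in the fixed-point format
SCALE = 1 << FRAC_BITS            # 2**6 = 64
INT8_MIN, INT8_MAX = -128, 127

def to_fixed(x):
    """Quantize a float to a signed 8-bit fixed-point integer code."""
    q = int(round(x * SCALE))
    return max(INT8_MIN, min(INT8_MAX, q))   # saturate on overflow

def to_float(q):
    """Dequantize an integer code back to a float."""
    return q / SCALE

weights = [0.7071, -0.333, 1.25, -1.999]     # example float weights
fixed = [to_fixed(w) for w in weights]
recovered = [to_float(q) for q in fixed]
errors = [abs(w - r) for w, r in zip(weights, recovered)]

print(fixed)        # the quantized integer codes
print(max(errors))  # worst-case error; bounded by 1/(2*SCALE) absent saturation
```

The "pros and cons" Gupta weighs follow directly from this trade: 8-bit multipliers are far cheaper in FPGA fabric than floating-point units, but every weight and activation absorbs a quantization error like the one measured above, so accuracy must be re-validated after conversion.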
For more on this topic, check out a recently published article, "FPGAs for Deep Learning-based Vision Processing," authored by three engineers in Intel's Programmable Solutions Group (formerly Altera). FPGAs, they note, have proven to be a compelling solution for solving deep learning problems, particularly when applied to image recognition. According to the authors, FPGAs' advantages for deep learning primarily derive from three key factors: their massively parallel architectures, efficient DSP resources, and large amounts of on-chip memory and bandwidth. The article delves into detail on each of these points, and also includes suggestions on how to efficiently up- and down-scale your design based on the desired performance and the available on-chip resources.
Also presenting at the Embedded Vision Summit was Sofiane Yous, Principal Scientist in the machine intelligence group at Movidius (also in the process of being acquired by Intel), who delivered the tutorial "Dataflow: Where Power Budgets Are Won and Lost." Yous rightly points out that trading off between power consumption and performance in deep learning, as well as other embedded vision and computational imaging applications, can often be described as a "battle," and his talk showcases stories from the front lines of this conflict.
Yous begins by demonstrating why good dataflow is so critical to performance and energy efficiency; he then shows why modern techniques and APIs are critical for fast time-to-market, and summarizes relevant academic work. He compares the usage models and benefits of emerging APIs such as TensorFlow versus classic approaches for deep learning. He also presents specific examples such as the GoogLeNet implementation under Caffe and TensorFlow. Here's a preview:
For a deeper understanding of deep learning techniques for vision, attend the hands-on tutorial "Deep Learning for Vision Using CNNs and Caffe," on September 22, 2016 in Cambridge, Massachusetts. This full-day tutorial is focused on convolutional neural networks for vision and the Caffe framework for creating, training, and deploying them. Presented by the primary Caffe developers from the U.C. Berkeley Vision and Learning Center, it takes participants from an introduction to the theory behind convolutional neural networks to their actual implementation, and includes hands-on labs using Caffe. Visit the event page for more tutorial details and to register.
Editor-in-Chief, Embedded Vision Alliance