The Caffe Deep Learning Framework: An Interview with the Core Developers


Spend any amount of time researching the topic of deep learning and you'll inevitably come across the term Caffe. This convolutional neural network (CNN) framework, originally named DeCAF, was initially developed by Yangqing Jia (now a research scientist at Google), during his Ph.D. program at the University of California, Berkeley. It is now maintained by U.C. Berkeley's Vision and Learning Center (BVLC), whose faculty members include Trevor Darrell, Pieter Abbeel, Jitendra Malik, and the founder of Lytro, Ren Ng.

Deep learning has rapidly become a leading method for object classification and other functions in computer vision, and Caffe is a popular platform for creating, training, evaluating and deploying deep neural networks. (In response to the popularity of deep neural networks and Caffe, the Embedded Vision Alliance is organizing a half-day tutorial on CNNs and Caffe to be held on May 2, 2016 in Santa Clara, California as part of the upcoming Embedded Vision Summit.)

The Embedded Vision Alliance recently conducted an interview with Evan Shelhamer, Jeff Donahue and Jonathan Long, the core Caffe developers at U.C. Berkeley, to understand the history, current status and future plans for Caffe and the developers' aspirations for the upcoming workshop. An edited transcript of the conversation follows.

What were the motivations for you to begin work on Caffe?

Prior to the development of standardized, open-source toolkits such as Caffe, deep learning had delivered results that were compelling but elusive and not reliably repeatable. Alex Krizhevsky's AlexNet was the first "open" tool set used in academia and industry; Caffe is both "cleaner" and more extensible, allowing for support of both CPUs and GPUs (for example) via setting of a single flag. Sharing of existing network models is also a straightforward process with Caffe, which is key to its widespread adoption, enabling both rapid reuse and evolution.
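As a rough illustration of the single-flag idea, the sketch below assumes the flag in question is the `solver_mode` field of a Caffe solver definition; the interview does not name it. The file name `train_val.prototxt` and the numeric values are placeholders, and the result is not a complete or tuned solver file.

```python
# Illustrative only: builds the text of a minimal Caffe-style solver definition.
# Assumption: the "single flag" is the solver_mode field. The net file name and
# the hyperparameter values below are placeholders, not recommendations.

def solver_text(use_gpu: bool) -> str:
    """Return a minimal solver definition; only solver_mode depends on use_gpu."""
    mode = "GPU" if use_gpu else "CPU"
    lines = [
        'net: "train_val.prototxt"',
        "base_lr: 0.01",
        'lr_policy: "fixed"',
        "max_iter: 1000",
        f"solver_mode: {mode}",
    ]
    return "\n".join(lines) + "\n"


cpu_cfg = solver_text(use_gpu=False)
gpu_cfg = solver_text(use_gpu=True)

# Collect the lines that differ between the two configurations.
differing = [
    (a, b)
    for a, b in zip(cpu_cfg.splitlines(), gpu_cfg.splitlines())
    if a != b
]
print(differing)  # [('solver_mode: CPU', 'solver_mode: GPU')]
```

Under that assumption, the network description and training settings stay identical across devices; only one line of configuration changes.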

How has Caffe, and open-source deep learning more generally, evolved?

  • December 2012: AlexNet, the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winner, is presented at the Conference on Neural Information Processing Systems (NIPS). Alex Krizhevsky's cuda-convnet is also made public at this time, and DNNResearch is acquired by Google shortly thereafter. These are all pivotal moments for image classification.
  • Spring and Summer 2013: Many academic and industry efforts attempt to reproduce the AlexNet results on ILSVRC by "hacking" cuda-convnet. Yangqing Jia and Jeff Donahue succeed; Professor Trevor Darrell's group at U.C. Berkeley publishes its DeCAF paper, code, and models, demonstrating that deep learning features deliver general improvements for visual recognition and can be fine-tuned for various specific tasks. For the first time, the development community has a public, do-it-yourself deep learning model.
  • December 2013: Caffe v0, a C++/CUDA-based framework for deep learning with a full toolkit for defining, training, and deploying deep networks, is released at NIPS. Caffe is more general-purpose than DeCAF, not to mention faster.
  • Spring 2014: Caffe incorporates new solvers, general network graphs (multi-input, -path, and -output) and weight sharing to encompass a large range of potential models.
  • Summer 2014: Caffe is available for deployment in CPU-only systems.
  • June 2014: CVPR 2014 brings a "wave" of gradients. Ross Girshick's (then-U.C. Berkeley) R-CNN achieves breakthrough accuracy for object detection, providing proof that fine-tuning can improve visual tasks beyond just object classification. The effort to address all recognition tasks via deep learning is now well underway.
  • September 2014: The Caffe Model Zoo is created, to enable model sharing across research groups and industry. ILSVRC 14 is held, with winners including VGG (developed in Caffe) and GoogLeNet (reproduced in Caffe shortly thereafter). The first public Caffe tutorial is also held, at the European Conference on Computer Vision (ECCV). Finally, a coordinated Caffe update release with NVIDIA enables acceleration on GPUs via cuDNN v1.
  • Winter 2014: Vision, language and pixel-wise tasks are now achieving state-of-the-art results through deep learning. Long-Term Recurrent Networks (LRCN) and Fully Convolutional Networks (FCN) are now available in open-source form with reference models in Caffe.
  • Spring 2015: Nets can now be described through Python code, by using NetSpec. Caffe is generalized to N-dimensional data to support 3D images, along with modalities beyond vision.
  • June 2015: A "do-it-yourself" deep learning tutorial is held at CVPR, to a standing-room-only crowd. It includes a vision, language, and robotics "highlight" reel from around the world, and provides how-tos for classification, detection, captioning and language modeling, and semantic segmentation.
  • Fall 2015: Network training in Caffe is parallelized to multiple GPUs, in order to both speed up learning and enable training using ever more data. Near-linear acceleration is now possible for certain configurations. For deployment, I/O is now optional, in order to support "lean" builds.

What is Caffe's current status, and what are the plans for the future?

The current build of Caffe incorporates the latest ILSVRC and Microsoft Common Objects in Context (COCO) layers and models, and new, state-of-the-art reference networks are on the way. Caffe is in a much more mature state than it was a year ago, but there are still a lot of things to fix (which right now is our primary focus). We're homing in on a v1.0 release for the types of things that people currently use Caffe for every day; then, we'll use that stable foundation as the basis for a return to experimentation.

What heterogeneous computing architectures are currently supported by Caffe?

  • GPUs?
  • FPGAs?
  • Dedicated CNN processors?
  • Others?

How will this support breadth and depth evolve over time?

Caffe development to date has been focused on base functionality on x86 CPUs, along with optional acceleration on NVIDIA GPUs. Intel has provided optimized x86 code which we've incorporated; going forward, support for MICs (Intel's Many Integrated Core SoCs, i.e. Xeon Phi) is also likely. "Unofficial" (not, at least yet, integrated into the main Caffe build) OpenCL versions of Caffe are under development from AMD, Intel, ETH Zurich, and the open source community. IBM supposedly also has a low-power "fork" of Caffe in-house, although we don't have any details on it. We're not aware of any active FPGA-based Caffe development, although it may exist. With that all said, there's nothing that fundamentally precludes framework support on other platforms… ARM, MIPS, FPGAs, etc. The design of the Caffe framework is hardware-agnostic, as is the network model format; what changes, depending on the underlying hardware, is the implementation layer that executes the model.
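The separation described above, between a hardware-agnostic model and a hardware-specific implementation layer, can be sketched roughly as follows. This is a simplified pattern in plain Python, not Caffe's actual code: the class names, the `mode` argument, and the method names are hypothetical, and the "GPU" path is an ordinary Python stand-in so the sketch runs anywhere.

```python
from abc import ABC, abstractmethod


class Layer(ABC):
    """Hypothetical layer: one model-facing interface, per-device implementations."""

    def forward(self, bottom, mode):
        # The caller (the model) never changes; only the dispatch target does.
        if mode == "CPU":
            return self.forward_cpu(bottom)
        if mode == "GPU":
            return self.forward_gpu(bottom)
        raise ValueError(f"unknown mode: {mode}")

    @abstractmethod
    def forward_cpu(self, bottom):
        """Reference implementation that every layer must provide."""

    def forward_gpu(self, bottom):
        # Fall back to the CPU path when no device-specific version exists.
        return self.forward_cpu(bottom)


class ReLU(Layer):
    def forward_cpu(self, bottom):
        return [max(0.0, x) for x in bottom]


class ReLUWithAccel(ReLU):
    def forward_gpu(self, bottom):
        # Stand-in for a device kernel; a real port would call device code here.
        return [x if x > 0 else 0.0 for x in bottom]


data = [-1.0, 0.5, 2.0]
plain = ReLU()
accel = ReLUWithAccel()

cpu_out = plain.forward(data, "CPU")
fallback_out = plain.forward(data, "GPU")   # no GPU version: uses the CPU path
accel_out = accel.forward(data, "GPU")      # device-specific path
print(cpu_out, fallback_out, accel_out)
```

In a design along these lines, supporting a new platform means supplying another implementation behind the same interface, while the model description that drives `forward` is left untouched.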

How does Caffe compare to and coexist with other open source frameworks and tools such as Minerva, Theano and Torch, as well as industry-supplied open-source tools like NVIDIA's cuDNN and the recent releases from companies like Facebook and Google?

cuDNN is in a different category from the other tools you mention. It is a companion library, not a framework itself. Its various function calls were developed in collaboration with us, and directly leverage the Caffe interface in some respects. In vision applications, Caffe seems to be the most popular of the open-source frameworks you mention, even though it's younger than Theano, for example. Partly we know this from our partnership with NVIDIA. cuDNN requires that you select a target framework before beginning the download, since it supports multiple framework options; NVIDIA tells us that Caffe is the most popular selection. We also track how many times Caffe itself is downloaded, as well as how many times models are downloaded from the Zoo (tens of thousands, if not hundreds of thousands, to date).

Outside of vision, Torch and Theano are also quite popular. The community aspect is, however, a differentiator; it seems to us that model sharing is much more common with Caffe than with alternatives. Language preferences are one reason for the diversity of open-source framework alternatives; Theano is Python-friendly, for example, while Torch may be preferable for Lua programmers, and Caffe is ideal for those fluent in C++ along with MATLAB (note: Caffe also offers a Python interface option). Theano is also a very general way to do computation, without a heavy focus on performance, whereas Caffe focused on performance from the beginning. As with operating systems and other software packages, philosophical preferences, rather than technical factors, sometimes drive selection. And different preferences can even exist within a single company; Facebook uses Torch widely for research, but Caffe predominantly for deployments.

Facebook has open-sourced extensions to both Torch and Caffe. Facebook AI Research (FAIR) initially released fbcunn, the optimized and tested GPU edition of its deep learning components. More recently the company has released Caffe extensions for efficient inference and converting models from Torch to Caffe for deployment. We're happy to see the company contributing to these open frameworks.

Google's TensorFlow is general-purpose in nature and offers a clear, flexible interface to many kinds of models and optimizations. However, it is currently beset by performance issues, with slow execution time and high memory consumption. There is also a divide between the Google-proprietary and public versions of the toolset (although, to be clear, it's great to see it open-source at all!).

How big is your core development team at U.C. Berkeley (and how is it subdivided), versus independent contributors? How is the total effort divided between the two groups, and how will this evolve?

Responsibility for day-to-day maintenance and enhancements lies with U.C. Berkeley; the three of us are the core developers (i.e. with commit rights to merge changes). With that said, there are currently 162 contributors listed on GitHub, and we get more than one "pull" request per day on average from the outside. Admittedly, however, the contributor network has a significant "long tail" aspect to it; a small percentage of the total developers account for a majority of the activity.

Right now, there's no formalized hierarchy as is the case with Linux; however, a more federated (diffused) development effort is definitely desirable, if for no other reason than to broaden the talent and expertise "pool". Finding these "lieutenant" developers, to handle builds for a portion of the package (for example) while we retain overall "commit" responsibilities, is a key objective for the year ahead.

What industries and applications is Caffe currently being used in, and how do you see this expanding over time? Can you share any specific examples?

Caffe is being used extensively across U.C. Berkeley and more generally across the U.C. network of campuses (such as UCSD and UCSF). Focusing only on the U.C. Berkeley Vision and Learning Center, for example, projects it's currently employed in involve various vision, robotics, and language applications. Other universities and university networks using Caffe that we're currently aware of include MIT, Stanford, CMU, Oxford, Princeton, UIUC, MPI (Germany), INRIA (France), and Tsinghua (China).

Established companies using Caffe include Facebook, Adobe, Microsoft (including Microsoft Research Asia and China), Samsung, Yahoo! Japan, NVIDIA, Intel, Yahoo! and its Flickr subsidiary, and Tesla. Startups using Caffe include Yelp, Pinterest, PlanetLab (satellite imagery), Vision Factory (acquired by DeepMind), SmartSpecs (an Oxford startup to assist the blind) and Bonsai (a machine learning toolkit).

One of us (Jeff Donahue) works part-time at Pinterest. Flashlight is a newly released project of ours. It's a large-scale visual search tool that lets you zoom in on a specific object in a particular Pin’s image and automatically discover visually similar objects, colors, patterns, etc. in related Pins. And it's all done using Caffe.

What are your aspirations for the upcoming May tutorial?

We hope that through our efforts at this event, attendees will better understand what deep learning is, get good ideas of where they may be able to apply it to solve their design problems, and of course actually be able to apply it! In addition, we want to better understand the problems people are trying to solve and how Caffe developers can help them. It's gratifying to see the improvements that developers have been excited about delivering, and equally exciting to see and hear new ideas from the audience.

What do you need from the deep learning community in general, and the computer vision community specifically, to realize your aspirations for Caffe?

Keep doing what you're doing…and do more. Extend the framework, use it to solve your problems, and (this is key) don't neglect to also make your results public. Right now, there are a large number of people using Caffe but not sharing their work, which is contrary to the open-source spirit.

We hope to see you in May!

Want to learn more about deep learning and Caffe? Attend Introduction to Caffe for Designing and Training Convolutional Neural Networks: A Hands-on Tutorial

On May 2, 2016 from 1:30 PM to 5:45 PM, the primary Caffe developers from U.C. Berkeley's Vision and Learning Center will present a half-day tutorial focused on convolutional neural networks for vision and the Caffe framework for deep learning. Organized by the Embedded Vision Alliance and part of the Embedded Vision Summit, the tutorial will take place at the Santa Clara Convention Center in Santa Clara, California.

This live tutorial takes participants from an introduction through the theory behind convolutional neural networks to their actual implementation, and includes multiple hands-on labs using Caffe. It begins with an introduction to the structure, operation, and training of CNNs and how they are used for computer vision. It explores the strengths and weaknesses of CNNs, and how to design and train them. The tutorial then introduces the Caffe open source framework for CNNs, and provides hands-on labs in creating, training, and deploying CNNs using Caffe.

For more tutorial details, including online registration for the Embedded Vision Summit, please visit the event page.
