This blog post was originally published at Syntiant’s website. It is reprinted here with the permission of Syntiant.
Syntiant recently submitted its NDP120 into the MLPerf Tiny category for keyword spotting, and our internal measurements pegged our solution at around 60 uJ/inference with a 6 ms latency. By comparison, the only other published MCU-based result came in at over 7300 uJ/inference, almost 120x more energy for the same network.
Once again, we are seeing how our strategy of building custom neural decision processors pays off in extremely low energy for neural workloads at the edge. Moving to Syntiant Core 2 with the NDP120 silicon gives us even more of the layers and constructs needed to attack a wide variety of networks: convolutional workloads, RNN structures such as LSTM and GRU, as well as dense layers and custom activations.
Even with the NDP101, we often receive questions from customers about our 140 uW power number for always-on-voice (AOV) detection. "Is that your standby number?" "Is that with a voice activity detector?" They are always surprised that this is full-power inference on our 570k-parameter neural network running our Alexa wake word in a noisy environment. The very first thing we do is ship out a development board and encourage our customers to measure the result independently.
So, what is Syntiant’s secret to such low power? I started my Ph.D. in 1997 studying low-power VLSI design. At the time there was a new-found academic push recognizing that CMOS design was not just about pure speed and area (the dominant considerations back then), but that design choices could actually influence power. Here is the basic equation for power in CMOS circuits, which gives you everything you need to know:
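P = α · C · V² · F + P_static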
With this simple equation, you can immediately identify five levers for reducing power in a CMOS design:
- Reducing the activity factor, α
- Reducing the switching frequency, F
- Reducing the total capacitance, C
- Reducing the supply voltage (quadratic gains in power), V
- Reducing the static power in the system
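As a minimal sketch of how these levers interact (the component values below are illustrative, not measurements of any Syntiant part), here is the equation in Python:

```python
# Illustrative CMOS power model: dynamic switching power plus static leakage.
def cmos_power(alpha, c_farads, v_volts, f_hz, p_static_watts):
    return alpha * c_farads * v_volts**2 * f_hz + p_static_watts

# Baseline design point (made-up numbers).
baseline = cmos_power(alpha=0.2, c_farads=1e-9, v_volts=0.9, f_hz=100e6,
                      p_static_watts=20e-6)
# Pull several levers at once: less switching, lower clock, lower supply.
tuned = cmos_power(alpha=0.05, c_farads=1e-9, v_volts=0.75, f_hz=30e6,
                   p_static_watts=20e-6)
print(f"baseline: {baseline * 1e6:.0f} uW, tuned: {tuned * 1e6:.0f} uW")
```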
Moore’s law scaling has been the dominant driver for reducing both the voltage and capacitance factors through each generation of CMOS technology. We’ve watched typical core supplies drop from 1.1V to 0.9V, and on to even more aggressive 0.8V and 0.75V system supplies. On Syntiant’s side, we go a step further and typically under-drive the nominal supply voltage for a technology node, tapping into that quadratic voltage relationship. Normally this is accompanied by a reduction in the maximum speed of the circuit, but with the heavy parallelism in our ML circuits, the reduced clock speed does not limit maximum performance. We also have circuits that allow SRAM retention at lower voltages, so during the “quiet” periods between inferences, we further reduce leakage energy.
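As a rough illustration of that quadratic relationship: dropping the core supply from 0.9V to 0.75V cuts dynamic power to (0.75/0.9)², or about 69%, of its original value, roughly a 30% saving before touching frequency, capacitance or activity at all.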
We can go further in reducing capacitance with our ASIC physical design flows. However, it’s important to understand the Pareto curve that is inherent to logic synthesis, where it is easy to drive circuits too hard. A high-end processor company will push that curve to the maximum to win the GHz game, but we continuously look at the speed/area curves during synthesis and try to sit at the “knee of the curve” rather than push for maximum performance. In synthesis and the back-end flows, striving for maximum performance can easily double the size of your design, which doubles the capacitance as well and adds a static power penalty no matter your operating frequency.
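To make the “knee of the curve” idea concrete, here is a small sketch (the synthesis sweep numbers and the slope threshold are invented for illustration) that picks the last design point before area starts ballooning for each additional MHz:

```python
# Hypothetical synthesis sweep: (target clock in MHz, resulting area in kGE).
sweep = [(50, 100), (75, 105), (100, 115), (125, 140), (150, 200), (175, 320)]

def knee(points, max_area_per_mhz=1.0):
    """Return the last point before area grows faster than the chosen threshold."""
    best = points[0]
    for (f0, a0), (f1, a1) in zip(points, points[1:]):
        slope = (a1 - a0) / (f1 - f0)  # extra kGE paid per extra MHz
        if slope > max_area_per_mhz:
            break
        best = (f1, a1)
    return best

print(knee(sweep))  # (125, 140): pushing much past this roughly doubles area and capacitance
```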
When it comes down to the core of our custom-designed neural network processors, my favorite factor is the activity factor. It is determined by how many bits toggle to accomplish the neural network math. We build custom at-memory architectures, so we can consume data right at the storage elements, which in turn reduces the activity factor on the logic cells. We can partition memories, so retrieving parameters in one part of the circuit doesn’t light up all the bit-lines in the memory. Compared to processors and DSPs, where so much energy goes to moving data around over traditional memory buses, we have an enormous activity-factor advantage.
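A toy model of the memory-partitioning point (the bit-line count and per-toggle energy are arbitrary, not NDP figures) might look like this:

```python
# Toy model: energy per parameter fetch scales with the bit-lines that toggle.
# Splitting one large SRAM into banks means only the addressed bank switches.
E_PER_BITLINE = 1.0  # arbitrary energy units per toggled bit-line

def fetch_energy(total_bitlines, banks):
    """Energy for one read when only the addressed bank's bit-lines are driven."""
    return E_PER_BITLINE * (total_bitlines / banks)

monolithic = fetch_energy(total_bitlines=4096, banks=1)
banked = fetch_energy(total_bitlines=4096, banks=16)
print(f"monolithic: {monolithic:.0f}, 16 banks: {banked:.0f} "
      f"({monolithic / banked:.0f}x fewer bit-line toggles per access)")
```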
Even deeper, we treat neural network layers as atomic units, allowing us to construct the optimal controllers to compute the math. Many of the alternative solutions in the neural network space tear the networks apart into millions of tiny operations, and as a result all the context around the processing is immediately lost: what is shared, what is reusable, what can be ignored? The flip-side of presenting a layer as an atomic unit is that our NDPs become much easier to program, and the relationship between model architecture, speed and power becomes very predictable. We are able to achieve extremely low power while letting machine learning scientists know exactly what is going to happen when their networks load onto our devices. There is no intermediate compiler or “fitting” step, and no going back to the ML team to change the architecture to get the power down.
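Because a layer is scheduled as one unit, latency and energy can be estimated directly from its shape. Here is a minimal sketch of that kind of predictability, using assumed parallelism and per-MAC energy numbers (not NDP specifications):

```python
# Assumed, illustrative constants -- not NDP120 specifications.
MACS_PER_CYCLE = 512        # parallel multiply-accumulates per clock
ENERGY_PER_MAC_PJ = 0.5     # energy per multiply-accumulate, in picojoules

def dense_layer_estimate(in_features, out_features, clock_hz):
    """Rough latency/energy estimate for one fully-connected layer."""
    macs = in_features * out_features
    cycles = -(-macs // MACS_PER_CYCLE)           # ceiling division
    latency_ms = cycles / clock_hz * 1e3
    energy_uj = macs * ENERGY_PER_MAC_PJ * 1e-6   # pJ -> uJ
    return latency_ms, energy_uj

latency_ms, energy_uj = dense_layer_estimate(512, 256, clock_hz=10e6)
print(f"~{latency_ms:.2f} ms and ~{energy_uj:.2f} uJ for this layer")
```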
To sum it up, what is Syntiant’s secret sauce for low-power processing? It turns out it really is a comprehensive understanding of low-power chip design, applying all the lessons we have learned across circuits, architecture, systems software and ML training. We need to be fanatical about reducing power at every point. And it shows in the outcome.
Dave Garrett, Ph.D.
Vice President of Hardware, Syntiant