This blog post was originally published at BrainChip’s website. It is reprinted here with the permission of BrainChip.
In the rapidly evolving field of artificial intelligence, edge computing has become increasingly vital for deploying intelligent systems in real-world environments where power, latency, and bandwidth are limited, so neural network models need to run efficiently. For most of the AI field, that means developing new model architectures or training techniques to achieve sufficient accuracy from ever smaller models. As hardware providers, however, we have an extra strategy to leverage: finding more efficient ways to carry out the computations themselves. Crucially, standard CNNs naturally show high levels of sparsity, particularly in their activations, meaning many of the values involved in computation are zero. This creates the opportunity for hardware designs like Akida’s, which can skip unnecessary operations and dramatically improve efficiency without sacrificing model fidelity.
BrainChip’s Akida is principally an accelerator for Convolutional Neural Networks (CNNs). It includes many optimizations, from co-localization of memory and compute to low-bit-width quantization, enabling int8 or even lower precision calculations throughout. The single greatest differentiating factor, however, is the ability to exploit “sparsity”: that is, to avoid doing the computation altogether when possible. Where there are zeros among the values to be multiplied (Figure 1), the multiplication is simply not scheduled. Most accelerators, built around tightly coupled multiplier-accumulator arrays, must perform the calculation regardless of the fact that it will produce a zero output. While exploiting sparsity can yield significant efficiency gains, it does require additional hardware logic to detect and skip zero-valued operations. This introduces some complexity in design and verification, though the trade-off is often favorable in power-constrained environments.
Of course, for that strategy to work, there must be a significant amount of sparsity in the CNN model being inferenced. Fortunately, that’s not something the average user needs to be concerned with: as we’ll see, standard models already include very significant levels of sparsity. For engineers, researchers, and developers working on AI at the edge, understanding how Akida leverages sparsity can be a route to achieving even greater efficiency with your models. By embracing sparsity, Akida not only improves computational efficiency but also opens new pathways for designing intelligent systems that operate where conventional AI solutions fall short.
Figure 1: Left – Conventional hardware for running deep models processes all values. Right – Akida hardware skips those operations that involve multiplying by zero (i.e., it takes advantage of event sparsity), leading to better efficiency: lower latency with no impact on accuracy, because the output of the calculation is unchanged.
What is Sparsity?
- Sparsity, in the context of neural networks, refers to the presence of zero-valued elements in the data or parameters involved in computation, specifically in inputs, activations, or weights. Rather than being a single metric, sparsity manifests in different forms depending on where these zeros occur.
- Activation sparsity: This is the type of sparsity that Akida principally exploits, and precisely what we want to focus on here. Its naming is a little counterintuitive, so let’s take a moment to unwrap that: for every layer of a neural network model after the first, the input to that layer is the output of the preceding layer. Those outputs have historically been called activations, hence “activation sparsity”. As we’ll see, activation sparsity is often high in models, typically because the commonly used “ReLU” activation function rectifies all negative values to zero.
- Input sparsity: The inputs to the first layer (and thus, to the model) must be considered a special case. For a typical model receiving, say, RGB image inputs, we do not expect there to be any appreciable sparsity at all. However, it’s easy to produce preprocessing schemes (e.g. difference of frames in video input) that can generate significant input sparsity. Equally, there are specialized sensors (e.g. dynamic vision sensors or “event-based” cameras) specifically designed to generate sparse input signals. One key takeaway here is that Akida does not need this kind of input sparsity to do well! The activation sparsity described above will naturally arise between layers in the model regardless of the input to the first layer. That said, Akida is naturally placed to exploit the input sparsity from those specialized sensors, so we will return to consider those cases.
- Weight sparsity: This refers to zeros in the model weights. These arise naturally during training and are typically “unstructured” (they have no spatial pattern within the weight matrices), although various schemes exist to establish structured sparsity patterns (e.g. 2:4 sparsity): those can be exploited more efficiently than unstructured weight sparsity but nonetheless require dedicated hardware. Programmers can take advantage of weight sparsity with offline preprocessing of the model, by pruning weights that are near zero and compressing the network. To be clear, this is often what other manufacturers mean when they say “sparsity”, and it is not the type exploited by Akida hardware.
Figure 2: Maps of randomly generated inputs at different levels of sparsity. Black pixels indicate zeros. Inputs like these are used to measure the processing time of a single-layer CNN at controlled sparsity levels; see text.
The main distinction is that input sparsity comes from the data itself, weight sparsity is a built-in characteristic of the model, and activation sparsity (a.k.a. event sparsity) is a dynamic trait of the model’s activations, varying with the input from one sample to another.
Here, our interest is in exploiting sparsity to reduce the computation required through the model. Since the key computation in neural network layers boils down to a series of multiplications between the inputs and weights of a layer, those are going to be the sparsity values we care about.
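To make these definitions concrete, here is a minimal sketch (using TensorFlow/Keras; the toy layer and random input are purely illustrative, not taken from any of our models) of how each kind of sparsity can be measured as a simple proportion of zeros:

```python
import numpy as np
import tensorflow as tf

def sparsity(t):
    """Proportion of zero-valued elements in a tensor."""
    return float(np.mean(np.asarray(t) == 0))

# Toy example: a single Conv2D + ReLU layer applied to a dense random "image".
layer = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")
x = np.random.rand(1, 32, 32, 3).astype("float32")  # dense, RGB-like input

y = layer(x)               # the layer's output, i.e. its activations
w = layer.kernel.numpy()   # the layer's weights

print(f"input sparsity:      {sparsity(x):.1%}")  # ~0%: natural inputs are dense
print(f"weight sparsity:     {sparsity(w):.1%}")  # ~0%: no pruning applied here
print(f"activation sparsity: {sparsity(y):.1%}")  # ~50%: ReLU zeroes negative values
```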
Exploiting Sparsity with Akida: Event-based Convolution
If exploiting sparsity were trivial, then everyone would do it. It comes with a cost: standard efficient implementations of the 2D convolution operations common in CNNs run via vector and matrix instructions, and you simply cannot take that approach if trying to exploit unstructured sparsity. Instead, Akida uses a neuromorphic-inspired “event-based convolution” approach: broadly, rather than multiplying the input by the weight kernel at each position of the output space, the algorithm iterates over the input values and projects the weighted kernel onto the output space only where the input is non-zero. This means computations are triggered only when needed, a strategy that aligns closely with neuromorphic principles, where processing mimics the brain’s efficiency by activating only in response to relevant stimuli. If there were no sparsity, this would not be the most efficient approach. The bet is that for real CNNs there is enough sparsity to make this worthwhile.
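To illustrate the principle, here is a plain-Python sketch of the general idea (not BrainChip’s actual hardware implementation), assuming stride 1, ‘same’ padding, and an odd kernel size: the convolution is written as a scatter over the non-zero inputs only.

```python
import numpy as np

def event_based_conv2d(x, w):
    """Sketch of an input-driven ("event-based") 2-D convolution.

    x: input activations, shape (H, W, C_in)
    w: weights, shape (K, K, C_in, C_out); odd K, 'same' padding, stride 1.

    Rather than gathering an input patch for every output position, we iterate
    over the non-zero input values only and scatter each one's weighted
    contribution to the output positions it influences. Zero inputs trigger
    no computation at all.
    """
    H, W, _ = x.shape
    K, _, _, C_out = w.shape
    pad = K // 2
    out = np.zeros((H, W, C_out), dtype=np.float32)

    # The "events": coordinates of all non-zero input values.
    for y, xc, c in zip(*np.nonzero(x)):
        v = x[y, xc, c]
        for ky in range(K):
            for kx in range(K):
                oy, ox = y - ky + pad, xc - kx + pad
                if 0 <= oy < H and 0 <= ox < W:
                    out[oy, ox, :] += v * w[ky, kx, c, :]
    return out
```

With a dense input this performs the same arithmetic as a standard convolution, just in a different order; with a sparse input, every zero simply generates no work.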
How does that work out in practice? We can directly measure the behavior of individual layers in hardware. The following results are for a very typical CNN layer: a standard convolution with kernel size 3×3, input height and width of 32, 64 input channels, and 64 filters, mapped to a single Neural Processing node (NP) on Akida 2 hardware. Processing duration was measured using a set of artificially generated inputs with sparsity at controlled levels, much like those in Figure 2.
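Inputs like those in Figure 2 are easy to generate. Here is a minimal sketch (the int8 value range and the fixed seed are our own assumptions for illustration, not the exact test protocol used for the measurements below):

```python
import numpy as np

def random_sparse_input(shape, sparsity, seed=0):
    """Generate a random input tensor with a controlled proportion of zeros."""
    rng = np.random.default_rng(seed)
    values = rng.integers(1, 128, size=shape, dtype=np.int8)  # non-zero int8 values
    values[rng.random(size=shape) < sparsity] = 0             # zero out the chosen fraction
    return values

# e.g. a 32x32, 64-channel input at 70% sparsity, matching the layer described above
x = random_sparse_input((32, 32, 64), sparsity=0.70)
print(f"measured sparsity: {np.mean(x == 0):.1%}")
```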
The first thing to note about the results is the almost perfectly linear decrease in processing duration as sparsity increases: every zero in the input really does get skipped (Figure 3). The second point is that, when sparsity reaches 100%, the processing time for the layer is very close to zero. That’s important: it means that processing is really dominated by the inputs to be processed; there is no large input-independent overhead for the layer that could otherwise limit the benefit of exploiting sparsity. One final point is that this is a scatter plot: the measurements at each sparsity level are not averaged but show 10 repeats that are perfectly superimposed; the processing time is extremely repeatable.
Figure 3: Inference duration vs incoming activation sparsity for a convolutional layer (input height and width 32, input channels and output filters 64), mapped to a single Neural Processing node (NP) on Akida 2 hardware. Duration is reported in ticks of the hardware clock (in this case, an FPGA running at 50 MHz). Note that the scatter plot does not show averages at each sparsity level; rather, measurements from 10 repeats are shown and superimpose perfectly. Inference duration shows a very linear decrease with increasing sparsity. As sparsity approaches 100%, processing duration for the layer approaches zero (that is, there is almost no input-independent overhead).
There are some subtle advantages to this approach that should be mentioned. The algorithm is extremely scalable: unlike some other approaches, it does not require large layer or batch sizes to be optimal (actually, the algorithm runs natively at batch size 1, a distinct advantage in the edge setting).
Sparsity in Actual Models
It should be clear by now that Akida can be extremely efficient, but that it needs models to have significant activation sparsity. Fortunately, that is not a problem: it turns out that models are naturally sparse. If there is just one thing to take away from this blog, it should be this: standard CNNs naturally show high levels of sparsity.
To make the point, we’ll turn to that most iconic of models: ResNet50, processing standard images from the ImageNet dataset (we used the pretrained version of ResNet50 provided via tensorflow.keras.applications). Here, we measured the sparsity (very simply, the proportion of zeros) in the outputs from each layer, averaged over 1000 input images. See Figure 4 for the results.
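A measurement along these lines can be sketched as follows. This is an illustrative reconstruction rather than our exact script: it selects layers by ReLU/Activation type and uses a small stand-in batch of random images in place of the 1000 preprocessed ImageNet samples.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input

# Pretrained ResNet50, as provided via tensorflow.keras.applications.
model = ResNet50(weights="imagenet")

# Probe model exposing the output of every ReLU activation layer.
relu_layers = [l for l in model.layers
               if isinstance(l, (tf.keras.layers.Activation, tf.keras.layers.ReLU))]
probe = tf.keras.Model(inputs=model.input,
                       outputs=[l.output for l in relu_layers])

# Stand-in batch; in practice this would be ~1000 preprocessed ImageNet images.
images = preprocess_input(np.random.uniform(0, 255, (8, 224, 224, 3)).astype("float32"))

activations = probe(images, training=False)
for layer, act in zip(relu_layers, activations):
    print(f"{layer.name:28s} sparsity = {np.mean(np.asarray(act) == 0):.1%}")
```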
The results are impressive: except for a very few layers, the model shows around 50% sparsity from even the first blocks. That increases steadily, such that by the final stage, layers are approaching 80% sparsity!
Figure 4. Mean activation sparsity per layer for ResNet50 processing natural images. Each bar shows the average sparsity (proportion of zeros) in the output of a single layer, averaged over 1000 images. Layer names are indicated below; for legibility, only the final layer of each block is labeled.
How does that happen? The initial contribution comes from the commonly used Rectified Linear Unit (ReLU) activation function, applied to the output of each layer. As its name suggests, it rectifies its inputs (sets all negative values to zero). If the values entering a ReLU were normally distributed with zero mean (so half of them negative), we would expect to see 50% sparsity coming out of the ReLU. In practice, in the early layers of typical CNNs, we see slightly lower values than that: there are few filters in early layers, and they encode very general, low-level features with high spatial precision. In subsequent layers, very high sparsity levels arise naturally: there are many more filters, and they encode much higher-level features, only a minority of which will be present in any given image.
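That reasoning is easy to check numerically with a quick sketch using synthetic zero-mean Gaussian pre-activations:

```python
import numpy as np

# One million zero-mean Gaussian "pre-activations": roughly half are negative,
# so ReLU (max(x, 0)) maps roughly half of them to exactly zero.
x = np.random.normal(0.0, 1.0, size=1_000_000)
y = np.maximum(x, 0.0)
print(f"sparsity after ReLU: {np.mean(y == 0):.1%}")  # ~50.0%
```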
ResNet50, while a classic, is not optimized to be efficient and thus is not a good target for Akida. What about some more edge-appropriate models? The following plot (Figure 5) shows sparsity in a selection of models from our Model Zoo (indicated by model name / dataset).
Figure 5. Activation sparsity per layer for a selection of models from the BrainChip Model Zoo (Ready-to-Use Akida Neural Network Models – BrainChip). Detailed code and examples are available in the Developer Hub. The selection spans different tasks, such as object classification, object detection, and keyword spotting, to show how activation sparsity varies with both the task and the dataset. This diversity helps illustrate the generality of sparsity across real-world applications.

The pattern of sparsity described above is evident repeatedly here. In nearly every case, early network layers show lower sparsity, though always over 20%, typically rising to over 50% within a few layers. By the later network layers, sparsity is high, often greater than 80%. The notable exception is CenterNet/VOC, an object detection model with an “hourglass” shape: in its later layers, with a decrease in the number of filters and a requirement for precise spatial information to resolve the task, sparsity falls again.
The final example, Akidanet/Visual Wake Word, is of note: sparsity is remarkably high in this case. This is the only case in which extra measures were taken during training to increase activation sparsity, achieved by adding “regularization” to the training loss function to encourage the model to learn with reduced activation values. The approach is highly effective. You can read more about this in the educational materials on our Developer Hub education tab.
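As a rough illustration of the idea (not the exact AkidaNet training recipe, and with a hypothetical regularization coefficient), Keras lets you attach an L1 activity regularizer to a layer so that large activations are penalized in the training loss, pushing more of them toward zero:

```python
import tensorflow as tf

# L1 activity regularization penalizes the magnitude of a layer's outputs,
# encouraging the model to learn smaller -- and more often zero -- activations.
layer = tf.keras.layers.Conv2D(
    64, 3, padding="same", activation="relu",
    activity_regularizer=tf.keras.regularizers.l1(1e-5),  # hypothetical coefficient
)
```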
Takeaways
As AI matures, it increasingly penetrates our daily lives. For that to happen, AI models must be able to run on edge devices, which means they need to be small, fast, and low-power. At BrainChip, we are targeting a regime we call extreme low power, in which models run on microwatts to milliwatts. Our approach combines purpose-built hardware and software. Sparsity sits at the core of our technology and is inherent in neural networks. It is here to stay!
To see how sparsity is applied in a real-world scenario, please see our blog post on Akida in Space, where we show how Akida is used to optimize a Satellite Workflow. Please also visit our website for general Use Cases of the models we build at BrainChip.
Authors
Doug McLelland holds a doctorate in Computational Neuroscience from the University of Oxford, which he received in 2006. That was followed by two post-doctoral positions, first at Oxford and then, from 2011, at the University of Toulouse, studying mechanisms of visual processing and attention. In 2017, he joined BrainChip, where he has focused on making sure that popular models can make the most of Akida’s hardware advantages.
Ali Kayyam is a Principal Research Scientist at BrainChip, with a Ph.D. in Computational Neuroscience from the Institute for Studies in Fundamental Sciences (IPM) in Tehran, and earlier degrees in Computer Engineering from the Petroleum University of Technology and Shiraz University. He has held academic and research roles at the University of Southern California, University of Wisconsin–Milwaukee, and University of Central Florida. His work focuses on computer vision, machine learning, and neuroscience, particularly in visual attention, active learning, neural networks, and biologically inspired vision models.