This blog post was originally published at Arm’s website. It is reprinted here with the permission of Arm.

Enabling secure and ubiquitous Artificial Intelligence (AI) is a key priority for the Arm architecture. The potential for AI and machine learning (ML) is clear, with new use cases and benefits emerging almost daily – but alongside this, computational requirements for AI have been growing at an exponential rate and require new hardware and software innovation to continue to balance memory, compute efficiency and bandwidth. The training of Neural Networks (NNs) is critical to the continued advancement of AI capabilities, and today marks an exciting step in this evolution with Arm, Intel and NVIDIA jointly publishing a whitepaper on a new 8-bit floating point specification, ‘FP8’.

FP8 is an interchange format that will allow software ecosystems to share NN models easily, and the collaboration between Arm, Intel and NVIDIA to support this one standard is significant. It means models developed on one platform can run on other platforms without the overhead of converting vast amounts of model data between formats, while keeping task loss to a minimum. FP8 minimizes deviations from existing IEEE floating-point formats, allowing developers to leverage existing implementations, accelerate adoption across platforms and improve their productivity.

Adopting reduced-precision floating-point formats brings a number of benefits. Until a few years ago, training of Neural Networks was computed mostly with IEEE standard 32-bit floating-point numbers. Larger networks with more and more layers were found to be progressively more successful at NN tasks, but in certain applications this success came with an ultimately unmanageable increase in memory footprint, power consumption, and compute resources. It became imperative to reduce the size of the data elements (activations, weights, gradients) from 32 bits, and so the industry started using 16-bit formats, such as Bfloat16 and IEEE FP16. As the number of diverse applications requiring greater accuracy grows, Neural Networks are once again facing challenges around memory footprint, power consumption and compute resources. As a result, there is a rising demand today for a novel and simple 8-bit floating-point representation (alongside Bfloat16 and IEEE FP32) to enable even greater NN efficiency.
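To make the footprint argument concrete, here is a small sketch of the storage arithmetic for the same set of data elements at 32, 16 and 8 bits. The model size of one billion parameters is a hypothetical figure chosen for illustration, not one taken from this post, and the helper name `footprint_bytes` is ours.

```python
def footprint_bytes(num_elements: int, bits_per_element: int) -> int:
    """Bytes needed to store num_elements values at the given bit width."""
    return num_elements * bits_per_element // 8


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 1_000_000_000

for name, bits in [("FP32", 32), ("FP16/Bfloat16", 16), ("FP8", 8)]:
    gib = footprint_bytes(NUM_PARAMS, bits) / 2**30
    print(f"{name:>14}: {gib:.2f} GiB")
```

Each halving of the element width halves the storage, and with it the memory traffic needed to move the same tensor, which is where much of the bandwidth and power benefit comes from.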

Considerable experimentation has demonstrated that FP8 delivers model performance comparable to 16- and 32-bit precision for transformer-based AI models, as well as for computer vision models and Generative Adversarial Networks (GANs). While FP8 has a somewhat limited dynamic range because it has only a small number of exponent bits, this can be compensated for by software-managed, per-tensor scale factors, which adjust the representable range so that it better matches the values (weights, activations, gradients, etc.) being handled instead of relying solely on the FP8 format. In addition, a model can be trained and deployed in the same FP8 format, whereas fixed-point formats, notably int8, require carefully derived estimates based on statistics during the deployment phase in order to maintain accuracy, not to mention the calibration and conversion overhead.
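The per-tensor scaling idea can be sketched in a few lines of Python. This is an illustrative simulation under stated assumptions, not the method of any particular framework: it assumes the E4M3 encoding described in the FP8 whitepaper (3 mantissa bits, smallest normal exponent of -6, largest finite magnitude of 448), and it picks the scale from the tensor's maximum absolute value, which is one common choice rather than something this post prescribes. The function names are hypothetical.

```python
import math

# Assumed E4M3 parameters (from the FP8 whitepaper, not from this post).
E4M3_MAX = 448.0      # largest finite magnitude
E4M3_MIN_EXP = -6     # smallest normal exponent
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Simulate saturating round-to-nearest onto the E4M3 value grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)  # spacing between neighbouring values
    return sign * min(round(mag / step) * step, E4M3_MAX)


def quantize_per_tensor(values):
    """Scale the tensor so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in values)
    scale = E4M3_MAX / amax if amax > 0 else 1.0
    return [round_to_e4m3(v * scale) for v in values], scale


def dequantize(quantized, scale):
    """Undo the per-tensor scale to recover approximate original values."""
    return [q / scale for q in quantized]


if __name__ == "__main__":
    grads = [1e-4, -3e-4, 5e-5, 2e-4]  # small values, typical of gradients
    print("unscaled:", [round_to_e4m3(g) for g in grads])
    q, scale = quantize_per_tensor(grads)
    print("scaled  :", dequantize(q, scale))
```

Without the scale factor, values this small fall below the smallest E4M3 step and round to zero. With it, the tensor is shifted into the representable range and each value is recovered to within the precision that three mantissa bits allow. The scale itself is kept alongside the tensor in higher precision.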

Here at Arm, we are planning to add FP8 support to the Armv9 ISA as part of Armv9.5-A in 2023, and we are exploring the best way to integrate this support across all of our ML platforms. We firmly believe in the benefits of the industry coalescing around one 8-bit floating-point format, enabling developers to focus on innovation and differentiation where it really matters. We’re excited to see how FP8 advances AI development in the future.

Read more about FP8 in this new technical paper.

Neil Burgess
Senior Principal Design Engineer, Arm

Sangwon Ha
Staff Software Engineer, Arm
