Realizing the Benefits of GPU Compute for Real Applications with Mali GPUs


This article was originally published at ARM's Community site. It is reprinted here with the permission of ARM.

I have just returned from a fortnight spent hopping around Asia in support of a series of ARM hosted events we call the Multimedia Seminars, which took place in Seoul (27th June), Taipei (1st July) and Shenzhen (3rd July). Several hundred attendees joined in each location, a quality-dense cross-section from the local semiconductor and consumer industries, including many silicon vendors, OEMs, ODMs and ISVs. All of them were able to hear the great progress made by the ARM ecosystem partners who are developing the use of GPU Compute on ARM® Mali™ GPUs. In this blog I will try to summarise some of the highlights.

The Benefits of GPU Compute

In my presentation at the three sites I was able to illustrate the benefits of GPU Compute using Mali. This was an easy task as at the event many independent software vendors were demonstrating and promoting a vast selection of middleware ported and optimized for Mali.

But what are the benefits of GPU Compute?

  • Reduced power consumption. The architectural characteristics of the Mali-T600 and Mali-T700 series of GPUs enable computation of many parallel workloads much more efficiently than alternative processor solutions. GPU Compute accelerated applications can therefore benefit by consuming less energy, which translates into longer battery life.
  • Improved performance and user experience. Where raw performance in the target, the computation of heavy parallel workloads can also be significantly accelerated through the use of the GPU. This may translate in increased frame rate, or the ability to carry out more work in the same temporal/power budget, and can result in benefits such as improved UI responsiveness, more robust finger detection for gesture UIs in challenging lighting conditions, more accurate physics simulation, the ability to apply complex pre-/post-processing effects to multimedia on-device and in real-time. In essence: a significantly improved end-user experience.
  • Portability, programmability, flexibility. Heterogeneous compute APIs such as OpenCL™ and RenderScript, are designed for concurrency. They allow the developer to migrate some of the load from the CPU to the GPU or other accelerator, or to distribute it between processors in order to enable better load-balancing across system resources. For example a video codec may offload motion vector calculations to the GPU, enabling the CPU to operate with fewer cores and at lower frequencies, or to be available to compute additional tasks, for example video analytics.
  • Reduction of cost, risk and time to market. System designers may be influenced by various cost, flexibility and portability concerns when considering migrating functionality from dedicated hardware accelerators to software solutions which leverage the CPU/GPU subsystem. This approach is made viable and compelling due to the additional computational power provided by the GPU, now exposed through industry standard heterogeneous compute APIs.

Over the last few years ARM has worked very hard to create and develop a strong GPU Compute ecosystem. Collaborations were established across geographies, use-cases and applications, working with partners at all levels of the value chain. These partners were able to translate the benefits of GPU Compute into reality, to the ultimate avail of the end users, and were proudly showcasing their progress at the Multimedia Seminars.

Demonstrating Reduced Power Consumption

Software codec vendors such as Ittiam Systems have been demonstrating for some time HEVC and VP9 optimized ports that make use of GPU Compute on Mali-T600 series GPUs. A software solution leveraging the CPU+GPU compute subsystem can be useful for reducing TTM, reducing risk in the adoption of new standards, but most importantly, it can help to save power.

For the first time ever Ittiam Systems publically demonstrated how a software solution leveraging on the CPU+GPU compute subsystem is able to save power compared to a solution that does not make use of the GPU. Using an instrumented development board and power probing tools connected to a National Instruments DAQ unit they were able to demonstrate a typical reduction in power consumption of over 30% for 1080p30 video playback.

Mukund Srinivasan, Director and General Manager of the Consumer and Mobility Business Unit at Ittiam, said: "These paradigm shifts open a unique window of opportunity for focused media-related Intellectual Property providers like Ittiam Systems® to offer highly differentiated solutions that are not only compute efficient but also enhance user experience by way of a longer battery life, thanks to offloading significant compute to the ARM Mali GPU. The main hurdle to cross, in these innovative solutions, comes in the form of how to manage the enhanced compute demands of a complex codec standard on mobile devices, since it bears significantly more complex coding tools, for codecs like H.265 or VP9, as compared to VP8 or H.264. In order to gain the maximum efficiencies offered by the GPU technology, collaborating with a longstanding partner and a technology pioneer like ARM enabled us to generate original solutions to the complex problems posed in the design and implementation of consumer electronic systems. Working closely with ARM, we have been able to showcase not just a prototype or a demo, but a real working product delivering significant power savings, of the order of 30-35% improvement in energy efficiencies, when measured on the whole subsystem, using the ARM Mali GPU on board a silicon chip designed for use in the Mobile market"

Demonstrating reduced CPU Load

Another benefit of GPU Compute was illustrated by our partner Thundersoft, who have implemented a gender-based real-time facebeautifier application and improved its performance using RenderScript on a Mali-T600 GPU. The algorithm first detects the subject’s face and determines its gender, and based on the gender applies a chain of complex image processing filters that enhances the subject’s appearance. This include face whitening, skin tone softening, de-blemishing effects.

The algorithm is very computational intensive and very taxing on the CPU resource, which can at times result in poor responsiveness. Furthermore, a performance of 20+ FPS is required in order to deliver a good user experience and this is not achievable on the CPU alone. Fortunately, the heavy level of data parallelism, and large proportion of floating point and SIMD-friendly operations, make this use case great for GPU acceleration. Using RenderScript on Mali, Thundersoft were able to improve the performance from a poor 10fps to over 20fps, and at the same time reduce the CPU load from fluctuating between 70-100% to a consistent < 40%. Dynamic power reduction techniques are therefore able to disable and scale down operational points of the CPUs in order to save power.

Delivering improved Performance and User Experience

Image processing is proving to be a very fertile area for GPU Compute processing. In their keynote and technical speech, ArcSoft illustrated how they utilised Mali GPUs to improve many of their algorithms including JPEG, photo filters, beautification, Video HDR (NightHawk), and HEVC. A “nostalgia effect” filter, based on convolution, was optimized using OpenGL® ES. For a 1920×1080 camera preview image, the rendering time was reduced from 80ms down to 20ms using the Mali GPU. This means going from 12.5fps to 50fps.

Another application that benefited is ArcSoft’s implementation of a face beautifier. The camera stream was processed by the CPU, whilst colour-conversion and rendering was moved from the CPU to the GPU. Processing time for a 1920×1080 frame was therefore reduced from 30ms to just 10ms. In practice this meant that the face beautification frame rate was improved from 16fps to 26fps!

Another great example is JPEG processing. OpenCL was used for reconstructing inverse quantization and IDCT modules. Compared with ArcSoft’s Neon based JPEG decoder, performance in decoding 4000×3000 resolution images inproved 25%. Compared with OpenCL based open-source project JPEG-OpenCL, the efficiency of IDCT increased as much as 15 times.

Improved User Experience for Computer Vision Applications

You may have previously seen our partner eyeSight Technologies demonstrate how they have been able to improve the robustness and reliability of their gesture UI engine. Gesture UIs are particularly challenged when lighting conditions are poor, as this adds a lot of noise to the sensor data, and reduce accuracy of detection of gestures. As it happens, poorly lit situations is common when gesture UIs are typically used, such as inside a car, or in a living room. GPU Compute significantly increases the amount of useful computation that the gesture engine can carry out on the image within the same temporal and energy budget, this enables a significant improvement of reliability of gestures when lighting is poor.

eyeSight machine vision algorithms make extensive use of machine learning (using neural networks). The capability to learn from a vast amount of data, at a reasonable amount of time is a key element for success. However, the required
computational resources of neural networks, are beyond the capabilities of standard CPUs. eyeSight’s utilization of deep learning methods can greatly benefit from running on GPU processors.

eyeSight have used their extensive knowledge of machine vision technologies and the ARM Mali GPU Compute architecture to optimized their solutions using OpenCL on Mali.

Alva demonstrate video HDR and real-time video stabilization

In its presentation, Oscar Xiao, CEO of Alva Systems, discussed the value of heterogeneous computing for camera based applications, using two examples: real-time video stabilization and HDR photography. Alva optimized their solutions for Mali-400, Mali-450 and Mali-T628. Implementations of their algorithms are available using OpenGL ES and OpenCL APIs. Through the use of the GPU, image stabilization can be carried out comfortably for 1080p video streams at 30fps and above. Alva Systems have also implemented an advanced HDR solution that corrects image distortion (common in multi-frame processing algorithms), removes ghosting and carries out intelligent tone mapping (to enable a more realistic result). Of course all of these features increase the computational requirements of the algorithm. GPU Compute enables real-time computation. Alva were able to measure performance improvement of individual blocks of around 13-15x compared to the reference CPU implementation of the same algorithm.

In Conclusion

Modern compute APIs enable efficient and portable heterogeneous computing. This includes enabling the use of the best processor for the task, the ability to balance workloads across system resources and to offload heavy parallel computation to the GPU. GPU Compute with ARM Mali brings tangible advantages for real world applications, including reduced cost and time to market, improved performance and user experience, and improved energy efficiency (measured on consumer devices). These benefits are being enabled by our ecosystem partners who use GPU Compute on Mali for a variety of applications including: advanced imaging, computer vision, computational photography and media codecs.

Here you’ll find a wealth of practical technical insights and expert advice to help you bring AI and visual intelligence into your products without flying blind.



1646 N. California Blvd.,
Suite 360
Walnut Creek, CA 94596 USA

Phone: +1 (925) 954-1411
Scroll to Top